URL Shortener at Global Scale
Design a globally distributed URL shortener that handles billions of redirects per day with low latency, abuse controls, and reliable analytics.
What You'll Learn
- How to split create-link control plane from ultra-low-latency redirect data plane
- ID generation strategies for short-code uniqueness without central bottlenecks
- Cache hierarchy design: edge cache, regional cache, and metadata fallback
- Negative caching and tombstones to enforce deletions and abuse takedowns
- Asynchronous click analytics ingestion with backpressure and sampling
- Partitioning and indexing patterns for high-cardinality link metadata
- Safety patterns for custom alias collisions and idempotent create requests
- Cost modeling for compute, storage, and cross-region traffic
Quick Context
A URL shortener is mostly a read-heavy redirect service with strict latency requirements and untrusted input. The system must map short codes to long URLs, support custom aliases, enforce abuse policies, and propagate deletes quickly. Unlike toy designs, production systems cannot put analytics on the critical path, cannot rely on a single sequence generator, and cannot tolerate stale or deleted links being served for long periods. The main challenge is preserving low-latency global redirects while maintaining correctness, safety, and predictable operational costs.
- Serve 12,000,000,000 redirects/day with peak 180,000 req/s
- Keep redirect path p99 under 20ms globally and p95 under 12ms in-region
- Support 70M+ link creations/day with idempotent create semantics
- Guarantee short-code uniqueness and custom alias collision detection
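As a rough sanity check, the daily volumes listed above can be converted into average per-second rates. The sketch below assumes load is averaged over an 86,400-second day; the constant and function names are illustrative only.

```typescript
// Back-of-envelope conversion of the stated daily volumes into per-second rates.
// The inputs are the figures quoted in the list above; the names are hypothetical.
const SECONDS_PER_DAY = 86400;

function perSecond(perDay: number): number {
  return perDay / SECONDS_PER_DAY;
}

const avgRedirectsPerSec = perSecond(12000000000); // redirects/day
const avgCreatesPerSec = perSecond(70000000);      // link creations/day
const peakToAverage = 180000 / avgRedirectsPerSec; // stated peak vs. daily average
const readWriteRatio = avgRedirectsPerSec / avgCreatesPerSec;

console.log(Math.round(avgRedirectsPerSec)); // 138889
console.log(Math.round(avgCreatesPerSec));   // 810
console.log(peakToAverage.toFixed(2));       // 1.30
console.log(Math.round(readWriteRatio));     // 171
```

Under these figures the workload is roughly 171 redirects per link creation, which is why the redirect path and the create path are sized and optimized separately.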
Requirements
Create a short code for a long URL with optional expiration, tags, and tenant metadata.
Why it matters: Core feature for product value and API integration.
Allow users to request custom aliases with uniqueness checks and reserved-word policies.
Why it matters: Enterprise and marketing workflows depend on branded URLs.
Resolve short code and issue HTTP redirect quickly from edge locations.
Why it matters: Redirect path is the user-facing SLO-critical workflow.
Update destination URL, set expiry, disable, or hard-delete links safely.
Why it matters: Operations and compliance require lifecycle controls.
Capture click events with geo/device/referrer metadata and provide aggregates.
Why it matters: Analytics is a key monetization and customer retention feature.
Block malicious domains and enforce per-tenant rate limits and quota policies.
Why it matters: Prevents platform abuse and protects sender reputation.
Support idempotency keys so retries do not create duplicate short links.
Why it matters: Client retries are common under network turbulence.
Allow batch creation jobs for campaign pipelines with status tracking.
Why it matters: Large customers need operational throughput beyond single-item calls.
Architecture Evolution
Single-region API service with PostgreSQL and Redis cache. Suitable for 100-1,000 users and 2K-5K redirects/sec at $120-400/month.
What Changed & Why
- One API service handles create and redirect requests
- Redis cache for hot short-code lookups
- PostgreSQL stores link metadata and click counters
- Background worker computes simple daily analytics
- Basic rate limiting per API key
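The cache-in-front-of-database arrangement in this first stage is commonly implemented as a cache-aside lookup. The following is a minimal sketch of that pattern, not the course's implementation: in-memory `Map` objects stand in for Redis and PostgreSQL, and `RedirectCache`, `loadFromDb`, and `ttlMs` are hypothetical names.

```typescript
// Cache-aside lookup sketch. Map objects stand in for Redis and PostgreSQL;
// every name below is illustrative rather than taken from the original design.
interface CachedUrl {
  longUrl: string;
  cachedUntilMs: number; // when this cache entry (not the link itself) expires
}

class RedirectCache {
  private readonly entries = new Map<string, CachedUrl>();
  private readonly loadFromDb: (shortCode: string) => string | undefined;
  private readonly ttlMs: number;
  hits = 0;
  misses = 0;

  constructor(loadFromDb: (shortCode: string) => string | undefined, ttlMs: number) {
    this.loadFromDb = loadFromDb;
    this.ttlMs = ttlMs;
  }

  // nowMs is passed in so the behavior is deterministic and easy to test.
  resolve(shortCode: string, nowMs: number): string | undefined {
    const cached = this.entries.get(shortCode);
    if (cached !== undefined && cached.cachedUntilMs > nowMs) {
      this.hits++;
      return cached.longUrl;
    }
    // Miss or expired entry: read from the database and repopulate the cache.
    this.misses++;
    const longUrl = this.loadFromDb(shortCode);
    if (longUrl !== undefined) {
      this.entries.set(shortCode, { longUrl, cachedUntilMs: nowMs + this.ttlMs });
    }
    return longUrl;
  }
}

// Usage: a Map plays the role of the links table.
const linkTable = new Map<string, string>();
linkTable.set("b9Q4kT", "https://example.com/landing");
const redirectCache = new RedirectCache((code) => linkTable.get(code), 60000);
```

Unknown codes are not cached in this sketch, so repeated lookups for a nonexistent code always reach the database; negative caching addresses that case.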
Key Decisions
Snowflake-style 64-bit IDs encoded to Base62 short codes
Edge redirect service with regional cache-first lookup
Tombstone-first delete propagation with delayed hard purge
Async event stream with sampled click events and rollups
Transactional alias reservation with unique index + idempotency key
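To make the first decision concrete, here is a minimal sketch of a Snowflake-style generator. The bit layout (41-bit timestamp, 10-bit worker ID, 12-bit sequence) follows the commonly described Snowflake scheme and is an assumption here, as are the custom epoch and the `SnowflakeGenerator` name.

```typescript
// Snowflake-style ID sketch: timestamp | workerId | sequence packed into 64 bits.
// Layout and epoch are assumptions for illustration, not the course's exact values.
const WORKER_BITS = BigInt(10);
const SEQUENCE_BITS = BigInt(12);
const MAX_SEQUENCE = (BigInt(1) << SEQUENCE_BITS) - BigInt(1); // 4095
const CUSTOM_EPOCH_MS = BigInt(1704067200000); // 2024-01-01T00:00:00Z (hypothetical)

class SnowflakeGenerator {
  private readonly workerId: bigint;
  private lastTimestampMs = BigInt(-1);
  private sequence = BigInt(0);

  constructor(workerId: number) {
    if (!Number.isInteger(workerId) || workerId < 0 || workerId > 1023) {
      throw new RangeError("workerId must be an integer in 0..1023");
    }
    this.workerId = BigInt(workerId);
  }

  // nowMs is injectable so the generator can be exercised deterministically.
  next(nowMs: number = Date.now()): bigint {
    const timestamp = BigInt(nowMs);
    if (timestamp < this.lastTimestampMs) {
      throw new Error("clock moved backwards; refusing to generate IDs");
    }
    if (timestamp === this.lastTimestampMs) {
      if (this.sequence === MAX_SEQUENCE) {
        // 4096 IDs already issued in this millisecond; caller retries on the next tick.
        throw new Error("sequence exhausted for this millisecond");
      }
      this.sequence += BigInt(1);
    } else {
      this.sequence = BigInt(0);
    }
    this.lastTimestampMs = timestamp;
    return ((timestamp - CUSTOM_EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
      | (this.workerId << SEQUENCE_BITS)
      | this.sequence;
  }
}
```

Because each worker owns its ID space, no coordination is needed at generation time; the remaining operational concern is assigning unique worker IDs and handling clock regressions.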
API Design
The API separates redirect serving from management operations. The redirect endpoint is unauthenticated but protected by abuse, bot, and rate controls. Management APIs require OAuth2/JWT and tenant scopes.
Base URL
https://api.shortly.dev/v1
Authentication
Management endpoints require bearer tokens with tenant scopes (`links:write`, `links:read`, `analytics:read`). Bulk imports can use signed HMAC jobs for high-throughput ingestion.
Endpoints
/links
/links/bulk
/links/{shortCode}/analytics
/links/{shortCode}
Webhooks
Triggered when a link is disabled or hard-deleted.
Payload
{
  "event": "link.deleted",
  "tenantId": "tnt_42",
  "shortCode": "b9Q4kT",
  "deletedAt": "2026-02-08T11:07:22Z",
  "actor": "user_91"
}
Triggered when destination or behavior exceeds abuse thresholds.
Payload
{
  "event": "abuse.detected",
  "shortCode": "x2ab91",
  "riskScore": 0.93,
  "action": "auto_disabled",
  "reason": "malware_domain"
}
Code Samples
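The two webhook payloads shown above can be modeled on the receiving side as a discriminated union keyed on `event`. This is a sketch only: the field names come from the sample payloads, while the type names and the `parseWebhook` and `requireString` helpers are hypothetical.

```typescript
// Typed model of the two sample webhook payloads, with basic runtime validation.
interface LinkDeletedEvent {
  event: "link.deleted";
  tenantId: string;
  shortCode: string;
  deletedAt: string; // ISO 8601 timestamp
  actor: string;
}

interface AbuseDetectedEvent {
  event: "abuse.detected";
  shortCode: string;
  riskScore: number;
  action: string;
  reason: string;
}

type WebhookEvent = LinkDeletedEvent | AbuseDetectedEvent;

function requireString(obj: Record<string, unknown>, key: string): string {
  const value = obj[key];
  if (typeof value !== "string") throw new Error("missing string field: " + key);
  return value;
}

function requireNumber(obj: Record<string, unknown>, key: string): number {
  const value = obj[key];
  if (typeof value !== "number") throw new Error("missing numeric field: " + key);
  return value;
}

function parseWebhook(body: string): WebhookEvent {
  const data: unknown = JSON.parse(body);
  if (typeof data !== "object" || data === null) {
    throw new Error("webhook body must be a JSON object");
  }
  const obj = data as Record<string, unknown>;
  const eventName = requireString(obj, "event");
  if (eventName === "link.deleted") {
    return {
      event: "link.deleted",
      tenantId: requireString(obj, "tenantId"),
      shortCode: requireString(obj, "shortCode"),
      deletedAt: requireString(obj, "deletedAt"),
      actor: requireString(obj, "actor"),
    };
  }
  if (eventName === "abuse.detected") {
    return {
      event: "abuse.detected",
      shortCode: requireString(obj, "shortCode"),
      riskScore: requireNumber(obj, "riskScore"),
      action: requireString(obj, "action"),
      reason: requireString(obj, "reason"),
    };
  }
  throw new Error("unknown webhook event: " + eventName);
}
```

Narrowing on `event` lets a consumer handle each payload shape without unchecked property access.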
TypeScript utility converting monotonically increasing 64-bit IDs to compact Base62 codes.
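A minimal sketch of such a utility follows, assuming IDs are held as `bigint` values. The function names and the alphabet ordering (digits, then uppercase, then lowercase) are assumptions, not taken from the original sample.

```typescript
// Base62 encode/decode for non-negative 64-bit IDs held as bigint.
// Alphabet order is an assumption; any fixed 62-character ordering works.
const BASE62_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
const BASE62 = BigInt(62);
const ZERO = BigInt(0);

function encodeBase62(id: bigint): string {
  if (id < ZERO) throw new RangeError("id must be non-negative");
  if (id === ZERO) return "0";
  let remaining = id;
  let code = "";
  while (remaining > ZERO) {
    // Least significant digit first, prepended so the result reads big-endian.
    code = BASE62_ALPHABET.charAt(Number(remaining % BASE62)) + code;
    remaining = remaining / BASE62; // bigint division truncates
  }
  return code;
}

function decodeBase62(code: string): bigint {
  if (code.length === 0) throw new Error("code must not be empty");
  let value = ZERO;
  for (let i = 0; i < code.length; i++) {
    const digit = BASE62_ALPHABET.indexOf(code.charAt(i));
    if (digit < 0) throw new Error("invalid Base62 character: " + code.charAt(i));
    value = value * BASE62 + BigInt(digit);
  }
  return value;
}

console.log(encodeBase62(BigInt(61))); // z
console.log(encodeBase62(BigInt(62))); // 10
```

Any unsigned 64-bit value encodes to at most 11 Base62 characters, which fits within the `VARCHAR(16)` used for `short_code` in the schema.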
Data Model & Queries
-- URL Shortener Core Schema (PostgreSQL)
CREATE TABLE tenants (
    tenant_id UUID PRIMARY KEY,
    name VARCHAR(128) NOT NULL,
    plan VARCHAR(32) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE links (
    link_id BIGINT PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
    short_code VARCHAR(16) NOT NULL,
    long_url TEXT NOT NULL,
    long_url_hash BYTEA NOT NULL,
    canonical_domain VARCHAR(255) NOT NULL,
    status VARCHAR(24) NOT NULL CHECK (status IN ('active', 'disabled', 'deleted', 'expired')),
    is_custom_alias BOOLEAN NOT NULL DEFAULT FALSE,
    created_by UUID,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMPTZ,
    deleted_at TIMESTAMPTZ,
    metadata JSONB NOT NULL DEFAULT '{}',
    UNIQUE (tenant_id, short_code)
);

CREATE TABLE alias_claims (
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
    alias VARCHAR(64) NOT NULL,
    short_code VARCHAR(16) NOT NULL,
    claimed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (tenant_id, alias)
);

CREATE TABLE idempotency_keys (
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
    idempotency_key UUID NOT NULL,
    request_hash BYTEA NOT NULL,
    status VARCHAR(16) NOT NULL CHECK (status IN ('in_progress', 'completed', 'failed')),
    response_payload JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMPTZ,
    last_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (tenant_id, idempotency_key)
);

CREATE TABLE deletion_tombstones (
    tenant_id UUID NOT NULL REFERENCES tenants(tenant_id),
    short_code VARCHAR(16) NOT NULL,
    tombstone_version BIGINT NOT NULL,
    reason VARCHAR(64) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (tenant_id, short_code)
);

-- High-volume click stream landing table (range-partitioned by clicked_at; a monthly partition is shown)
CREATE TABLE click_events (
    event_id UUID NOT NULL,
    tenant_id UUID NOT NULL,
    short_code VARCHAR(16) NOT NULL,
    clicked_at TIMESTAMPTZ NOT NULL,
    country_code CHAR(2),
    device_type VARCHAR(24),
    referrer_domain VARCHAR(255),
    ip_hash BYTEA,
    user_agent_hash BYTEA,
    edge_region VARCHAR(24),
    -- The partition key must be part of the primary key on a partitioned table
    PRIMARY KEY (event_id, clicked_at)
) PARTITION BY RANGE (clicked_at);

CREATE TABLE click_events_2026_02 PARTITION OF click_events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

CREATE TABLE daily_link_metrics (
    tenant_id UUID NOT NULL,
    short_code VARCHAR(16) NOT NULL,
    metric_date DATE NOT NULL,
    country_code CHAR(2) NOT NULL DEFAULT 'ZZ',
    device_type VARCHAR(24) NOT NULL DEFAULT 'unknown',
    clicks BIGINT NOT NULL,
    uniques BIGINT,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (tenant_id, short_code, metric_date, country_code, device_type)
);

CREATE TABLE blocked_domains (
    domain VARCHAR(255) PRIMARY KEY,
    risk_score NUMERIC(4,3) NOT NULL,
    source VARCHAR(64) NOT NULL,
    active BOOLEAN NOT NULL DEFAULT TRUE,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_links_tenant_status_created
    ON links (tenant_id, status, created_at DESC);
CREATE INDEX idx_links_active_lookup
    ON links (tenant_id, short_code)
    WHERE status = 'active';
CREATE INDEX idx_links_expires_at
    ON links (expires_at)
    WHERE expires_at IS NOT NULL;
CREATE INDEX idx_alias_claims_short_code
    ON alias_claims (short_code);
CREATE INDEX idx_idempotency_last_seen
    ON idempotency_keys (tenant_id, last_seen_at DESC);
CREATE INDEX idx_tombstones_version
    ON deletion_tombstones (tombstone_version DESC);
CREATE INDEX idx_click_events_tenant_code_time
    ON click_events (tenant_id, short_code, clicked_at DESC);
CREATE INDEX idx_click_events_country_time
    ON click_events (country_code, clicked_at DESC);
CREATE INDEX idx_daily_metrics_date
    ON daily_link_metrics (metric_date DESC, tenant_id);

- Separates link metadata from click event firehose so redirect lookups stay lightweight.
- Uses dedicated tombstone table to enforce deletion semantics across eventually consistent caches.
- Stores idempotency keys explicitly to guarantee safe client retries.
- Keeps alias claims isolated to simplify uniqueness and conflict management.
- Partitions click events by time for retention, pruning, and low-cost scans.
- Pre-aggregates daily metrics for dashboard performance and cost control.
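The role of the `idempotency_keys` table in safe retries can be illustrated with an in-memory sketch. A `Map` stands in for the table, the long URL stands in for `request_hash`, and the policy of allowing a retry after a `failed` record is an assumption; `IdempotentCreator` and `CreateLinkResponse` are hypothetical names.

```typescript
// Idempotent create sketch: the Map mirrors the idempotency_keys table,
// keyed by (tenant_id, idempotency_key). Names and retry policy are assumptions.
interface CreateLinkResponse {
  shortCode: string;
  longUrl: string;
}

interface IdempotencyRecord {
  requestHash: string;
  status: "in_progress" | "completed" | "failed";
  responsePayload: CreateLinkResponse | null;
}

class IdempotentCreator {
  private readonly records = new Map<string, IdempotencyRecord>();
  private readonly createLink: (longUrl: string) => CreateLinkResponse;

  constructor(createLink: (longUrl: string) => CreateLinkResponse) {
    this.createLink = createLink;
  }

  create(tenantId: string, idempotencyKey: string, longUrl: string): CreateLinkResponse {
    const recordKey = tenantId + ":" + idempotencyKey;
    const requestHash = longUrl; // stand-in; a real system would hash the canonical request
    const existing = this.records.get(recordKey);
    if (existing !== undefined) {
      if (existing.requestHash !== requestHash) {
        throw new Error("idempotency key reused with a different request");
      }
      if (existing.status === "completed" && existing.responsePayload !== null) {
        return existing.responsePayload; // replay the stored response
      }
      if (existing.status === "in_progress") {
        throw new Error("original request still in progress; retry later");
      }
      // status === "failed": fall through and attempt the create again
    }
    const record: IdempotencyRecord = { requestHash, status: "in_progress", responsePayload: null };
    this.records.set(recordKey, record);
    try {
      const response = this.createLink(longUrl);
      record.status = "completed";
      record.responsePayload = response;
      return response;
    } catch (err) {
      record.status = "failed";
      throw err;
    }
  }
}
```

In PostgreSQL the same guarantee would rest on the `(tenant_id, idempotency_key)` primary key, so that two concurrent inserts of the same key cannot both succeed.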
Resolve active link for redirect path
EXPLAIN ANALYZE
SELECT long_url, expires_at
FROM links
WHERE tenant_id = $1
  AND short_code = $2
  AND status = 'active'
LIMIT 1;

Find links nearing expiration in next 24 hours
EXPLAIN ANALYZE
SELECT short_code, expires_at
FROM links
WHERE tenant_id = $1
  AND status = 'active'
  AND expires_at IS NOT NULL
  AND expires_at < NOW() + INTERVAL '24 hours'
ORDER BY expires_at ASC
LIMIT 2000;

Top links by clicks in past 7 days
EXPLAIN ANALYZE
SELECT short_code, SUM(clicks) AS total_clicks
FROM daily_link_metrics
WHERE tenant_id = $1
  AND metric_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY short_code
ORDER BY total_clicks DESC
LIMIT 100;

Identify unresolved tombstone propagation gaps
EXPLAIN ANALYZE
SELECT t.short_code, t.tombstone_version, t.created_at
FROM deletion_tombstones t
LEFT JOIN links l
  ON l.tenant_id = t.tenant_id
 AND l.short_code = t.short_code
WHERE t.tenant_id = $1
  AND (l.short_code IS NULL OR l.status <> 'deleted')
ORDER BY t.tombstone_version DESC
LIMIT 500;

idx_links_active_lookup (tenant_id, short_code) WHERE status='active': Fast redirect path lookup for active links without scanning inactive rows.
idx_links_tenant_status_created (tenant_id, status, created_at DESC): Supports management UI and latest-link listing per tenant.
idx_tombstones_version (tombstone_version DESC): Efficient replication/invalidation catch-up by version.
idx_click_events_tenant_code_time (tenant_id, short_code, clicked_at DESC): Optimizes time-range analytics for a specific link.
idx_daily_metrics_date (metric_date DESC, tenant_id): Accelerates dashboard windows and billing-period summaries.
Scaling & Bottlenecks
Primary Bottleneck
Regional cache miss spikes during campaign traffic and failover events
Mitigation
Increase hot-key replication and introduce adaptive prewarming for scheduled campaigns.
What You'd Change
- Maintain top-N global hot set in every region
- Prewarm campaign aliases based on scheduled launch windows
- Add cache miss circuit-breakers to protect metadata store
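The circuit-breaker item above can be sketched as a small state machine that stops forwarding cache misses to the metadata store after repeated failures. The thresholds, the `MissCircuitBreaker` name, and the simplification that the half-open state admits every request until a result is recorded are all assumptions.

```typescript
// Circuit breaker sketch guarding metadata-store reads on cache misses.
// closed -> open after N consecutive failures; open -> half_open after a cooldown.
type BreakerState = "closed" | "open" | "half_open";

class MissCircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Returns false while the breaker is open, so the caller can fail fast
  // (for example by serving a stale entry or an error) instead of hitting the store.
  allowRequest(nowMs: number): boolean {
    if (this.state === "open") {
      if (nowMs - this.openedAtMs >= this.cooldownMs) {
        this.state = "half_open"; // let a probe through
        return true;
      }
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures++;
    if (this.state === "half_open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtMs = nowMs;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `allowRequest` before each metadata read on a miss and report the outcome with `recordSuccess` or `recordFailure`.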
Primary Bottleneck
Metadata read replica lag and analytics consumer backlog
Mitigation
Split metadata read/write pools and scale stream partitions with tenant-aware balancing.
What You'd Change
- Promote read replica pools by tenant tier
- Increase stream partitions and isolate high-volume tenants
- Use pre-aggregated hourly rollups to reduce query scan volume
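One way to read "isolate high-volume tenants" is a partitioner that pins heavy tenants to dedicated partitions and hashes everyone else across a shared pool. The sketch below assumes that interpretation; the `PartitionPlan` shape and function names are hypothetical, and FNV-1a is used only as a simple, stable hash.

```typescript
// Tenant-aware stream partitioning sketch.
// Heavy tenants get a dedicated partition; all others are hashed over a shared range.
interface PartitionPlan {
  sharedPartitions: number;       // shared pool occupies indexes 0..sharedPartitions-1
  dedicated: Map<string, number>; // tenantId -> dedicated partition index
}

// 32-bit FNV-1a over UTF-16 code units (identical to byte-wise FNV-1a for ASCII input).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

function choosePartition(plan: PartitionPlan, tenantId: string, shortCode: string): number {
  if (plan.sharedPartitions <= 0) throw new RangeError("sharedPartitions must be positive");
  const pinned = plan.dedicated.get(tenantId);
  if (pinned !== undefined) return pinned;
  // Keying on tenant + code keeps one link's events ordered within a single partition.
  return fnv1a32(tenantId + "/" + shortCode) % plan.sharedPartitions;
}

// Usage: tenant "tnt_big" is isolated on partition 64, outside the shared pool.
const streamPlan: PartitionPlan = { sharedPartitions: 64, dedicated: new Map<string, number>() };
streamPlan.dedicated.set("tnt_big", 64);
```

Isolating a heavy tenant this way keeps its bursts from delaying other tenants' consumers, at the cost of managing the dedicated assignments explicitly.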
Primary Bottleneck
Cross-region replication egress and global invalidation consistency
Mitigation
Move to regional ownership cells with selective replication and compacted invalidation topics.
What You'd Change
- Adopt per-tenant region ownership and bounded-staleness failover
- Replicate control/tombstone state globally; keep raw events region-local
- Deploy consistency watchdog jobs for cache-vs-store parity
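A tombstone-first lookup and a parity watchdog, as listed above, might look like the following sketch. It assumes tombstones are consulted before the regional cache on every resolve and that a stale cache fill can arrive after a tombstone; `RegionalResolver` and its methods are hypothetical names.

```typescript
// Tombstone-first resolution sketch: a replicated tombstone always wins over
// a cached entry, and a watchdog reports cached codes that should be gone.
type Resolution =
  | { kind: "redirect"; longUrl: string }
  | { kind: "gone" }
  | { kind: "miss" };

class RegionalResolver {
  private readonly cache = new Map<string, string>();      // shortCode -> longUrl
  private readonly tombstones = new Map<string, number>(); // shortCode -> tombstone_version

  cachePut(shortCode: string, longUrl: string): void {
    this.cache.set(shortCode, longUrl);
  }

  applyTombstone(shortCode: string, tombstoneVersion: number): void {
    const known = this.tombstones.get(shortCode);
    if (known === undefined || tombstoneVersion > known) {
      this.tombstones.set(shortCode, tombstoneVersion);
    }
    this.cache.delete(shortCode);
  }

  resolve(shortCode: string): Resolution {
    if (this.tombstones.has(shortCode)) {
      return { kind: "gone" }; // checked first, so stale cache entries are never served
    }
    const longUrl = this.cache.get(shortCode);
    if (longUrl === undefined) {
      return { kind: "miss" };
    }
    return { kind: "redirect", longUrl: longUrl };
  }

  // Watchdog check: cached codes that also have a tombstone indicate a parity gap.
  findParityViolations(): string[] {
    const stale: string[] = [];
    this.cache.forEach((_longUrl, shortCode) => {
      if (this.tombstones.has(shortCode)) stale.push(shortCode);
    });
    return stale;
  }

  // Evicts every violating entry and returns how many were removed.
  repairParity(): number {
    const stale = this.findParityViolations();
    stale.forEach((shortCode) => this.cache.delete(shortCode));
    return stale.length;
  }
}
```

Checking tombstones first means correctness does not depend on cache TTLs; the watchdog then only has to clean up wasted cache space and surface replication gaps.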
Monitoring & Observability
Monitoring must separately track redirect SLO, metadata health, invalidation correctness, abuse pipeline latency, and analytics freshness. The key anti-pattern is relying only on API latency without observing cache parity and deletion propagation.
redirect_p99_ms (histogram): P99 end-to-end redirect latency from edge POP.
cache_hit_rate_pct (gauge): Regional cache hit rate for redirect lookups.
metadata_read_qps (gauge): Read QPS against metadata store from cache misses.
metadata_replica_lag_ms (histogram): Replication lag for metadata read replicas.
tombstone_replication_lag_seconds (gauge): Delay for deletion/takedown propagation across regions.
create_api_p99_ms (histogram): P99 latency of create-link control-plane API.
abuse_check_p95_ms (histogram): Latency for domain safety/abuse checks in create path.
click_stream_lag_seconds (gauge): Consumer lag of click analytics stream.
analytics_freshness_delay_seconds (gauge): Delay between click ingest and dashboard visibility.
alias_conflict_rate_pct (gauge): Custom alias conflict ratio over create attempts.
Redirect latency SLO is violated globally or in one major region.
Condition: redirect_p99_ms > 35 for 10m
Runbook: Inspect cache hit drop, DB miss amplification, and edge saturation. Activate hotset prewarm + traffic shedding if needed.
Cache hit rate collapse likely causing metadata overload.
Condition: cache_hit_rate_pct < 90 for 5m
Runbook: Check recent invalidation events, shard health, and failover state. Temporarily extend TTL for top stable keys.
Deleted links may still be resolvable in some regions.
Condition: tombstone_replication_lag_seconds > 10 for 3m
Runbook: Force global invalidate for affected prefixes, replay invalidation topic, and run parity check job.
Dashboard freshness degraded; customers may see delayed analytics.
Condition: analytics_freshness_delay_seconds > 300 for 15m
Runbook: Scale stream consumers, inspect partition skew, and shift heavy tenants to dedicated partitions.
Create path safety checks are slow and risk API degradation.
Condition: abuse_check_p95_ms > 120 for 10m
Runbook: Enable cached risk fallback, throttle bulk create traffic, and investigate upstream threat intel service.
Metadata replicas too stale for safe redirect miss handling.
Condition: metadata_replica_lag_ms > 2000 for 10m
Runbook: Reduce miss pressure, promote healthier replica, and tune replication bandwidth/QoS.
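The alert conditions above share the form "metric compared to a threshold for N minutes". A minimal evaluation sketch follows, assuming one sample per minute and that the alert fires only when every sample in the trailing window breaches; `AlertRule` and `isFiring` are illustrative names, not a real monitoring API.

```typescript
// Evaluates a "metric > threshold for N minutes" style rule over per-minute samples.
// Assumes exactly one sample per minute, oldest first.
interface AlertRule {
  metric: string;
  comparator: ">" | "<";
  threshold: number;
  forMinutes: number;
}

function isFiring(rule: AlertRule, perMinuteSamples: number[]): boolean {
  if (rule.forMinutes < 1) throw new RangeError("forMinutes must be at least 1");
  if (perMinuteSamples.length < rule.forMinutes) {
    return false; // not enough history to cover the full window
  }
  const recent = perMinuteSamples.slice(-rule.forMinutes);
  return recent.every((value) =>
    rule.comparator === ">" ? value > rule.threshold : value < rule.threshold,
  );
}

// The redirect latency and cache hit rate rules from the list above.
const redirectLatencyRule: AlertRule = {
  metric: "redirect_p99_ms", comparator: ">", threshold: 35, forMinutes: 10,
};
const cacheHitRule: AlertRule = {
  metric: "cache_hit_rate_pct", comparator: "<", threshold: 90, forMinutes: 5,
};
```

Requiring the whole window to breach trades detection speed for fewer pages on brief spikes, which is why the tombstone rule uses the shortest window in the list.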
Summary & Takeaways
1. The core challenge is keeping redirect data plane fast while control and analytics planes evolve independently.
2. Snowflake-style IDs + Base62 give high-throughput uniqueness without global DB contention.
3. Deletion correctness requires tombstones and replication parity checks, not just cache TTL expiration.
4. Analytics must be asynchronous and sampled to preserve redirect SLO and cost control.
5. Negative caching and hotset prewarming are practical defenses against miss storms.
6. At scale, network and analytics storage often dominate cost over raw compute.
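The sampling point in takeaway 4 can be illustrated with a sketch in which each kept click carries a weight of 1/rate, so downstream rollups can still estimate total clicks. The uniform sampling policy, the injectable random source, and the type names are assumptions made for illustration.

```typescript
// Weighted click sampling sketch: keep each event with probability sampleRate
// and attach weight = 1 / sampleRate so aggregates remain unbiased estimates.
interface ClickEvent {
  shortCode: string;
  countryCode: string;
}

interface SampledClick {
  shortCode: string;
  countryCode: string;
  weight: number;
}

function sampleClick(
  click: ClickEvent,
  sampleRate: number,
  random: () => number = Math.random, // injectable for deterministic tests
): SampledClick | null {
  if (!(sampleRate > 0 && sampleRate <= 1)) {
    throw new RangeError("sampleRate must be in (0, 1]");
  }
  if (random() >= sampleRate) {
    return null; // dropped; nothing is emitted to the analytics stream
  }
  return { shortCode: click.shortCode, countryCode: click.countryCode, weight: 1 / sampleRate };
}

// Rollups sum weights instead of counting rows.
function estimateClicks(sampled: SampledClick[]): number {
  return sampled.reduce((sum, s) => sum + s.weight, 0);
}
```

Because sampling happens off the redirect response path, the redirect latency is unaffected by how many events are kept; the sample rate only trades analytics precision against stream and storage cost.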
- Add tenant-level routing policies and geo-fencing for data residency requirements.
- Implement online abuse model with feature store and near-real-time scoring.
- Introduce active-active metadata conflict resolution with CRDT-inspired merge policy.
- Build dynamic cache TTL policy based on link popularity decay curves.
- Add campaign launch scheduler that prewarms caches based on expected traffic envelopes.