SystemDesign Pro

© 2026 SystemDesign Pro. All rights reserved.
Tags: advanced, rate-limiting, api, multi-region, redis, consistency

API Rate-Limiting as a Multi-Region Service

Design a globally consistent rate limiting service with low latency and multi-region enforcement.

What You'll Learn

  • Token bucket and sliding window log algorithms: implementation trade-offs, burst handling, and precision guarantees
  • Distributed counter synchronization using gossip and windowed CRDT consumed counters for bounded global quota drift
  • Edge-cached policy evaluation with push-based invalidation for sub-5ms decision latency
  • Hot-key mitigation through counter sharding, local pre-aggregation, and adaptive splitting
  • Lua scripting in Redis for atomic token bucket operations (check-and-decrement in a single round trip)
  • Graceful degradation strategies: local-only mode, fail-open vs fail-closed policies during network partitions
  • Multi-tier rate limiting: per-user, per-API-key, per-IP, per-organization, with hierarchical quota inheritance


Quick Context

Problem

A multi-region rate limiting service sits on the critical path of every API request, making sub-millisecond decisions about whether to allow or throttle traffic based on configurable quotas. The core tension is between global accuracy (enforcing exact quotas across all regions) and local latency (making decisions without cross-region round trips). At companies like Cloudflare, Stripe, and AWS API Gateway, rate limiters handle millions of decisions per second across dozens of geographic regions. The key challenges are: (1) choosing the right algorithm that balances burst tolerance with precision, (2) synchronizing distributed counters across regions without adding latency to the hot path, (3) handling hot keys where a single tenant generates disproportionate traffic, (4) gracefully degrading when regions lose connectivity while preventing quota abuse, and (5) supporting hierarchical rate limits (per-user within per-org within global) with efficient evaluation. Success metrics: p99 decision latency <5ms, global quota accuracy within 1%, zero false rejections during regional failover, and <100ms policy propagation for kill-switch scenarios.

Constraints & Assumptions

  • Handle 2 million rate checks per second globally with p99 decision latency under 5ms
  • Enforce global quotas across 8 regions with at most 1% overshoot during normal operation
  • Support 1 million active rate limit policies with hierarchical inheritance (user < org < global)
  • Propagate policy changes (especially kill switches) to all regions within 100ms

Key Numbers

  • Rate Checks: 2M/sec peak global check throughput across all regions
  • Decision Latency: p99 <5ms end-to-end, including policy lookup
  • Policies: 1M active rate limit policies across all tenants
  • Counters: 100M active counter keys in memory across all regions
  • Availability: 99.99% rate limiting service uptime SLA
  • Regions: 8 geographic deployment regions
  • Sync Lag: <2s cross-region counter synchronization delay
  • Accuracy: 99% global quota enforcement (1% overshoot tolerance)

Requirements

Multi-Algorithm Support

Support token bucket (with configurable burst), sliding window log (exact counting), sliding window counter (approximate but memory-efficient), and fixed window algorithms

Why it matters: Different use cases need different precision/performance trade-offs - API quotas need burst tolerance while abuse prevention needs exact counting

Hierarchical Rate Limits

Enforce rate limits at multiple levels: per-user, per-API-key, per-IP, per-organization, and global, with configurable inheritance and override rules

Why it matters: Enterprise customers need organization-level quotas that subdivide among their users while respecting global platform limits
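As an illustration of hierarchical enforcement, a request should be allowed only if every level in the chain (user, then org, then global) still has quota, and consumption should be all-or-nothing so a denied request does not burn tokens. This is a minimal sketch with hypothetical names and limits, not the service's actual API:

```python
# Hypothetical sketch: allow a request only if every hierarchy level
# (user -> org -> global) has remaining quota; consume from all or none.
from dataclasses import dataclass

@dataclass
class Bucket:
    limit: int
    remaining: int

def check_hierarchy(buckets: list[Bucket]) -> bool:
    """All-or-nothing consume across the whole policy chain."""
    if any(b.remaining <= 0 for b in buckets):
        return False          # deny without consuming anything
    for b in buckets:
        b.remaining -= 1
    return True

user = Bucket(limit=100, remaining=1)
org = Bucket(limit=1000, remaining=500)
glob = Bucket(limit=10000, remaining=9000)

assert check_hierarchy([user, org, glob]) is True   # consumes one from each level
assert user.remaining == 0
assert check_hierarchy([user, org, glob]) is False  # user level now exhausted
```

The all-or-nothing consume matters: without it, a request denied at the user level would still drain the org and global buckets.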

Global Quota Enforcement

Synchronize counters across all 8 regions so that a tenant consuming quota in US-East cannot exceed their global limit by also consuming in EU-West

Why it matters: Without cross-region sync, a tenant with a 1000/min limit could effectively get 8000/min by distributing requests across regions

Burst Control

Allow configurable burst sizes above the sustained rate, with automatic recovery based on token replenishment rate

Why it matters: Legitimate traffic is bursty - denying short spikes degrades user experience and causes unnecessary retry storms

Policy Management API

CRUD operations for rate limit policies with versioning, audit trail, approval workflows, and bulk import/export

Why it matters: Operations teams need to quickly adjust limits during incidents or onboard new customers with custom quotas

Response Headers

Return standard rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After) on every response

Why it matters: Clients need visibility into their quota usage to implement proper backoff and retry logic
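A sketch of building those headers from a quota snapshot (the helper name and signature are illustrative; `Retry-After` is standard HTTP, and the `X-RateLimit-*` names follow the convention listed above):

```python
# Illustrative helper: derive the standard rate-limit response headers
# from a quota snapshot. reset_epoch/now are Unix timestamps in seconds.
import math

def rate_limit_headers(limit: int, remaining: int,
                       reset_epoch: float, now: float) -> dict[str, str]:
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if remaining <= 0:
        # Tell throttled clients how long to back off before retrying.
        headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now)))
    return headers

h = rate_limit_headers(limit=1000, remaining=0,
                       reset_epoch=1705314660, now=1705314642)
assert h["Retry-After"] == "18"
```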

Override & Bypass

Support per-request bypass tokens for internal services and temporary limit overrides for incident response

Why it matters: Internal health checks and monitoring should not count against quotas; incident response may need temporary limit increases

Architecture Evolution

Single-region rate limiter with Redis backend. Token bucket implemented via Lua script for atomic check-and-decrement. Suitable for <100K checks/sec with single-region deployment.


Legend: Gateway, Checker, Counter Store, Policy, Audit

What Changed & Why

  • Single Redis instance with token bucket Lua script
  • Policy stored in PostgreSQL, cached in application memory
  • Fixed window counters as fallback for simple limits
  • Standard rate limit response headers on every response
  • Basic admin API for CRUD on policies

Key Decisions

Rate Limiting Algorithm: Token Bucket vs Sliding Window vs Fixed Window (4 alternatives)

Token bucket with configurable burst size and refill rate, with sliding window counter as secondary algorithm

Counter Store: Redis Cluster vs DynamoDB vs Custom In-Memory (3 alternatives)

Redis Cluster with hash-slot sharding and Lua scripting for atomic operations

Cross-Region Sync: Gossip Protocol vs Strong Consensus vs Async Replication (3 alternatives)

Gossip-based windowed CRDT consumed counters with adaptive sync frequency per counter velocity

Policy Distribution: Push-Based vs Pull-Based vs Hybrid (3 alternatives)

Push-based via Kafka with local edge cache and pull-based fallback

Hot-Key Mitigation: Counter Sharding vs Local Pre-Aggregation vs Dedicated Path (3 alternatives)

Automatic counter sharding with local pre-aggregation for detected hot keys
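The hot-key mitigation decision can be sketched as follows (an assumed design, not production code): a hot key's count is split across N sub-keys, so writes fan out across shards while reads sum them, trading a small read cost for relief on the single hot hash slot.

```python
# Sketch of counter sharding for hot keys: writes pick a random shard,
# reads aggregate all shards. In Redis these would be separate keys
# (rate:{key}:shard:{n}) landing on different hash slots.
import random
from collections import defaultdict

SHARDS = 4
counters: dict[str, int] = defaultdict(int)

def shard_key(key: str) -> str:
    # Random shard choice spreads write load evenly across sub-counters.
    return f"{key}:shard:{random.randrange(SHARDS)}"

def incr(key: str) -> None:
    counters[shard_key(key)] += 1

def read(key: str) -> int:
    # Reads pay a fan-in cost: sum every shard of the logical counter.
    return sum(counters[f"{key}:shard:{i}"] for i in range(SHARDS))

for _ in range(1000):
    incr("org:org_456:api")
assert read("org:org_456:api") == 1000
```

Local pre-aggregation then batches increments in-process before flushing to a shard, which is why the two techniques are usually combined.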

API Design

The Rate Limiting API provides endpoints for rate check decisions, policy management, and quota monitoring. The check endpoint is designed for ultra-low latency on the hot path, while management endpoints support comprehensive CRUD operations with audit trails.

Base URL

https://api.ratelimit.io/v1

Authentication

API Key + mTLS

Rate check endpoints use mTLS for zero-overhead authentication. Management endpoints use API keys with RBAC. Admin operations require MFA-verified tokens.

Endpoints

POST /check: Check rate limit (hot path)

Webhooks

EVENT
policy.updated

Fired when a rate limit policy is created, updated, or deleted

Payload

{
  "event": "policy.updated",
  "policy_id": "pol_xyz789",
  "action": "updated",
  "version": 2,
  "changed_by": "admin@example.com",
  "timestamp": "2024-01-15T10:30:00Z"
}
EVENT
quota.exhausted

Fired when a key exhausts its quota (useful for alerting customers)

Payload

{
  "event": "quota.exhausted",
  "key": "user:usr_abc123:api:/v1/orders",
  "policy_id": "pol_xyz789",
  "limit": 1000,
  "consumed": 1000,
  "reset_at": "2024-01-15T10:31:00Z",
  "timestamp": "2024-01-15T10:30:42Z"
}
EVENT
hotkey.detected

Fired when automatic hot-key detection triggers counter sharding

Payload

{
  "event": "hotkey.detected",
  "key": "org:org_456:api:*",
  "checks_per_second": 15000,
  "shard_count": 4,
  "region": "us-east-1",
  "timestamp": "2024-01-15T10:30:00Z"
}

Code Samples

Redis Lua: Atomic Token Bucket (Production)

Lua script implementing token bucket with atomic check-and-consume. Runs entirely within Redis for single round-trip latency.
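The script itself is not reproduced on this page. As a sketch of the logic it encapsulates (lazy refill plus check-and-consume, which the Lua version executes atomically inside Redis), here is a Python reference model; the capacity and refill numbers mirror the EVALSHA example shown later and are illustrative:

```python
class TokenBucket:
    """Reference model of the token-bucket logic. In production this runs
    as a Redis Lua script so refill + consume happen atomically server-side."""

    def __init__(self, capacity: float, refill_rate: float, now: float):
        self.capacity = capacity        # burst size in tokens
        self.refill_rate = refill_rate  # tokens replenished per second
        self.tokens = capacity
        self.last_refill = now

    def allow(self, cost: float, now: float) -> bool:
        # Lazy refill: credit tokens for elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# ~1000/min sustained (16.67 tokens/sec) with a burst of 50
b = TokenBucket(capacity=50, refill_rate=16.67, now=0.0)
assert all(b.allow(1, now=0.0) for _ in range(50))  # full burst allowed
assert b.allow(1, now=0.0) is False                 # bucket drained
assert b.allow(1, now=1.0) is True                  # ~16 tokens refilled
```

Storing only `tokens` and `last_refill` (a two-field Redis HASH) is what gives the algorithm O(1) memory per key.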

Data Model & Queries

Schema
SQL
-- Rate Limiting Platform Schema

-- Rate limit policies with hierarchical inheritance
CREATE TABLE rate_policies (
    policy_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    description TEXT,
    algorithm VARCHAR(50) NOT NULL DEFAULT 'token_bucket'
        CHECK (algorithm IN ('token_bucket', 'sliding_window_log', 'sliding_window_counter', 'fixed_window')),
    limit_value INT NOT NULL,
    window_seconds INT NOT NULL,
    burst_size INT NOT NULL DEFAULT 0,
    refill_rate DECIMAL(10,4), -- tokens per second for token bucket
    parent_policy_id UUID REFERENCES rate_policies(policy_id),
    hierarchy_level VARCHAR(20) NOT NULL DEFAULT 'user'
        CHECK (hierarchy_level IN ('user', 'api_key', 'organization', 'global')),
    targeting_rules JSONB DEFAULT '{}',
    is_active BOOLEAN NOT NULL DEFAULT true,
    version INT NOT NULL DEFAULT 1,
    created_by VARCHAR(255) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_policies_active ON rate_policies (is_active, hierarchy_level);
CREATE INDEX idx_policies_parent ON rate_policies (parent_policy_id) WHERE parent_policy_id IS NOT NULL;
CREATE INDEX idx_policies_targeting ON rate_policies USING GIN (targeting_rules);

-- Policy audit log for compliance and debugging
CREATE TABLE policy_audit_log (
    audit_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    policy_id UUID NOT NULL REFERENCES rate_policies(policy_id),
    action VARCHAR(20) NOT NULL, -- 'created', 'updated', 'deleted', 'activated', 'deactivated'
    old_values JSONB,
    new_values JSONB,
    changed_by VARCHAR(255) NOT NULL,
    change_reason TEXT,
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_audit_policy ON policy_audit_log (policy_id, created_at DESC);
CREATE INDEX idx_audit_time ON policy_audit_log (created_at DESC);

-- Counter sync state for cross-region reconciliation
CREATE TABLE counter_sync_state (
    counter_key TEXT NOT NULL,
    region VARCHAR(50) NOT NULL,
    local_count BIGINT NOT NULL DEFAULT 0,
    last_sync_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    sync_generation BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (counter_key, region)
);

-- Note: NOW() is not IMMUTABLE, so PostgreSQL rejects it in a partial-index
-- predicate; index last_sync_at and filter staleness at query time instead.
CREATE INDEX idx_sync_stale ON counter_sync_state (last_sync_at);

-- Hot key tracking for automatic sharding decisions
CREATE TABLE hot_key_events (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    counter_key TEXT NOT NULL,
    region VARCHAR(50) NOT NULL,
    checks_per_second INT NOT NULL,
    shard_count INT NOT NULL DEFAULT 1,
    action VARCHAR(20) NOT NULL, -- 'detected', 'sharded', 'unsharded'
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_hotkey_recent ON hot_key_events (counter_key, created_at DESC);

-- Sampled decision log for troubleshooting and quota analytics
CREATE TABLE rate_check_log (
    check_id BIGSERIAL PRIMARY KEY,
    checked_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    counter_key TEXT NOT NULL,
    policy_id UUID REFERENCES rate_policies(policy_id),
    region VARCHAR(50) NOT NULL,
    decision VARCHAR(16) NOT NULL, -- 'allowed', 'denied', 'degraded_allow'
    checks_per_second INT,
    observed_latency_ms DECIMAL(10,3)
);

CREATE INDEX idx_rate_check_denied_recent ON rate_check_log (checked_at DESC, counter_key)
WHERE decision = 'denied';

-- Redis key schemas (documented for operations)
-- rate:{level}:{id} => HASH {tokens: float, last_refill: timestamp}
-- rate:{level}:{id}:shard:{n} => HASH (sharded counter)
-- policy:{policy_id} => JSON (cached policy)
-- hotkey:{key} => INT (check count in current window)
Why This Schema
  • Hierarchical rate_policies with parent_policy_id enables user -> org -> global quota inheritance
  • JSONB targeting_rules with GIN index allows flexible policy matching without schema changes
  • Separate policy_audit_log provides an immutable compliance trail for every policy change
  • counter_sync_state tracks consumed counter state per region and generation for cross-region reconciliation
  • hot_key_events captures sharding decisions for post-incident analysis
  • Sampled rate_check_log supports short-horizon troubleshooting without storing every decision event
  • Redis HASH for token bucket state enables atomic Lua script operations on a single key
  • Auto-TTL on Redis keys prevents memory bloat from expired counters
Common Queries

Resolve policy hierarchy for a rate check

SQL
WITH RECURSIVE policy_chain AS (
    SELECT policy_id, parent_policy_id, limit_value, burst_size, hierarchy_level, 1 as depth
    FROM rate_policies WHERE policy_id = $1 AND is_active = true
    UNION ALL
    SELECT p.policy_id, p.parent_policy_id, p.limit_value, p.burst_size, p.hierarchy_level, pc.depth + 1
    FROM rate_policies p JOIN policy_chain pc ON p.policy_id = pc.parent_policy_id
    WHERE p.is_active = true
)
SELECT * FROM policy_chain ORDER BY depth DESC;

Atomic token bucket check via Redis Lua

Redis/Bash
EVALSHA <sha> 1 rate:user:usr_abc123 1000 16.67 <now_us> 1 50

Find stale counter sync state (potential drift)

SQL
SELECT counter_key, region,
    EXTRACT(EPOCH FROM NOW() - last_sync_at) as seconds_stale,
    local_count
FROM counter_sync_state
WHERE last_sync_at < NOW() - INTERVAL '10 seconds'
ORDER BY last_sync_at ASC
LIMIT 100;

Top over-limit keys in last hour

SQL
SELECT counter_key, COUNT(*) as deny_count,
    MAX(checks_per_second) as peak_rate
FROM rate_check_log
WHERE decision = 'denied' AND checked_at > NOW() - INTERVAL '1 hour'
GROUP BY counter_key
ORDER BY deny_count DESC
LIMIT 20;
Index Rationale
  • idx_policies_active: fast lookup of active policies by hierarchy level during rate checks
  • idx_policies_parent: traverse policy hierarchy for inherited limits (partial index excludes roots)
  • idx_policies_targeting: match requests to policies using JSONB containment queries
  • idx_audit_policy: retrieve audit history for a specific policy (compliance, debugging)
  • idx_sync_stale: find counter keys with stale sync state for reconciliation
  • idx_hotkey_recent: track hot-key sharding history for capacity planning
  • idx_rate_check_denied_recent: investigate top denied keys quickly during incidents

Scaling & Bottlenecks

Now

Primary Bottleneck

Hot-key counters cause single Redis node to saturate while other nodes are idle

Mitigation

Automatic counter sharding distributes hot keys across multiple Redis hash slots

What You'd Change

  • Deploy hot-key detector monitoring checks/sec per counter key
  • Implement automatic counter splitting when a key exceeds 1000 checks/sec
  • Add in-process L1 cache with 100ms TTL to absorb repeated checks
2× Scale

Primary Bottleneck

Cross-region counter drift allows up to 15% overshoot during sync windows

Mitigation

Reduce sync interval for high-velocity counters and implement generation-aware merge of per-region consumed deltas

What You'd Change

  • Migrate from fixed-interval gossip to adaptive sync based on counter velocity
  • Implement windowed consumed counters with generation IDs for conflict-free cross-region merging
  • Add priority sync channel for counters approaching their limits (>80% utilization)
  • Deploy regional counter aggregators to reduce fan-out overhead
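The windowed consumed counters mentioned above can be sketched as a G-counter-style merge (an assumed design, not the production protocol): each region gossips only what it consumed in the current window, tagged with a generation so stale updates are discarded, and the merged global view is the sum over regions.

```python
# Sketch: per-region consumed counters with generation-aware merge.
# view maps region -> (generation, consumed); higher generation wins,
# so gossip can arrive out of order without corrupting the merged view.

def merge(view: dict[str, tuple[int, int]],
          region: str, generation: int, consumed: int) -> None:
    """Accept a gossiped (generation, consumed) pair only if it is newer."""
    gen, _ = view.get(region, (-1, 0))
    if generation > gen:
        view[region] = (generation, consumed)

def global_consumed(view: dict[str, tuple[int, int]]) -> int:
    # Global consumption is the sum of each region's latest window count.
    return sum(c for _, c in view.values())

view: dict[str, tuple[int, int]] = {}
merge(view, "us-east-1", generation=7, consumed=420)
merge(view, "eu-west-1", generation=7, consumed=310)
merge(view, "us-east-1", generation=6, consumed=999)  # stale gossip: ignored
assert global_consumed(view) == 730
```

Because each region only ever reports its own consumption, the merge is commutative and idempotent, which is what bounds drift to the gossip interval rather than causing double counting.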
10× Scale

Primary Bottleneck

Counter store memory exceeds single-cluster capacity with 1B+ active keys

Mitigation

Implement tiered counter storage with hot/warm partitioning and probabilistic counting for long-tail

What You'd Change

  • Partition counters: hot (>10 checks/sec) in Redis, warm (<10/sec) in DynamoDB
  • Use Count-Min Sketch for approximate counting of long-tail keys
  • Deploy counter pre-allocation for predictable daily traffic patterns
  • Implement counter compaction: merge idle sub-counters back to a single key
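For the long tail, a Count-Min Sketch keeps approximate per-key counts in a fixed-size table with one-sided (overestimation-only) error, so exact counters are reserved for hot keys. A minimal illustrative version:

```python
# Minimal Count-Min Sketch: approximate counts for long-tail keys in
# O(width * depth) memory. Estimates may overcount (hash collisions)
# but never undercount, which is safe for rate limiting.
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 1024, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, key: str):
        # One salted hash per row gives depth independent column picks.
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), digest_size=8,
                                salt=row.to_bytes(2, "big")).digest()
            yield int.from_bytes(h, "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in enumerate(self._cols(key)):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Min across rows bounds the overcount from collisions.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(key)))

cms = CountMinSketch()
cms.add("user:usr_abc123", 42)
assert cms.estimate("user:usr_abc123") >= 42  # never underestimates
```

Overcounting is the right failure mode here: a collided long-tail key gets throttled slightly early rather than slipping past its quota.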

Failure Scenarios

Monitoring & Observability

Comprehensive monitoring should tie serving latency, quota correctness, sync freshness, and region-level resilience into one operational view with fast diagnosis paths.

Key Metrics
ratelimit_check_latency (histogram)

End-to-end latency of rate check decisions

Good: p99 < 5ms
Warning: p99 > 10ms
Critical: p99 > 50ms
ratelimit_decision_total (counter)

Total rate limit decisions by result (allowed, denied, degraded_allow, error)

Good: error_rate < 0.01%
Warning: error_rate > 0.1%
Critical: error_rate > 1%
policy_cache_hit_rate (gauge)

Percentage of policy lookups served from cache

Good: > 99%
Warning: < 95%
Critical: < 90%
counter_sync_lag (gauge)

Global p99 lag between local consumed counters and merged cross-region view

Good: < 1000ms
Warning: > 3000ms
Critical: > 10000ms
sync_lag_per_region (gauge)

Per-region lag to latest sync generation

Good: all regions < 1500ms
Warning: any region > 3000ms
Critical: any region > 8000ms
quota_overshoot_pct (gauge)

Percentage by which global consumption exceeds quota limit

Good: < 1%
Warning: > 5%
Critical: > 15%
hot_key_count (gauge)

Number of counter keys classified as hot (>1K checks/sec)

Good: < 100
Warning: > 500
Critical: > 2000
checks_per_region (gauge)

Request checks served per second in each region

Good: within 20% of expected traffic split
Warning: deviation >35%
Critical: deviation >60%
redis_health_per_region (gauge)

Composite region score from primary health, replica lag, and failover state

Good: > 0.99
Warning: 0.95-0.99
Critical: < 0.95
deny_ratio_by_policy (gauge)

Per-policy denied/total ratio used to detect misconfigured quotas

Good: baseline +/- 20%
Warning: spike > 2x baseline
Critical: spike > 4x baseline
Alert Rules
HighCheckLatency (warning)

Rate check latency exceeding SLO, may impact API response times

ratelimit_check_latency_p99 > 10ms for 2m

Runbook: Check Redis cluster health, policy cache hit rate, and counter shard balance

CounterSyncStale (critical)

Cross-region sync stale, global quotas may be inaccurate

counter_sync_lag > 5000ms for 1m

Runbook: Inspect sync transport, regional gossip fanout, and generation advancement

RegionalSyncSkew (critical)

One or more regions are lagging badly and can over-consume quota

max(sync_lag_per_region) > 8000ms for 3m

Runbook: Isolate lagging region, force full sync replay, temporarily tighten local quota budget

QuotaOvershoot (warning)

Global quota overshoot exceeding tolerance, possible abuse or sync issue

quota_overshoot_pct > 10% for 5m

Runbook: Trigger immediate counter reconciliation, check hot-key traffic, and escalate high-priority sync lane

RedisUnavailable (critical)

Redis counter store unavailable, rate limiter in degraded mode

redis_connection_errors > 10 in 1m

Runbook: Check Redis failover status, enable conservative in-process fallback counters, and watch degraded_allow ratio
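The RedisUnavailable runbook's fallback to conservative in-process counters implies a fail-open vs fail-closed choice per policy, as discussed elsewhere on this page. A hedged sketch of that decision logic (names are assumptions; decision strings match the schema's rate_check_log values):

```python
# Illustrative degraded-mode decision: when the counter store is down,
# fall back to the policy's configured failure mode. Abuse-prevention
# policies typically fail open; billing quotas typically fail closed.

def decide(store_ok: bool, within_limit: bool, fail_mode: str) -> str:
    if store_ok:
        return "allowed" if within_limit else "denied"
    # Store unreachable: no authoritative count is available.
    return "degraded_allow" if fail_mode == "open" else "denied"

assert decide(store_ok=False, within_limit=True, fail_mode="open") == "degraded_allow"
assert decide(store_ok=False, within_limit=True, fail_mode="closed") == "denied"
assert decide(store_ok=True, within_limit=False, fail_mode="open") == "denied"
```

Emitting a distinct "degraded_allow" decision (rather than plain "allowed") is what makes the degraded-mode ratio observable during the incident.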

PolicyCacheCollapse (warning)

Policy cache hit rate collapse likely causing DB thundering herd

policy_cache_hit_rate < 90 for 2m

Runbook: Enable stale-while-revalidate, inject cache jitter, and rate-limit bulk invalidations

DenySpikeByPolicy (warning)

One or more policies are denying unexpectedly high traffic

max(deny_ratio_by_policy) > 0.7 for 10m

Runbook: Compare current policy version to previous, review rollout intent, and simulate policy impact

Dashboard Layout

Serving SLO

ratelimit_check_latency, ratelimit_decision_total, policy_cache_hit_rate

Quota Correctness

quota_overshoot_pct, counter_sync_lag, deny_ratio_by_policy

Regional Reliability

checks_per_region, sync_lag_per_region, redis_health_per_region

Hot Key Pressure

hot_key_count, deny_ratio_by_policy, ratelimit_check_latency

Scale Calculator

Estimate quota-check throughput, Redis memory footprint, sync bandwidth, and monthly spend for multi-region rate limiting. Inputs separate key cardinality from active traffic so hot-key cost behavior is realistic.

Configuration
  • 2.00M checks/sec (range: 10.00K to 10.00M checks/sec)
  • 100.00M counters (range: 100.00K to 1.00B counters)
  • 1.80M keys/sec (range: 1.00K to 5.00M keys/sec)
  • 64.00 bytes (range: 32.00 to 256.00 bytes)
  • 8.00 regions (range: 1.00 to 20.00 regions)
  • 3.00% (range: 0.10% to 40.00%)
  • 70.00% (range: 0.00% to 100.00%)
Calculated Results
  • Daily Checks (total rate checks per day): 172.80B checks
  • Hot Keys/Second (active keys routed through high-priority synchronization): 54.00K keys/sec
  • Logical Counter Memory (counter state across all regions before replication): 47.68 GB
  • Physical Counter Memory (replicated Redis memory footprint): 95.37 GB
  • Sync Bandwidth (cross-region sync network usage for hot keys): 0.00 MB/s
  • Redis Nodes/Region (recommended Redis cluster nodes per region): 0
  • Rate Checker Instances (stateless checker nodes handling inbound traffic): 17
  • Decision Workers (async workers for policy refresh, sync merge, and background reconciliation): 12
  • Storage Cost (Redis memory fleet): $858
  • Compute Cost (Redis + checker fleet): $4,880
  • Network Cost (cross-region replication and sync transfer): $0
  • Estimated Monthly Cost (ElastiCache + compute + inter-region transfer): $5,738
  • Cost per Million Checks (effective blended serving cost at configured load): $0

Estimated Monthly Cost (AWS): $5,738/month
  • Compute Cost: $4,880
  • Storage Cost: $858
  • Network Cost: $0

* Estimates based on simplified AWS pricing. Actual costs may vary.
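The calculator's headline figures can be reproduced from its own inputs (2M checks/sec, 100M counters at 64 bytes each, 8 regions, replication factor 2). Note the memory figures are binary gibibytes even though labeled "GB", which is why 51.2 decimal GB displays as 47.68:

```python
# Sanity-checking the calculator's figures from its configured inputs.
checks_per_sec = 2_000_000
daily_checks = checks_per_sec * 86_400
assert daily_checks == 172_800_000_000            # 172.80B checks/day

logical_bytes = 100_000_000 * 64 * 8              # counters x bytes x regions
assert round(logical_bytes / 2**30, 2) == 47.68   # GiB, as displayed

physical_bytes = logical_bytes * 2                # replication factor 2
assert round(physical_bytes / 2**30, 2) == 95.37  # GiB, as displayed
```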

Cost & Capacity

Traffic Estimates
  • Peak Rate Checks (global peak across all regions): 2M/sec
  • Avg Rate Checks (average sustained throughput): 850K/sec
  • Daily Checks (total rate checks per day at average rate): 73B
  • Policy Lookups (matches check rate, 1:1): 2M/sec
  • Counter Syncs (cross-region counter deltas for hot keys): 12K/sec

Storage Estimates
  • Active Counters (logical): 100M counters x 64 bytes x 8 regions of regional local state = 51.2 GB
  • Counter Replica Footprint (logical counter state with replication factor 2): 102.4 GB
  • Policy Store (1M policies with targeting rules): 500 MB
  • Audit + Decision Logs (policy audit plus sampled denied/slow decisions): 120 GB/month
  • Sync State (counter sync metadata across regions): 2 GB

Test Your Understanding

Knowledge Check
Test your understanding of distributed rate limiting with real-world failure scenarios and design decisions.
  • Failure Diagnosis: 4 questions
  • Architecture Decisions: 3 questions

Summary & Takeaways

Key Takeaways
  1. Token bucket is the best general-purpose algorithm: it naturally handles bursts, is implementable atomically via Lua script, and has O(1) memory per key
  2. The core tension in distributed rate limiting is global accuracy vs local latency; windowed consumed counters with adaptive sync offer a practical bounded-drift trade-off
  3. Hot-key mitigation through automatic counter sharding is essential: a single large tenant should not degrade the experience for others
  4. Graceful degradation must be per-policy configurable: billing quotas should fail closed while abuse prevention can fail open
  5. Cross-region sync lag means global quotas are inherently approximate; design for 1-5% overshoot tolerance rather than exact enforcement
  6. Rate limiting is on the critical path of every API request; the rate limiter itself must never become the bottleneck or a single point of failure

If I Had More Time
  • Implement adaptive quotas that learn from traffic patterns and automatically adjust limits based on usage trends
  • Add fairness controls using weighted fair queuing to prevent quota starvation among sub-limits
  • Build a self-service quota management portal where customers can view usage, request increases, and configure alerts
  • Implement rate limit simulation that predicts the impact of policy changes before deployment
  • Add cost-based rate limiting where different endpoints consume different amounts of quota based on resource usage