API Rate-Limiting as a Multi-Region Service
Design a globally consistent rate limiting service with low latency and multi-region enforcement.
What You'll Learn
- •Token bucket and sliding window log algorithms: implementation trade-offs, burst handling, and precision guarantees
- •Distributed counter synchronization using gossip and windowed CRDT consumed counters for bounded global quota drift
- •Edge-cached policy evaluation with push-based invalidation for sub-5ms decision latency
- •Hot-key mitigation through counter sharding, local pre-aggregation, and adaptive splitting
- •Lua scripting in Redis for atomic token bucket operations (check-and-decrement in single round-trip)
- •Graceful degradation strategies: local-only mode, fail-open vs fail-closed policies during network partitions
- •Multi-tier rate limiting: per-user, per-API-key, per-IP, per-organization with hierarchical quota inheritance
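To make the first bullet concrete, here is a minimal sliding window log sketch in Python (class and parameter names are illustrative, not taken from any particular implementation):

```python
from collections import deque


class SlidingWindowLog:
    """Exact sliding-window limiter: stores one timestamp per admitted request.
    Precise (no boundary bursts), but O(limit) memory per key."""

    def __init__(self, limit: int, window: float):
        self.limit = limit      # max requests per window
        self.window = window    # window length in seconds
        self.log = deque()      # timestamps of admitted requests

    def allow(self, now: float) -> bool:
        # Evict timestamps that have aged out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The deque makes the precision trade-off explicit: every admitted request costs one stored timestamp, which is why this algorithm suits exact abuse-prevention counting rather than high-volume API quotas.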
Quick Context
A multi-region rate limiting service sits on the critical path of every API request, making sub-millisecond decisions about whether to allow or throttle traffic based on configurable quotas. The core tension is between global accuracy (enforcing exact quotas across all regions) and local latency (making decisions without cross-region round trips). At companies like Cloudflare, Stripe, and AWS API Gateway, rate limiters handle millions of decisions per second across dozens of geographic regions. The key challenges are: (1) choosing the right algorithm that balances burst tolerance with precision, (2) synchronizing distributed counters across regions without adding latency to the hot path, (3) handling hot keys where a single tenant generates disproportionate traffic, (4) gracefully degrading when regions lose connectivity while preventing quota abuse, and (5) supporting hierarchical rate limits (per-user within per-org within global) with efficient evaluation. Success metrics: p99 decision latency <5ms, global quota accuracy within 1%, zero false rejections during regional failover, and <100ms policy propagation for kill-switch scenarios.
- •Handle 2 million rate checks per second globally with p99 decision latency under 5ms
- •Enforce global quotas across 8 regions with at most 1% overshoot during normal operation
- •Support 1 million active rate limit policies with hierarchical inheritance (user < org < global)
- •Propagate policy changes (especially kill switches) to all regions within 100ms
Key Numbers
Requirements
Support token bucket (with configurable burst), sliding window log (exact counting), sliding window counter (approximate but memory-efficient), and fixed window algorithms
Why it matters: Different use cases need different precision/performance trade-offs - API quotas need burst tolerance while abuse prevention needs exact counting
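The memory trade-off behind this requirement can be illustrated with the sliding window counter approximation, which stores just two integers per key and weights the previous fixed window by its overlap with the sliding window. A simplified sketch (window alignment at time zero is assumed; a production version would align the first window to the first request):

```python
class SlidingWindowCounter:
    """Approximate sliding window: weights the previous fixed window by overlap.
    Stores two integers per key instead of one timestamp per request."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.curr_start = 0.0   # start of the current fixed window
        self.curr = 0           # count in the current fixed window
        self.prev = 0           # count in the previous fixed window

    def allow(self, now: float) -> bool:
        # Roll fixed windows forward as time passes.
        while now - self.curr_start >= self.window:
            self.curr_start += self.window
            self.prev, self.curr = self.curr, 0
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now - self.curr_start) / self.window
        estimated = self.prev * overlap + self.curr
        if estimated < self.limit:
            self.curr += 1
            return True
        return False
```

The estimate assumes requests in the previous window were evenly spread, which is why the algorithm is approximate but memory-efficient.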
Enforce rate limits at multiple levels: per-user, per-API-key, per-IP, per-organization, and global, with configurable inheritance and override rules
Why it matters: Enterprise customers need organization-level quotas that subdivide among their users while respecting global platform limits
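One way to enforce such a hierarchy is an all-or-nothing check across the resolved policy chain, sketched here in Python under the assumption that each level's remaining quota is available locally (function and key names are illustrative):

```python
def check_hierarchy(buckets: dict, chain: list) -> bool:
    """Allow only if every level (user -> org -> global) has quota.
    All-or-nothing: tokens are consumed only when the whole chain passes,
    so a denial at the org level does not burn the user's tokens."""
    # First pass: verify every level can afford the request.
    if any(buckets[key] < 1 for key in chain):
        return False
    # Second pass: consume one unit at each level.
    for key in chain:
        buckets[key] -= 1
    return True
```

The two-pass structure matters: consuming user-level tokens before discovering the org is exhausted would silently penalize users for their organization's limit.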
Synchronize counters across all 8 regions so that a tenant consuming quota in US-East cannot exceed their global limit by also consuming in EU-West
Why it matters: Without cross-region sync, a tenant with a 1000/min limit could effectively get 8000/min by distributing requests across regions
Allow configurable burst sizes above the sustained rate, with automatic recovery based on token replenishment rate
Why it matters: Legitimate traffic is bursty - denying short spikes degrades user experience and causes unnecessary retry storms
CRUD operations for rate limit policies with versioning, audit trail, approval workflows, and bulk import/export
Why it matters: Operations teams need to quickly adjust limits during incidents or onboard new customers with custom quotas
Return standard rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After) on every response
Why it matters: Clients need visibility into their quota usage to implement proper backoff and retry logic
Support per-request bypass tokens for internal services and temporary limit overrides for incident response
Why it matters: Internal health checks and monitoring should not count against quotas; incident response may need temporary limit increases
Architecture Evolution
Single-region rate limiter with Redis backend. Token bucket implemented via Lua script for atomic check-and-decrement. Suitable for <100K checks/sec with single-region deployment.
What Changed & Why
- •Single Redis instance with token bucket Lua script
- •Policy stored in PostgreSQL, cached in application memory
- •Fixed window counters as fallback for simple limits
- •Standard rate limit response headers on every response
- •Basic admin API for CRUD on policies
Key Decisions
Token bucket with configurable burst size and refill rate, with sliding window counter as secondary algorithm
Redis Cluster with hash-slot sharding and Lua scripting for atomic operations
Gossip-based windowed CRDT consumed counters with adaptive sync frequency per counter velocity
Push-based via Kafka with local edge cache and pull-based fallback
Automatic counter sharding with local pre-aggregation for detected hot keys
API Design
The Rate Limiting API provides endpoints for rate check decisions, policy management, and quota monitoring. The check endpoint is designed for ultra-low latency on the hot path, while management endpoints support comprehensive CRUD operations with audit trails.
Base URL
https://api.ratelimit.io/v1
Authentication
Rate check endpoints use mTLS for zero-overhead authentication. Management endpoints use API keys with RBAC. Admin operations require MFA-verified tokens.
Endpoints
/check
Webhooks
Fired when a rate limit policy is created, updated, or deleted
Payload
{
"event": "policy.updated",
"policy_id": "pol_xyz789",
"action": "updated",
"version": 2,
"changed_by": "admin@example.com",
"timestamp": "2024-01-15T10:30:00Z"
}
Fired when a key exhausts its quota (useful for alerting customers)
Payload
{
"event": "quota.exhausted",
"key": "user:usr_abc123:api:/v1/orders",
"policy_id": "pol_xyz789",
"limit": 1000,
"consumed": 1000,
"reset_at": "2024-01-15T10:31:00Z",
"timestamp": "2024-01-15T10:30:42Z"
}
Fired when automatic hot-key detection triggers counter sharding
Payload
{
"event": "hotkey.detected",
"key": "org:org_456:api:*",
"checks_per_second": 15000,
"shard_count": 4,
"region": "us-east-1",
"timestamp": "2024-01-15T10:30:00Z"
}
Code Samples
Lua script implementing token bucket with atomic check-and-consume. Runs entirely within Redis for single round-trip latency.
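The script body itself is not reproduced here; the following Python sketch mirrors the logic such a script would run, using the documented `rate:{level}:{id}` HASH fields (`tokens`, `last_refill`) and the argument order of the later EVALSHA example (limit, refill rate, current time in microseconds, cost, burst). This is an illustrative stand-in: in production the read-modify-write must execute inside Redis as a Lua script to stay atomic.

```python
def lua_token_bucket(state: dict, limit: int, refill_rate: float,
                     now_us: int, cost: int, burst: int) -> tuple:
    """Python rendering of the logic the Redis Lua script would run atomically.
    `state` mirrors the rate:{level}:{id} HASH: {tokens, last_refill (us)}."""
    capacity = limit + burst
    # A missing key means a fresh bucket: start full.
    tokens = state.get("tokens", capacity)
    last = state.get("last_refill", now_us)
    # Refill proportional to elapsed microseconds, capped at capacity.
    tokens = min(capacity, tokens + (now_us - last) / 1e6 * refill_rate)
    allowed = tokens >= cost
    if allowed:
        tokens -= cost
    state["tokens"] = tokens
    state["last_refill"] = now_us
    # Remaining tokens feed the X-RateLimit-Remaining response header.
    return allowed, tokens
```

Keeping both fields in one HASH is what lets the Lua script read, compute, and write in a single key access, giving the single round-trip latency described above.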
Data Model & Queries
-- Rate Limiting Platform Schema
-- Rate limit policies with hierarchical inheritance
CREATE TABLE rate_policies (
policy_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
description TEXT,
algorithm VARCHAR(50) NOT NULL DEFAULT 'token_bucket',
-- CHECK (algorithm IN ('token_bucket', 'sliding_window_log', 'sliding_window_counter', 'fixed_window')),
limit_value INT NOT NULL,
window_seconds INT NOT NULL,
burst_size INT NOT NULL DEFAULT 0,
refill_rate DECIMAL(10,4), -- tokens per second for token bucket
parent_policy_id UUID REFERENCES rate_policies(policy_id),
hierarchy_level VARCHAR(20) NOT NULL DEFAULT 'user',
-- CHECK (hierarchy_level IN ('user', 'api_key', 'organization', 'global')),
targeting_rules JSONB DEFAULT '{}',
is_active BOOLEAN NOT NULL DEFAULT true,
version INT NOT NULL DEFAULT 1,
created_by VARCHAR(255) NOT NULL,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_policies_active ON rate_policies (is_active, hierarchy_level);
CREATE INDEX idx_policies_parent ON rate_policies (parent_policy_id) WHERE parent_policy_id IS NOT NULL;
CREATE INDEX idx_policies_targeting ON rate_policies USING GIN (targeting_rules);
-- Policy audit log for compliance and debugging
CREATE TABLE policy_audit_log (
audit_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
policy_id UUID NOT NULL REFERENCES rate_policies(policy_id),
action VARCHAR(20) NOT NULL, -- 'created', 'updated', 'deleted', 'activated', 'deactivated'
old_values JSONB,
new_values JSONB,
changed_by VARCHAR(255) NOT NULL,
change_reason TEXT,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_audit_policy ON policy_audit_log (policy_id, created_at DESC);
CREATE INDEX idx_audit_time ON policy_audit_log (created_at DESC);
-- Counter sync state for cross-region reconciliation
CREATE TABLE counter_sync_state (
counter_key TEXT NOT NULL,
region VARCHAR(50) NOT NULL,
local_count BIGINT NOT NULL DEFAULT 0,
last_sync_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
sync_generation BIGINT NOT NULL DEFAULT 0,
PRIMARY KEY (counter_key, region)
);
-- NOW() is not immutable, so it cannot appear in a partial index predicate;
-- index last_sync_at and filter for staleness at query time instead.
CREATE INDEX idx_sync_stale ON counter_sync_state (last_sync_at);
-- Hot key tracking for automatic sharding decisions
CREATE TABLE hot_key_events (
event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
counter_key TEXT NOT NULL,
region VARCHAR(50) NOT NULL,
checks_per_second INT NOT NULL,
shard_count INT NOT NULL DEFAULT 1,
action VARCHAR(20) NOT NULL, -- 'detected', 'sharded', 'unsharded'
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_hotkey_recent ON hot_key_events (counter_key, created_at DESC);
-- Sampled decision log for troubleshooting and quota analytics
CREATE TABLE rate_check_log (
check_id BIGSERIAL PRIMARY KEY,
checked_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
counter_key TEXT NOT NULL,
policy_id UUID REFERENCES rate_policies(policy_id),
region VARCHAR(50) NOT NULL,
decision VARCHAR(16) NOT NULL, -- 'allowed', 'denied', 'degraded_allow'
checks_per_second INT,
observed_latency_ms DECIMAL(10,3)
);
CREATE INDEX idx_rate_check_denied_recent ON rate_check_log (checked_at DESC, counter_key)
WHERE decision = 'denied';
-- Redis key schemas (documented for operations)
-- rate:{level}:{id} => HASH {tokens: float, last_refill: timestamp}
-- rate:{level}:{id}:shard:{n} => HASH (sharded counter)
-- policy:{policy_id} => JSON (cached policy)
-- hotkey:{key} => INT (check count in current window)
- •Hierarchical rate_policies with parent_policy_id enables user -> org -> global quota inheritance
- •JSONB targeting_rules with GIN index allows flexible policy matching without schema changes
- •Separate policy_audit_log provides immutable compliance trail for every policy change
- •counter_sync_state tracks consumed counter state per region and generation for cross-region reconciliation
- •hot_key_events captures sharding decisions for post-incident analysis
- •Sampled rate_check_log supports short-horizon troubleshooting without storing every decision event
- •Redis HASH for token bucket state enables atomic Lua script operations in single key
- •Auto-TTL on Redis keys prevents memory bloat from expired counters
Resolve policy hierarchy for a rate check
WITH RECURSIVE policy_chain AS (
SELECT policy_id, parent_policy_id, limit_value, burst_size, hierarchy_level, 1 as depth
FROM rate_policies WHERE policy_id = $1 AND is_active = true
UNION ALL
SELECT p.policy_id, p.parent_policy_id, p.limit_value, p.burst_size, p.hierarchy_level, pc.depth + 1
FROM rate_policies p JOIN policy_chain pc ON p.policy_id = pc.parent_policy_id
WHERE p.is_active = true
)
SELECT * FROM policy_chain ORDER BY depth DESC;
Atomic token bucket check via Redis Lua
EVALSHA <sha> 1 rate:user:usr_abc123 1000 16.67 <now_us> 1 50
Find stale counter sync state (potential drift)
SELECT counter_key, region,
EXTRACT(EPOCH FROM NOW() - last_sync_at) as seconds_stale,
local_count
FROM counter_sync_state
WHERE last_sync_at < NOW() - INTERVAL '10 seconds'
ORDER BY last_sync_at ASC
LIMIT 100;
Top over-limit keys in last hour
SELECT counter_key, COUNT(*) as deny_count,
MAX(checks_per_second) as peak_rate
FROM rate_check_log
WHERE decision = 'denied' AND checked_at > NOW() - INTERVAL '1 hour'
GROUP BY counter_key
ORDER BY deny_count DESC
LIMIT 20;
idx_policies_active: Fast lookup of active policies by hierarchy level during rate checks
idx_policies_parent: Traverse policy hierarchy for inherited limits (partial index excludes root)
idx_policies_targeting: Match requests to policies using JSONB containment queries
idx_audit_policy: Retrieve audit history for a specific policy (compliance, debugging)
idx_sync_stale: Find counter keys with stale sync state for reconciliation
idx_hotkey_recent: Track hot-key sharding history for capacity planning
idx_rate_check_denied_recent: Investigate top denied keys quickly during incidents
Scaling & Bottlenecks
Primary Bottleneck
Hot-key counters cause single Redis node to saturate while other nodes are idle
Mitigation
Automatic counter sharding distributes hot keys across multiple Redis hash slots
What You'd Change
- •Deploy hot-key detector monitoring checks/sec per counter key
- •Implement automatic counter splitting when key exceeds 1000 checks/sec
- •Add in-process L1 cache with 100ms TTL to absorb repeated checks
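A hedged sketch of what the detector and shard router above could look like (the 1000 checks/sec threshold comes from the bullets; the md5-based routing and the shard cap are illustrative assumptions):

```python
import hashlib

SHARD_THRESHOLD = 1000  # checks/sec above which a key is split


def shard_for(key: str, shard_count: int, request_id: str) -> str:
    """Route a check to one of N sub-counters once a key is marked hot.
    Each shard gets limit/shard_count of the budget; local pre-aggregation
    merges shard totals back before cross-region sync."""
    if shard_count <= 1:
        return key
    h = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
    return f"{key}:shard:{h % shard_count}"


class HotKeyDetector:
    """Counts checks per key in the current 1-second window and
    recommends a shard count when the rate crosses the threshold."""

    def __init__(self):
        self.window_start = 0.0
        self.counts = {}

    def observe(self, key: str, now: float) -> int:
        if now - self.window_start >= 1.0:
            self.counts.clear()
            self.window_start = now
        self.counts[key] = self.counts.get(key, 0) + 1
        rate = self.counts[key]
        # Double the shard count while the rate exceeds threshold * shards,
        # capped to avoid unbounded splitting.
        shards = 1
        while rate >= SHARD_THRESHOLD * shards and shards < 16:
            shards *= 2
        return shards
```

Routing on a per-request value (rather than the key) is what spreads a single hot tenant across multiple Redis hash slots.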
Primary Bottleneck
Cross-region counter drift allows up to 15% overshoot during sync windows
Mitigation
Reduce sync interval for high-velocity counters and implement generation-aware merge of per-region consumed deltas
What You'd Change
- •Migrate from fixed-interval gossip to adaptive sync based on counter velocity
- •Implement windowed consumed counters with generation IDs for conflict-free cross-region merging
- •Add priority sync channel for counters approaching their limits (>80% utilization)
- •Deploy regional counter aggregators to reduce fan-out overhead
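The windowed, generation-aware merge described above behaves like a per-region G-counter: within a window generation, each region's consumed count only grows, so taking the per-region maximum is conflict-free regardless of message ordering. A minimal sketch (the data shapes are assumptions):

```python
def merge_consumed(views: list) -> dict:
    """Generation-aware merge of per-region consumed counters.
    Each view maps region -> (generation, consumed). Within a generation
    counts only grow, and a higher generation means the window rolled,
    so tuple-wise max per region is a conflict-free merge."""
    merged = {}
    for view in views:
        for region, (gen, consumed) in view.items():
            cur = merged.get(region)
            if cur is None or (gen, consumed) > cur:
                merged[region] = (gen, consumed)
    return merged


def global_consumed(merged: dict) -> int:
    """Global consumption in the current window: sum of merged regional maxima."""
    return sum(consumed for _, consumed in merged.values())
```

Because the merge is order-independent and idempotent, regions can gossip views at different frequencies (the adaptive sync above) without coordination.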
Primary Bottleneck
Counter store memory exceeds single-cluster capacity with 1B+ active keys
Mitigation
Implement tiered counter storage with hot/warm partitioning and probabilistic counting for long-tail
What You'd Change
- •Partition counters: hot (>10 checks/sec) in Redis, warm (<10/sec) in DynamoDB
- •Use Count-Min Sketch for approximate counting of long-tail keys
- •Deploy counter pre-allocation for predictable daily traffic patterns
- •Implement counter compaction: merge idle sub-counters back to single key
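The Count-Min Sketch mentioned above trades a bounded overestimate for fixed memory: it never underestimates, so errors bias toward denying, which is acceptable for cold long-tail keys. A minimal sketch (the width/depth values and md5-derived hashing are illustrative):

```python
import hashlib


class CountMinSketch:
    """Approximate counter with fixed memory regardless of key cardinality.
    Estimates are always >= the true count (one-sided error)."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key: str):
        # One independent-ish hash per row, derived by salting with the row id.
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, key: str, n: int = 1):
        for row, col in self._indexes(key):
            self.table[row][col] += n

    def estimate(self, key: str) -> int:
        # Collisions only inflate cells, so the minimum across rows
        # is the tightest (still never-under) estimate.
        return min(self.table[row][col] for row, col in self._indexes(key))
```

Memory here is width * depth counters total, versus one counter per key: exactly the property that makes it viable for a billion-key long tail.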
Failure Scenarios
Monitoring & Observability
Comprehensive monitoring should tie serving latency, quota correctness, sync freshness, and region-level resilience into one operational view with fast diagnosis paths.
ratelimit_check_latency (histogram): End-to-end latency of rate check decisions
ratelimit_decision_total (counter): Total rate limit decisions by result (allowed, denied, degraded_allow, error)
policy_cache_hit_rate (gauge): Percentage of policy lookups served from cache
counter_sync_lag (gauge): Global p99 lag between local consumed counters and merged cross-region view
sync_lag_per_region (gauge): Per-region lag to latest sync generation
quota_overshoot_pct (gauge): Percentage by which global consumption exceeds quota limit
hot_key_count (gauge): Number of counter keys classified as hot (>1K checks/sec)
checks_per_region (gauge): Request checks served per second in each region
redis_health_per_region (gauge): Composite region score from primary health, replica lag, and failover state
deny_ratio_by_policy (gauge): Per-policy denied/total ratio used to detect misconfigured quotas
Rate check latency exceeding SLO, may impact API response times
ratelimit_check_latency_p99 > 10ms for 2m
Runbook: Check Redis cluster health, policy cache hit rate, and counter shard balance
Cross-region sync stale, global quotas may be inaccurate
counter_sync_lag > 5000ms for 1m
Runbook: Inspect sync transport, regional gossip fanout, and generation advancement
One or more regions are lagging badly and can over-consume quota
max(sync_lag_per_region) > 8000ms for 3m
Runbook: Isolate lagging region, force full sync replay, temporarily tighten local quota budget
Global quota overshoot exceeding tolerance, possible abuse or sync issue
quota_overshoot_pct > 10% for 5m
Runbook: Trigger immediate counter reconciliation, check hot-key traffic, and escalate high-priority sync lane
Redis counter store unavailable, rate limiter in degraded mode
redis_connection_errors > 10 in 1m
Runbook: Check Redis failover status, enable conservative in-process fallback counters, and watch degraded_allow ratio
Policy cache hit rate collapse likely causing DB thundering herd
policy_cache_hit_rate < 90 for 2m
Runbook: Enable stale-while-revalidate, inject cache jitter, and rate-limit bulk invalidations
One or more policies are denying unexpectedly high traffic
max(deny_ratio_by_policy) > 0.7 for 10m
Runbook: Compare current policy version to previous, review rollout intent, and simulate policy impact
Serving SLO
Quota Correctness
Regional Reliability
Hot Key Pressure
Scale Calculator
Estimate quota-check throughput, Redis memory footprint, sync bandwidth, and monthly spend for multi-region rate limiting. Inputs separate key cardinality from active traffic so hot-key cost behavior is realistic.
* Estimates based on simplified AWS pricing. Actual costs may vary.
Cost & Capacity
Test Your Understanding
Failure Diagnosis
Architecture Decisions
Summary & Takeaways
- 1.Token bucket is the best general-purpose algorithm: it naturally handles bursts, is implementable atomically via Lua script, and has O(1) memory per key
- 2.The core tension in distributed rate limiting is global accuracy vs local latency - windowed consumed counters with adaptive sync offer a practical bounded-drift trade-off
- 3.Hot-key mitigation through automatic counter sharding is essential - a single large tenant should not degrade the experience for others
- 4.Graceful degradation must be per-policy configurable: billing quotas should fail closed while abuse prevention can fail open
- 5.Cross-region sync lag means global quotas are inherently approximate - design for 1-5% overshoot tolerance rather than exact enforcement
- 6.Rate limiting is on the critical path of every API request - the rate limiter itself must never become the bottleneck or single point of failure
- •Implement adaptive quotas that learn from traffic patterns and automatically adjust limits based on usage trends
- •Add fairness controls using weighted fair queuing to prevent quota starvation among sub-limits
- •Build a self-service quota management portal where customers can view usage, request increases, and configure alerts
- •Implement rate limit simulation that predicts the impact of policy changes before deployment
- •Add cost-based rate limiting where different endpoints consume different amounts of quota based on resource usage