Tags: intermediate, feature-flags, edge, caching, consistency, experimentation, sdk-design, targeting

Feature Flag Evaluation Engine at Scale

Design a low-latency feature flag evaluation system with targeting rules, percentage rollouts, A/B testing integration, kill switches, and multi-region consistency.

What You'll Learn

  • Deterministic bucketing using MurmurHash3 for consistent user experience across SDKs and sessions
  • Edge evaluation architecture with streaming config updates for sub-5ms latency at global scale
  • Targeting rule engine design with segments, attributes, and boolean logic (AND/OR/NOT)
  • Kill switch implementation with sub-5s global propagation using Kafka and SSE
  • A/B testing integration with exposure event tracking and experiment analysis
  • SDK design patterns: server-side full evaluation vs client-side pre-computed flags
  • Stale flag detection using evaluation statistics and automated cleanup workflows
  • Multi-region consistency vs availability trade-offs in feature flag systems

Interview Simulation

Run a timed mock interview for this project and get a scored debrief.

Quick Context

Problem

Feature flags have become critical infrastructure, enabling trunk-based development, gradual rollouts, instant rollbacks, and experimentation without code deployments. The system must evaluate millions of flags per second with sub-5ms latency, maintain consistent user bucketing across all SDKs (server, mobile, web), propagate kill switches globally in under 5 seconds, and remain resilient during regional failures.

Key challenges:

  • Designing a targeting rule engine that supports complex boolean conditions
  • Implementing deterministic bucketing that produces identical results across all SDK implementations
  • Building an edge evaluation layer that scales horizontally while maintaining cache consistency
  • Integrating with A/B testing platforms for proper exposure tracking

Success metrics: p99 evaluation latency <5ms, kill switch propagation <5s, zero inconsistent bucket assignments across SDKs, and 99.99% evaluation availability.

Constraints & Assumptions (9 items)

  • Support 2,000,000 evaluations/sec peak with p99 <5ms local evaluation latency
  • Propagate kill switches to all edge nodes and SDKs globally within 5 seconds
  • Guarantee identical bucketing results across all SDK implementations (Java, Go, Python, Node, iOS, Android, JS)
  • Handle 500,000 active flags with 50,000 config updates per day

Key Numbers

  • Evaluations: 2.0M/sec (peak throughput)
  • Latency: p99 <5ms (local eval time)
  • Flags: 500K (active flags)
  • Regions: 10 (global POPs)
  • Kill Switch: <5s (global propagation)
  • Availability: 99.99% (eval uptime)
  • Bootstrap: p99 <50ms (SDK init time)

Requirements

Flag Evaluation

Evaluate boolean, multivariate (string/number), and JSON flags with configurable default values

Why it matters: Core capability - flags control feature access for all users

Targeting Rules

Support user segments, attribute-based targeting (country, plan, version), and complex boolean conditions (AND/OR/NOT)

Why it matters: Enables precise control over which users see which features

Percentage Rollouts

Deterministic percentage-based rollouts using stable hashing - users always see the same variant

Why it matters: Gradual rollouts require consistent user experience across sessions

Kill Switch

Immediate global disable that overrides all targeting rules with propagation under 5s

Why it matters: Critical for incident response - must be able to turn off broken features instantly

A/B Testing Integration

Emit exposure events with user, flag, variant, and timestamp for experiment analysis

Why it matters: Product teams need accurate data to measure feature impact

SDK Support

Server SDKs (Java, Go, Python, Node), mobile SDKs (iOS, Android), and client-side JS SDK with offline support

Why it matters: Must work across all platforms with consistent behavior

Stale Flag Detection

Identify flags with no evaluations in 30+ days or 100% rollout for 90+ days

Why it matters: Technical debt from unused flags creates maintenance burden and confusion

Audit & History

Track all changes with actor, timestamp, previous/new state, and approval workflow integration

Why it matters: Compliance and debugging require knowing who changed what and when

Scheduled Rollouts

Support time-based activation/deactivation and gradual percentage increases

Why it matters: Coordinate releases with marketing launches and reduce manual operations

Architecture Evolution

Single-region flag API with in-memory cache and polling SDKs. Handles 100-1,000 users, 10K evals/sec, at $100-300/month. Good for startups validating product-market fit.



What Changed & Why

  • Single flag service with PostgreSQL backend for flag storage
  • In-memory cache (Caffeine/Guava) with 30-second TTL for flag configs
  • SDKs poll flag service every 30 seconds for config updates
  • Simple percentage rollout using user ID hash modulo 100
  • Basic boolean and string flag types only
  • Synchronous audit log writes to same PostgreSQL database

Key Decisions

  • Evaluation Location (edge vs origin vs client; 3 alternatives): edge evaluation with local rule cache at regional POPs
  • Config Propagation (polling vs streaming vs push; 4 alternatives): streaming updates via Kafka with SSE to SDKs
  • Bucketing Algorithm (hash function selection; 4 alternatives): MurmurHash3 (32-bit) with modulo 100,000 for 0.001% precision
  • Targeting Rule Storage (JSON vs DSL vs SQL; 3 alternatives): JSON rules in PostgreSQL JSONB with an in-memory rule engine
  • Audit Log (sync vs async vs event sourcing; 3 alternatives): async append-only log with Kafka + data warehouse

API Design

The Feature Flag API provides management endpoints, SDK bootstrap with ETag support, streaming config updates, and server-side evaluation for untrusted clients.

Base URL

https://api.featureflags.io/v1

Authentication

SDK Key or Bearer Token

Server SDKs use SDK keys (sdk-server-xxx). Client SDKs use client-side keys (sdk-client-xxx) which have limited permissions. Management API uses OAuth2 Bearer tokens.

Endpoints

GET  /sdk/flags/{projectKey}               Get all flags for SDK bootstrap
GET  /sdk/stream/{projectKey}              SSE stream for real-time config updates
POST /sdk/evaluate/{projectKey}/{flagKey}  Evaluate a single flag (client SDK)
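The bootstrap and streaming endpoints above imply an SDK-side config store. Here is a minimal sketch, assuming the bootstrap endpoint returns a JSON flag map and honors `If-None-Match` (per the ETag support mentioned earlier); `FlagStore` and the update-event shape are illustrative, not the documented SDK API:

```typescript
type FlagConfig = Record<string, unknown>;

class FlagStore {
  private flags: FlagConfig = {};
  private etag: string | null = null;

  // Bootstrap with ETag: a 304 response means the cached config is current.
  async bootstrap(baseUrl: string, projectKey: string, sdkKey: string): Promise<void> {
    const headers: Record<string, string> = { Authorization: sdkKey };
    if (this.etag) headers["If-None-Match"] = this.etag;
    const res = await fetch(`${baseUrl}/sdk/flags/${projectKey}`, { headers });
    if (res.status === 304) return; // cached config still valid
    this.etag = res.headers.get("ETag");
    this.flags = await res.json();
  }

  // Apply one streamed update event to the local cache.
  applyUpdate(event: { flag: { key: string }; config?: unknown }): void {
    this.flags[event.flag.key] = event.config ?? null;
  }

  get(flagKey: string, defaultValue: unknown): unknown {
    return this.flags[flagKey] ?? defaultValue;
  }
}
```

In practice the SSE stream from `/sdk/stream/{projectKey}` would feed `applyUpdate`, with the bootstrap call retried on reconnect.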

Webhooks

EVENT
flag.updated

Fired when any flag configuration changes

Payload

{
  "event": "flag.updated",
  "flag": {"id": "flag-abc", "key": "new-feature"},
  "changes": ["enabled", "rules"],
  "actor": {"id": "user-123", "email": "dev@example.com"},
  "timestamp": "2024-01-15T14:30:00Z"
}
EVENT
flag.kill_switch

Fired when a kill switch is activated or deactivated

Payload

{
  "event": "flag.kill_switch",
  "flag": {"id": "flag-abc", "key": "broken-feature"},
  "killSwitchEnabled": true,
  "reason": "Production incident",
  "actor": {"id": "user-123", "email": "oncall@example.com"},
  "timestamp": "2024-01-15T14:30:00Z"
}
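A consumer of the `flag.kill_switch` webhook might page the on-call when a switch flips. A hypothetical handler sketch, using the payload fields shown above (`notifyOnCall` is an assumed hook, not part of the documented API):

```typescript
interface KillSwitchEvent {
  event: string;
  flag: { id: string; key: string };
  killSwitchEnabled: boolean;
  reason?: string;
}

// Returns true if the event was a kill-switch event and was acted upon.
function handleWebhook(body: string, notifyOnCall: (msg: string) => void): boolean {
  const evt = JSON.parse(body) as KillSwitchEvent;
  if (evt.event !== "flag.kill_switch") return false; // ignore other event types
  const state = evt.killSwitchEnabled ? "ACTIVATED" : "deactivated";
  notifyOnCall(`Kill switch ${state} for ${evt.flag.key}: ${evt.reason ?? "no reason given"}`);
  return true;
}
```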

Code Samples

TypeScript: Deterministic Bucketing (production)

MurmurHash3-based bucketing algorithm that produces consistent results across all SDK implementations.
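A sketch of that algorithm: MurmurHash3 (x86, 32-bit) over `flagKey.salt.userId`, reduced modulo 100,000 buckets per the bucketing decision above. This is an illustrative reference implementation (it assumes single-byte key characters; a real cross-language SDK would encode to UTF-8 and verify byte-exact test vectors in every language):

```typescript
function murmur3_32(key: string, seed = 0): number {
  const c1 = 0xcc9e2d51;
  const c2 = 0x1b873593;
  let h1 = seed >>> 0;
  const len = key.length;
  const roundedEnd = len & ~3;

  // Body: mix four bytes at a time
  for (let i = 0; i < roundedEnd; i += 4) {
    let k = (key.charCodeAt(i) & 0xff) |
      ((key.charCodeAt(i + 1) & 0xff) << 8) |
      ((key.charCodeAt(i + 2) & 0xff) << 16) |
      ((key.charCodeAt(i + 3) & 0xff) << 24);
    k = Math.imul(k, c1);
    k = (k << 15) | (k >>> 17);
    k = Math.imul(k, c2);
    h1 ^= k;
    h1 = (h1 << 13) | (h1 >>> 19);
    h1 = (Math.imul(h1, 5) + 0xe6546b64) | 0;
  }

  // Tail: remaining 1-3 bytes (cumulative, mirroring the reference fallthrough)
  let k1 = 0;
  const rem = len & 3;
  if (rem === 3) k1 ^= (key.charCodeAt(roundedEnd + 2) & 0xff) << 16;
  if (rem >= 2) k1 ^= (key.charCodeAt(roundedEnd + 1) & 0xff) << 8;
  if (rem >= 1) {
    k1 ^= key.charCodeAt(roundedEnd) & 0xff;
    k1 = Math.imul(k1, c1);
    k1 = (k1 << 15) | (k1 >>> 17);
    k1 = Math.imul(k1, c2);
    h1 ^= k1;
  }

  // Finalization: avalanche the bits
  h1 ^= len;
  h1 ^= h1 >>> 16;
  h1 = Math.imul(h1, 0x85ebca6b);
  h1 ^= h1 >>> 13;
  h1 = Math.imul(h1, 0xc2b2ae35);
  h1 ^= h1 >>> 16;
  return h1 >>> 0;
}

const BUCKETS = 100_000; // 0.001% precision, per the bucketing decision

function bucketUser(userId: string, flagKey: string, salt: string): number {
  return murmur3_32(`${flagKey}.${salt}.${userId}`) % BUCKETS;
}

function inRollout(userId: string, flagKey: string, salt: string, percent: number): boolean {
  return bucketUser(userId, flagKey, salt) < percent * (BUCKETS / 100);
}
```

Including the flag key and a per-flag salt in the hash input decorrelates rollouts across flags, and changing the salt re-randomizes assignments for a rerun experiment.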

TypeScript: Stale Flag Detection (production)

Identify flags that are candidates for cleanup based on evaluation patterns.
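A sketch of the staleness heuristic stated in the requirements (no evaluations in 30+ days, or 100% rollout held for 90+ days); the `FlagActivity` type is illustrative:

```typescript
interface FlagActivity {
  lastEvaluatedAt: Date | null;  // null = never evaluated
  rolloutPercent: number;
  fullRolloutSince: Date | null; // when rollout reached 100%
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns a human-readable reason if the flag is a cleanup candidate, else null.
function staleReason(f: FlagActivity, now: Date): string | null {
  const noEvals =
    f.lastEvaluatedAt === null ||
    now.getTime() - f.lastEvaluatedAt.getTime() > 30 * DAY_MS;
  if (noEvals) return "no evaluations in 30+ days";
  if (
    f.rolloutPercent === 100 &&
    f.fullRolloutSince !== null &&
    now.getTime() - f.fullRolloutSince.getTime() > 90 * DAY_MS
  ) {
    return "100% rollout for 90+ days";
  }
  return null;
}
```

A cleanup job would feed this from the `flag_evaluation_stats` table and open removal tickets rather than deleting flags automatically.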

Data Model & Queries

Schema
SQL
-- Core schema for feature flag engine (PostgreSQL)

-- Projects organize flags by team/application
CREATE TABLE projects (
  project_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  key VARCHAR(50) UNIQUE NOT NULL,
  name VARCHAR(100) NOT NULL,
  description TEXT,
  settings JSONB DEFAULT '{}',
  created_at TIMESTAMP NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);

-- Flag definitions with versioning
CREATE TABLE flags (
  flag_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id UUID NOT NULL REFERENCES projects(project_id),
  key VARCHAR(100) NOT NULL,
  name VARCHAR(200) NOT NULL,
  description TEXT,
  flag_type VARCHAR(20) NOT NULL CHECK (flag_type IN ('boolean', 'string', 'number', 'json')),

  -- State
  enabled BOOLEAN NOT NULL DEFAULT false,
  kill_switch_enabled BOOLEAN NOT NULL DEFAULT false,
  archived BOOLEAN NOT NULL DEFAULT false,

  -- Variants (e.g., [{key: "on", value: true}, {key: "off", value: false}])
  variants JSONB NOT NULL,
  off_variant VARCHAR(50) NOT NULL,

  -- Default rule when no targeting rules match
  default_rule JSONB NOT NULL,

  -- Bucketing salt (change to re-randomize experiment assignments)
  salt VARCHAR(50) NOT NULL DEFAULT gen_random_uuid()::text,

  -- Versioning
  version BIGINT NOT NULL DEFAULT 1,
  config_version VARCHAR(50) NOT NULL DEFAULT '1',

  -- Metadata
  tags TEXT[] DEFAULT '{}',
  created_by UUID NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMP NOT NULL DEFAULT NOW(),

  UNIQUE(project_id, key)
);

-- Targeting rules evaluated in priority order
CREATE TABLE flag_rules (
  rule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  flag_id UUID NOT NULL REFERENCES flags(flag_id) ON DELETE CASCADE,
  priority INT NOT NULL,
  name VARCHAR(100),
  description TEXT,

  -- Conditions expression tree (supports nested AND/OR/NOT)
  -- Example:
  -- {"op":"AND","clauses":[{"attribute":"country","op":"in","values":["US","CA"]},{"op":"NOT","clause":{"attribute":"app_version","op":"sem_ver_lt","value":"2.3.0"}}]}
  conditions JSONB NOT NULL,

  -- Result when matched
  variation_key VARCHAR(50),  -- Serve specific variant
  rollout JSONB,              -- Or percentage rollout

  enabled BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMP NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMP NOT NULL DEFAULT NOW(),

  UNIQUE(flag_id, priority)
);

-- Reusable user segments
CREATE TABLE segments (
  segment_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id UUID NOT NULL REFERENCES projects(project_id),
  key VARCHAR(100) NOT NULL,
  name VARCHAR(200) NOT NULL,
  description TEXT,

  -- Rule-based membership
  rules JSONB NOT NULL DEFAULT '[]',

  -- Explicit user lists (for small segments, testing)
  included_users TEXT[] DEFAULT '{}',
  excluded_users TEXT[] DEFAULT '{}',

  -- For large user lists, use external reference
  user_list_url TEXT,  -- S3 URL for large lists
  user_count_approx INT DEFAULT 0,

  version BIGINT NOT NULL DEFAULT 1,
  created_at TIMESTAMP NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMP NOT NULL DEFAULT NOW(),

  UNIQUE(project_id, key)
);

-- Audit log (append-only, partitioned by month)
CREATE TABLE flag_audit_log (
  audit_id UUID NOT NULL DEFAULT gen_random_uuid(),
  flag_id UUID NOT NULL,
  project_id UUID NOT NULL,

  -- Actor
  actor_id UUID NOT NULL,
  actor_type VARCHAR(20) NOT NULL CHECK (actor_type IN ('user', 'api_key', 'system', 'scheduled')),
  actor_email VARCHAR(255),

  -- Action
  action VARCHAR(50) NOT NULL,

  -- Change details
  previous_state JSONB,
  new_state JSONB,

  -- Context
  ip_address INET,
  user_agent TEXT,
  approval_id UUID,
  comment TEXT,

  created_at TIMESTAMP NOT NULL DEFAULT NOW(),

  -- On partitioned tables the primary key must include the partition column
  PRIMARY KEY (audit_id, created_at)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE flag_audit_log_2024_01 PARTITION OF flag_audit_log
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Scheduled flag changes
CREATE TABLE scheduled_changes (
  schedule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  flag_id UUID NOT NULL REFERENCES flags(flag_id),
  project_id UUID NOT NULL,

  change_type VARCHAR(30) NOT NULL,
  scheduled_at TIMESTAMP NOT NULL,
  change_payload JSONB NOT NULL,

  status VARCHAR(20) NOT NULL DEFAULT 'pending'
    CHECK (status IN ('pending', 'executed', 'cancelled', 'failed')),

  created_by UUID NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT NOW(),
  executed_at TIMESTAMP,
  execution_error TEXT
);

-- Flag evaluation statistics for stale detection
CREATE TABLE flag_evaluation_stats (
  flag_id UUID NOT NULL REFERENCES flags(flag_id),
  date DATE NOT NULL,

  evaluation_count BIGINT NOT NULL DEFAULT 0,
  unique_users_hll BYTEA,  -- HyperLogLog for unique user approximation

  -- Variant distribution
  variant_counts JSONB DEFAULT '{}',

  -- Timing
  first_eval_at TIMESTAMP,
  last_eval_at TIMESTAMP,

  PRIMARY KEY (flag_id, date)
);

-- Indexes for common queries
CREATE INDEX idx_flags_project_key ON flags(project_id, key);
CREATE INDEX idx_flags_updated ON flags(updated_at) WHERE NOT archived;
CREATE INDEX idx_flags_stale ON flags(updated_at, enabled) WHERE NOT archived;
CREATE INDEX idx_flag_rules_flag ON flag_rules(flag_id, priority);
CREATE INDEX idx_segments_project ON segments(project_id);
CREATE INDEX idx_audit_flag ON flag_audit_log(flag_id, created_at DESC);
CREATE INDEX idx_audit_project ON flag_audit_log(project_id, created_at DESC);
CREATE INDEX idx_scheduled_pending ON scheduled_changes(scheduled_at)
  WHERE status = 'pending';
CREATE INDEX idx_eval_stats_flag ON flag_evaluation_stats(flag_id, date DESC);

-- Redis key schema (documented for SDK/edge reference)
-- flag:{project_key}:{flag_key} => JSON flag config
-- segment:{project_key}:{segment_key} => JSON segment definition
-- flag_version:{project_key} => latest config version hash
-- kill_switch:{project_key}:{flag_key} => "1" if active
-- user_segment:{segment_id}:{user_id} => "1" if user in segment (bloom filter)
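The `conditions` JSONB stored in `flag_rules` can be evaluated with a small recursive matcher. A sketch of the in-memory rule engine named in the decisions above, covering nested AND/OR/NOT plus two leaf operators (`in`, `eq`); other operators such as `sem_ver_lt` are omitted, and the types are illustrative:

```typescript
type Condition =
  | { op: "AND" | "OR"; clauses: Condition[] }
  | { op: "NOT"; clause: Condition }
  | { attribute: string; op: "in"; values: string[] }
  | { attribute: string; op: "eq"; value: string };

type UserAttrs = Record<string, string>;

// Recursively evaluate a condition tree against a user's attributes.
function matches(c: Condition, user: UserAttrs): boolean {
  if ("clauses" in c) {
    return c.op === "AND"
      ? c.clauses.every((cl) => matches(cl, user))
      : c.clauses.some((cl) => matches(cl, user));
  }
  if (c.op === "NOT") return !matches(c.clause, user);
  const actual = user[c.attribute];
  if (c.op === "in") return c.values.includes(actual);
  return actual === c.value;
}
```

Rules loaded from `flag_rules` would be evaluated in `priority` order, with the first match determining the served variant or rollout.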
Why This Schema
  • Flags table stores complete flag configuration with versioning for optimistic locking
  • JSONB for variants and rules enables flexible schema evolution without migrations
  • Separate flag_rules table allows complex rule ordering with priority column
  • Segments as first-class entities enable reuse across multiple flags
  • Audit log partitioned by month for efficient retention management and fast recent queries
  • Evaluation stats table enables stale flag detection with HyperLogLog for unique users
  • Scheduled changes table supports time-based rollouts and automated operations
  • Redis keys documented for SDK and edge evaluator reference
Common Queries

Fetch all flags for a project (SDK bootstrap)

SQL
SELECT f.*, array_agg(r.* ORDER BY r.priority) AS rules
FROM flags f
LEFT JOIN flag_rules r ON f.flag_id = r.flag_id
WHERE f.project_id = $1 AND NOT f.archived
GROUP BY f.flag_id;

Get flag with rules by key

SQL
SELECT f.*, json_agg(r ORDER BY r.priority) AS rules
FROM flags f
LEFT JOIN flag_rules r ON f.flag_id = r.flag_id
WHERE f.project_id = $1 AND f.key = $2
GROUP BY f.flag_id;

Find stale flags (no evals in 30 days)

SQL
SELECT f.*
FROM flags f
WHERE f.enabled = true
  AND NOT f.archived
  AND NOT EXISTS (
    SELECT 1 FROM flag_evaluation_stats s
    WHERE s.flag_id = f.flag_id
      AND s.date > CURRENT_DATE - INTERVAL '30 days'
      AND s.evaluation_count > 0
  );

Recent audit history for a flag

SQL
SELECT * FROM flag_audit_log WHERE flag_id = $1 ORDER BY created_at DESC LIMIT 50;

Redis: Get flag config

Redis/Bash
GET flag:{project_key}:{flag_key}

Redis: Check kill switch

Redis/Bash
EXISTS kill_switch:{project_key}:{flag_key}

Redis: Invalidate project config

Redis/Bash
DEL flag_version:{project_key}
Index Rationale
  • idx_flags_project_key: fast flag lookup by project and key (most common query)
  • idx_flags_updated: find recently changed flags for cache invalidation
  • idx_flags_stale: query for stale flag detection job
  • idx_flag_rules_flag: load rules in priority order for evaluation
  • idx_audit_flag: show recent changes for a specific flag
  • idx_scheduled_pending: find due scheduled changes for executor job
  • idx_eval_stats_flag: fetch evaluation history for stale detection

Scaling & Bottlenecks

Now

Primary Bottleneck

Cache miss rate on edge evaluators causing origin database load spikes

Mitigation

Implement cache warming and increase TTL for stable flags

What You'd Change

  • Preload hot flags (top 1000 by eval count) on edge startup
  • Stagger cache TTL expiration to prevent thundering herd (jitter 10-20%)
  • Add L2 cache tier in Redis for warm flags (30s in-process, 5m in Redis)
  • Implement background refresh: update cache before TTL expires
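The TTL jitter mentioned above can be sketched in a few lines: each cache entry gets a uniformly random 10-20% extension so entries written at the same moment expire at different times (the injectable `rand` parameter is only for testability):

```typescript
// Returns a TTL extended by a uniform random factor in [1.10, 1.20),
// so simultaneous cache fills don't all expire (and refill) together.
function jitteredTtlMs(baseTtlMs: number, rand: () => number = Math.random): number {
  const jitter = 0.10 + 0.10 * rand(); // uniform in [10%, 20%)
  return Math.round(baseTtlMs * (1 + jitter));
}
```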
2× Scale

Primary Bottleneck

Config update fanout causing Kafka consumer lag >5s during burst updates

Mitigation

Partition update stream by project and add regional consumers

What You'd Change

  • Increase Kafka partitions from 16 to 64, key by project_id
  • Deploy consumer groups per region for parallel processing
  • Implement debouncing: batch rapid updates (wait 100ms) before publishing
  • Add priority queue for kill switch updates (bypass normal processing)
10× Scale

Primary Bottleneck

Exposure event ingestion overwhelming analytics pipeline at 2M events/sec

Mitigation

Implement client-side sampling and async batching with backpressure

What You'd Change

  • Add configurable sampling rate per flag (100% for experiments, 1% for monitoring)
  • Batch exposure events in SDK (flush every 1s or 100 events)
  • Implement backpressure: drop events if queue exceeds threshold (with metric)
  • Use columnar storage (ClickHouse/Druid) for exposure analytics
  • Add client-side deduplication window (don't re-emit same user+flag in 1 hour)
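The batching, dedup-window, and backpressure ideas above can be combined in one small SDK-side component. A sketch with an injected transport; class and field names are illustrative, not a documented SDK API (a real SDK would also flush on a 1s timer and emit a drop metric):

```typescript
interface Exposure { userId: string; flagKey: string; variant: string; ts: number; }

class ExposureBatcher {
  private queue: Exposure[] = [];
  private seen = new Map<string, number>(); // user+flag -> last emit timestamp

  constructor(
    private send: (batch: Exposure[]) => void,
    private maxBatch = 100,
    private dedupWindowMs = 60 * 60 * 1000, // don't re-emit same user+flag in 1h
    private maxQueue = 10_000,              // backpressure threshold
  ) {}

  // Returns false when the event is deduplicated or dropped under backpressure.
  track(e: Exposure): boolean {
    const key = `${e.userId}:${e.flagKey}`;
    const last = this.seen.get(key);
    if (last !== undefined && e.ts - last < this.dedupWindowMs) return false;
    if (this.queue.length >= this.maxQueue) return false; // drop; count as metric in a real SDK
    this.seen.set(key, e.ts);
    this.queue.push(e);
    if (this.queue.length >= this.maxBatch) this.flush();
    return true;
  }

  flush(): void {
    if (this.queue.length === 0) return;
    this.send(this.queue.splice(0)); // hand off and clear the queue
  }
}
```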

Failure Scenarios

Monitoring & Observability

Feature flag monitoring should simultaneously track serving SLOs, config propagation integrity, SDK health, and experiment data quality.

Key Metrics
flag_evaluation_duration_seconds (histogram)
Time to evaluate a single flag including rule matching.
Good: p99 < 0.005 | Warning: p99 0.005-0.020 | Critical: p99 > 0.020

flag_evaluations_total (counter)
Total evaluation throughput across all edge regions.
Good: within expected baseline | Warning: deviation >20% | Critical: deviation >35%

flag_config_propagation_lag_seconds (gauge)
Time between control plane write and edge cache update.
Good: < 2 | Warning: 2-10 | Critical: > 10

flag_kill_switch_propagation_seconds (histogram)
Time for kill switch to reach all edge nodes.
Good: p99 < 5 | Warning: p99 5-15 | Critical: p99 > 15

kafka_consumer_group_lag (gauge)
Pending update events in propagation consumer groups.
Good: < 10K | Warning: 10K-100K | Critical: > 100K

flag_cache_hit_rate (gauge)
Percentage of evaluations served from cache.
Good: > 99% | Warning: 95-99% | Critical: < 95%

flag_exposure_events_dropped (counter)
Exposure events dropped due to backpressure.
Good: < 0.1% | Warning: 0.1-1% | Critical: > 1%

flag_variant_distribution_skew (gauge)
Observed vs expected rollout distribution divergence.
Good: < 1% | Warning: 1-3% | Critical: > 3%

sdk_bootstrap_duration_seconds (histogram)
Time for SDK to fetch initial config and become ready.
Good: p99 < 0.05 | Warning: p99 0.05-0.2 | Critical: p99 > 0.2

sdk_stream_connection_success_rate (gauge)
Percentage of SDK streaming connections established successfully.
Good: > 99.9% | Warning: 99.0-99.9% | Critical: < 99.0%

stale_flag_backlog_total (gauge)
Count of stale flags awaiting cleanup workflow.
Good: < 5% of active flags | Warning: 5-12% | Critical: > 12%
Alert Rules
HighEvaluationLatency (warning)

Flag evaluation p99 latency exceeds 5ms target

histogram_quantile(0.99, rate(flag_evaluation_duration_seconds_bucket[5m])) > 0.005

Runbook: Check edge cache hit rate, rule complexity, segment sizes

KillSwitchPropagationSlow (critical)

Kill switch took more than 10 seconds to propagate

histogram_quantile(0.99, rate(flag_kill_switch_propagation_seconds_bucket[5m])) > 10

Runbook: Check Kafka consumer lag, SSE connections, edge node health

ConfigPropagationLag (warning)

Edge config is more than 30 seconds behind control plane

flag_config_propagation_lag_seconds > 30 for 5m

Runbook: Check Kafka consumer group, network connectivity, edge node logs

UpdateStreamBacklogCritical (critical)

Config update backlog risks kill switch SLA breach

kafka_consumer_group_lag > 100000 for 10m

Runbook: Scale consumers, rebalance partitions, and prioritize kill-switch topic

ExposureEventLoss (warning)

More than 1% of exposure events being dropped

rate(flag_exposure_events_dropped[5m]) / rate(flag_exposure_events_total[5m]) > 0.01

Runbook: Check event queue size, Kafka producer health, increase batch capacity

VariantSkewDetected (warning)

Observed rollout split diverges from expected distribution

flag_variant_distribution_skew > 0.03 for 15m

Runbook: Validate bucketing consistency across SDK versions and hash inputs

SDKStreamConnectionDrop (warning)

SDK streaming connectivity is degraded

sdk_stream_connection_success_rate < 0.99 for 10m

Runbook: Inspect SSE gateway health, connection limits, and token validation errors

Dashboard Layout

Evaluation Performance

flag_evaluation_duration_seconds, flag_evaluations_total, flag_cache_hit_rate

Config Propagation

flag_config_propagation_lag_seconds, flag_kill_switch_propagation_seconds, kafka_consumer_group_lag

SDK Health

sdk_bootstrap_duration_seconds, sdk_stream_connection_success_rate, sdk_errors_total

Experiment Tracking

flag_exposure_events_total, flag_exposure_events_dropped, flag_variant_distribution

Governance

stale_flag_backlog_total, flag_change_lead_time_seconds, approval_queue_depth

Scale Calculator

Estimate edge evaluator fleet size, config propagation capacity, and monthly spend with explicit compute, storage, and network components.

Configuration (selected values; slider ranges omitted)
  • 2.00M evals/sec (peak)
  • 500.00K flags
  • 800 bytes per flag config
  • 10.00K segments
  • 5.00K users per segment list
  • 50.00K config updates/day
  • 10 regions
  • 5.00M connected clients
  • 300 sec bootstrap refresh interval
  • 120 KB bootstrap payload
  • 10% exposure sampling
Calculated Results
Edge Cache Size
Memory required per edge node for flag + segment cache
1.34K MB
Total Edge Memory
Memory across all edge regions with 2x headroom
26.08 GB
Evaluator Instances
Edge evaluator instances at ~80K eval/sec each
250
Update Events/sec
Config propagation events per second
5.79 events/sec
Stream Consumers
Consumers needed for update fanout
0
Exposure Events/sec
Exposure events generated per second
200.00K events/sec
Exposure Storage/Day
Storage for exposure events (500 bytes each)
8.05K GB
SSE Connections
Server-Sent Events connections to maintain
5.00M connections
Bootstrap Bandwidth
Bandwidth for bootstrap refreshes
1.95K MB/s
Storage Cost
Exposure analytics + config snapshot storage
$5,552
Compute Cost
Evaluator fleet + stream consumers + regional control plane
$63,156
Network Cost
Bootstrap + exposure egress/replication transfer
$93,334
Estimated Monthly Cost
Compute + storage + network
$162,043
Cost per 1M Evaluations
Effective cost normalized by monthly evaluation volume
$0

* Estimates based on simplified AWS pricing. Actual costs may vary.

Cost & Capacity

Traffic Estimates
Peak Evaluations
10 flags/request x 200K requests/sec
2.0M/sec
Sustained Evaluations
25% of peak
500K/sec
SDK Bootstrap
~5.0M clients refreshing every 300s
16,667/sec
Config Updates
Cross-team product and experiment changes
50,000/day
Exposure Events
10% of evals for experiments
200K/sec
Kill Switches
Emergency disables
100/day
Storage Estimates
Flag Config
500K flags x 800 bytes
400 MB
Segment Data
10K segments x 500KB user lists
5 GB
Audit Logs
50K changes/day x 1KB x 30 days
1.5 GB/month
Exposure Logs
200K/sec x 500B x 86400s
8.6 TB/day
Edge Cache
All flags + segments compressed
1 GB/POP
Eval Stats
500K flags x 30 days x 3KB
50 GB/month

Test Your Understanding

Knowledge Check (6 items)
Test your understanding of feature flag systems with real-world scenarios and architectural decisions.

Failure Diagnosis (5 items)

Architecture Decisions

Summary & Takeaways

Key Takeaways
  1. Edge evaluation with in-memory cache is essential for sub-5ms latency; central evaluation adds 50-200ms of network latency
  2. Deterministic bucketing using MurmurHash3 ensures consistent user experience across SDKs, sessions, and devices
  3. Kill switches need a dedicated fast path; normal config propagation may suffer Kafka lag, so implement a direct database fallback
  4. Server SDKs evaluate locally with full rules; client SDKs receive pre-computed results to avoid exposing targeting logic
  5. Exposure event deduplication and sampling are critical: 2M evals/sec would generate over 170B events/day without controls
  6. Stale flag detection requires multiple signals: no evaluations, 100% rollout for 90+ days, no code references
  7. Cache TTL jitter prevents thundering herd: stagger expiration with 10-20% random variance
  8. Streaming updates (SSE/Kafka) reduce propagation latency from ~30s (polling) to <5s for config changes
If I Had More Time
  • Implement automated rollback based on error rate correlation: detect a 5xx spike after a flag change
  • Build a flag dependency graph: show which flags depend on others and prevent circular references
  • Add canary analysis integration: automatic percentage ramp based on metric health
  • Implement approval workflows with required reviewers for production flag changes
  • Build a real-time A/B test dashboard with statistical significance calculations
  • Add cross-team flag impact analysis: notify teams when a shared flag changes
  • Implement gradual rollout automation: automatically increase percentage over time if metrics stay healthy
  • Build an SDK conformance test suite that validates bucketing consistency across all languages