Feature Flag Evaluation Engine at Scale
Design a low-latency feature flag evaluation system with targeting rules, percentage rollouts, A/B testing integration, kill switches, and multi-region consistency.
What You'll Learn
- Deterministic bucketing using MurmurHash3 for a consistent user experience across SDKs and sessions
- Edge evaluation architecture with streaming config updates for sub-5ms latency at global scale
- Targeting rule engine design with segments, attributes, and boolean logic (AND/OR/NOT)
- Kill switch implementation with sub-5s global propagation using Kafka and SSE
- A/B testing integration with exposure event tracking and experiment analysis
- SDK design patterns: server-side full evaluation vs. client-side pre-computed flags
- Stale flag detection using evaluation statistics and automated cleanup workflows
- Multi-region consistency vs. availability trade-offs in feature flag systems
Quick Context
Feature flags have become critical infrastructure enabling trunk-based development, gradual rollouts, instant rollbacks, and experimentation without code deployments. The system must evaluate millions of flags per second with sub-5ms latency, maintain consistent user bucketing across all SDKs (server, mobile, web), propagate kill switches globally in under 5 seconds, and remain resilient during regional failures. Key challenges include: designing a targeting rule engine that supports complex boolean conditions, implementing deterministic bucketing that produces identical results across all SDK implementations, building an edge evaluation layer that scales horizontally while maintaining cache consistency, and integrating with A/B testing platforms for proper exposure tracking. Success metrics: p99 evaluation latency <5ms, kill switch propagation <5s, zero inconsistent bucket assignments across SDKs, and 99.99% evaluation availability.
- Support 2,000,000 evaluations/sec peak with p99 <5ms local evaluation latency
- Propagate kill switches to all edge nodes and SDKs globally within 5 seconds
- Guarantee identical bucketing results across all SDK implementations (Java, Go, Python, Node, iOS, Android, JS)
- Handle 500,000 active flags with 50,000 config updates per day
Requirements
Evaluate boolean, multivariate (string/number), and JSON flags with configurable default values
Why it matters: Core capability - flags control feature access for all users
Support user segments, attribute-based targeting (country, plan, version), and complex boolean conditions (AND/OR/NOT)
Why it matters: Enables precise control over which users see which features
Deterministic percentage-based rollouts using stable hashing - users always see the same variant
Why it matters: Gradual rollouts require consistent user experience across sessions
Immediate global disable that overrides all targeting rules with propagation under 5s
Why it matters: Critical for incident response - must be able to turn off broken features instantly
Emit exposure events with user, flag, variant, and timestamp for experiment analysis
Why it matters: Product teams need accurate data to measure feature impact
Server SDKs (Java, Go, Python, Node), mobile SDKs (iOS, Android), and client-side JS SDK with offline support
Why it matters: Must work across all platforms with consistent behavior
Identify flags with no evaluations in 30+ days or 100% rollout for 90+ days
Why it matters: Technical debt from unused flags creates maintenance burden and confusion
Track all changes with actor, timestamp, previous/new state, and approval workflow integration
Why it matters: Compliance and debugging require knowing who changed what and when
Support time-based activation/deactivation and gradual percentage increases
Why it matters: Coordinate releases with marketing launches and reduce manual operations
Architecture Evolution
Single-region flag API with in-memory cache and polling SDKs. Handles 100-1,000 users, 10K evals/sec, at $100-300/month. Good for startups validating product-market fit.
What Changed & Why
- Single flag service with a PostgreSQL backend for flag storage
- In-memory cache (Caffeine/Guava) with a 30-second TTL for flag configs
- SDKs poll the flag service every 30 seconds for config updates
- Simple percentage rollout using user ID hash modulo 100
- Basic boolean and string flag types only
- Synchronous audit log writes to the same PostgreSQL database
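The v1 percentage rollout (user ID hash modulo 100) can be sketched as below. One subtlety worth calling out: the hash must be stable across processes, so Python's built-in `hash()` (randomized per process) cannot be used. SHA-256 here is illustrative; any process-stable digest works at this stage.

```python
import hashlib

def in_rollout(user_id: str, flag_key: str, percentage: int) -> bool:
    """v1 rollout check: stable hash of flag+user, modulo 100.

    A 25% rollout serves users whose bucket lands in [0, 25).
    SHA-256 is an illustrative stand-in for the service's hash.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percentage
```

Because the digest depends only on the inputs, a given user gets the same answer for a given flag across sessions and servers. The later architecture replaces this with MurmurHash3 and a finer-grained bucket space.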
Key Decisions
- Edge evaluation with local rule cache at regional POPs
- Streaming updates via Kafka with SSE to SDKs
- MurmurHash3 (32-bit) with modulo 100,000 for 0.001% precision
- JSON rules in PostgreSQL JSONB with in-memory rule engine
- Async append-only log with Kafka + data warehouse
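The streaming-update decision (Kafka fan-out, SSE to SDKs) can be sketched on the SDK side as a small frame parser plus an apply step. This is a sketch, not the wire protocol: the event names mirror the webhook payloads in the API section, and the assumption that `flag.updated` carries the full flag config is illustrative.

```python
import json

def parse_sse(stream_text: str) -> list[tuple[str, dict]]:
    """Parse Server-Sent Events frames into (event, payload) pairs.

    Frames are separated by a blank line; 'event:' names the event
    type and 'data:' lines carry the JSON payload.
    """
    events = []
    for frame in stream_text.split("\n\n"):
        event_type, data_lines = "message", []
        for line in frame.split("\n"):
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            events.append((event_type, json.loads("\n".join(data_lines))))
    return events

def apply_update(store: dict, event_type: str, payload: dict) -> None:
    """Apply a streamed update to the SDK's in-memory flag store."""
    key = payload["flag"]["key"]
    if event_type == "flag.kill_switch":
        store.setdefault(key, {})["kill_switch_enabled"] = payload["killSwitchEnabled"]
    elif event_type == "flag.updated":
        # Assumes the stream carries the full flag config, not a delta.
        store[key] = payload["flag"]
```

Because kill switch events only touch one boolean, they can be applied even for flags the SDK has not yet bootstrapped, which is what makes the sub-5s propagation path viable.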
API Design
The Feature Flag API provides management endpoints, SDK bootstrap with ETag support, streaming config updates, and server-side evaluation for untrusted clients.
Base URL
https://api.featureflags.io/v1
Authentication
Server SDKs use SDK keys (sdk-server-xxx). Client SDKs use client-side keys (sdk-client-xxx) which have limited permissions. Management API uses OAuth2 Bearer tokens.
Endpoints
/sdk/flags/{projectKey}
/sdk/stream/{projectKey}
/sdk/evaluate/{projectKey}/{flagKey}
Webhooks
Fired when any flag configuration changes
Payload
{
"event": "flag.updated",
"flag": {"id": "flag-abc", "key": "new-feature"},
"changes": ["enabled", "rules"],
"actor": {"id": "user-123", "email": "dev@example.com"},
"timestamp": "2024-01-15T14:30:00Z"
}
Fired when a kill switch is activated or deactivated
Payload
{
"event": "flag.kill_switch",
"flag": {"id": "flag-abc", "key": "broken-feature"},
"killSwitchEnabled": true,
"reason": "Production incident",
"actor": {"id": "user-123", "email": "oncall@example.com"},
"timestamp": "2024-01-15T14:30:00Z"
}
Code Samples
MurmurHash3-based bucketing algorithm that produces consistent results across all SDK implementations.
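A pure-Python sketch of that algorithm: MurmurHash3 (x86, 32-bit) followed by modulo 100,000, matching the bucketing decision above. Production SDKs would use a native MurmurHash3 port, and the exact `user.flag.salt` input format is an assumption; the point is that every SDK must hash byte-identical input.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit. Every SDK must ship a bit-identical port."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data) & ~3
    for i in range(0, n, 4):  # body: 4-byte little-endian chunks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl 13
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    k = 0
    tail = data[n:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)  # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def bucket(user_key: str, flag_key: str, salt: str) -> int:
    """Deterministic bucket in [0, 100000); modulo 100,000 gives the
    0.001% rollout precision named in the key decisions."""
    return murmur3_32(f"{user_key}.{flag_key}.{salt}".encode("utf-8")) % 100_000

def in_rollout(user_key: str, flag_key: str, salt: str, percent: float) -> bool:
    """A 25% rollout serves users whose bucket falls below 25,000."""
    return bucket(user_key, flag_key, salt) < percent * 1_000
```

Including the per-flag salt in the hash input means changing the salt re-randomizes assignments for that flag alone, which is why the schema stores it per flag.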
Identify flags that are candidates for cleanup based on evaluation patterns.
Data Model & Queries
-- Core schema for feature flag engine (PostgreSQL)
-- Projects organize flags by team/application
CREATE TABLE projects (
project_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
key VARCHAR(50) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
description TEXT,
settings JSONB DEFAULT '{}',
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);
-- Flag definitions with versioning
CREATE TABLE flags (
flag_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID NOT NULL REFERENCES projects(project_id),
key VARCHAR(100) NOT NULL,
name VARCHAR(200) NOT NULL,
description TEXT,
flag_type VARCHAR(20) NOT NULL CHECK (flag_type IN ('boolean', 'string', 'number', 'json')),
-- State
enabled BOOLEAN NOT NULL DEFAULT false,
kill_switch_enabled BOOLEAN NOT NULL DEFAULT false,
archived BOOLEAN NOT NULL DEFAULT false,
-- Variants (e.g., [{key: "on", value: true}, {key: "off", value: false}])
variants JSONB NOT NULL,
off_variant VARCHAR(50) NOT NULL,
-- Default rule when no targeting rules match
default_rule JSONB NOT NULL,
-- Bucketing salt (change to re-randomize experiment assignments)
salt VARCHAR(50) NOT NULL DEFAULT gen_random_uuid()::text,
-- Versioning
version BIGINT NOT NULL DEFAULT 1,
config_version VARCHAR(50) NOT NULL DEFAULT '1',
-- Metadata
tags TEXT[] DEFAULT '{}',
created_by UUID NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE(project_id, key)
);
-- Targeting rules evaluated in priority order
CREATE TABLE flag_rules (
rule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
flag_id UUID NOT NULL REFERENCES flags(flag_id) ON DELETE CASCADE,
priority INT NOT NULL,
name VARCHAR(100),
description TEXT,
-- Conditions expression tree (supports nested AND/OR/NOT)
-- Example:
-- {"op":"AND","clauses":[{"attribute":"country","op":"in","values":["US","CA"]},{"op":"NOT","clause":{"attribute":"app_version","op":"sem_ver_lt","value":"2.3.0"}}]}
conditions JSONB NOT NULL,
-- Result when matched
variation_key VARCHAR(50), -- Serve specific variant
rollout JSONB, -- Or percentage rollout
enabled BOOLEAN NOT NULL DEFAULT true,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE(flag_id, priority)
);
-- Reusable user segments
CREATE TABLE segments (
segment_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID NOT NULL REFERENCES projects(project_id),
key VARCHAR(100) NOT NULL,
name VARCHAR(200) NOT NULL,
description TEXT,
-- Rule-based membership
rules JSONB NOT NULL DEFAULT '[]',
-- Explicit user lists (for small segments, testing)
included_users TEXT[] DEFAULT '{}',
excluded_users TEXT[] DEFAULT '{}',
-- For large user lists, use external reference
user_list_url TEXT, -- S3 URL for large lists
user_count_approx INT DEFAULT 0,
version BIGINT NOT NULL DEFAULT 1,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE(project_id, key)
);
-- Audit log (append-only, partitioned by month)
CREATE TABLE flag_audit_log (
audit_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
flag_id UUID NOT NULL,
project_id UUID NOT NULL,
-- Actor
actor_id UUID NOT NULL,
actor_type VARCHAR(20) NOT NULL CHECK (actor_type IN ('user', 'api_key', 'system', 'scheduled')),
actor_email VARCHAR(255),
-- Action
action VARCHAR(50) NOT NULL,
-- Change details
previous_state JSONB,
new_state JSONB,
-- Context
ip_address INET,
user_agent TEXT,
approval_id UUID,
comment TEXT,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE flag_audit_log_2024_01 PARTITION OF flag_audit_log
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- Scheduled flag changes
CREATE TABLE scheduled_changes (
schedule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
flag_id UUID NOT NULL REFERENCES flags(flag_id),
project_id UUID NOT NULL,
change_type VARCHAR(30) NOT NULL,
scheduled_at TIMESTAMP NOT NULL,
change_payload JSONB NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending', 'executed', 'cancelled', 'failed')),
created_by UUID NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
executed_at TIMESTAMP,
execution_error TEXT
);
-- Flag evaluation statistics for stale detection
CREATE TABLE flag_evaluation_stats (
flag_id UUID NOT NULL REFERENCES flags(flag_id),
date DATE NOT NULL,
evaluation_count BIGINT NOT NULL DEFAULT 0,
unique_users_hll BYTEA, -- HyperLogLog for unique user approximation
-- Variant distribution
variant_counts JSONB DEFAULT '{}',
-- Timing
first_eval_at TIMESTAMP,
last_eval_at TIMESTAMP,
PRIMARY KEY (flag_id, date)
);
-- Indexes for common queries
CREATE INDEX idx_flags_project_key ON flags(project_id, key);
CREATE INDEX idx_flags_updated ON flags(updated_at) WHERE NOT archived;
CREATE INDEX idx_flags_stale ON flags(updated_at, enabled) WHERE NOT archived;
CREATE INDEX idx_flag_rules_flag ON flag_rules(flag_id, priority);
CREATE INDEX idx_segments_project ON segments(project_id);
CREATE INDEX idx_audit_flag ON flag_audit_log(flag_id, created_at DESC);
CREATE INDEX idx_audit_project ON flag_audit_log(project_id, created_at DESC);
CREATE INDEX idx_scheduled_pending ON scheduled_changes(scheduled_at)
WHERE status = 'pending';
CREATE INDEX idx_eval_stats_flag ON flag_evaluation_stats(flag_id, date DESC);
-- Redis key schema (documented for SDK/edge reference)
-- flag:{project_key}:{flag_key} => JSON flag config
-- segment:{project_key}:{segment_key} => JSON segment definition
-- flag_version:{project_key} => latest config version hash
-- kill_switch:{project_key}:{flag_key} => "1" if active
-- user_segment:{segment_id}:{user_id} => "1" if user in segment (bloom filter)
Design notes:
- The flags table stores the complete flag configuration, with versioning for optimistic locking
- JSONB for variants and rules enables flexible schema evolution without migrations
- A separate flag_rules table allows complex rule ordering via the priority column
- Segments as first-class entities enable reuse across multiple flags
- The audit log is partitioned by month for efficient retention management and fast recent queries
- The evaluation stats table enables stale flag detection, with HyperLogLog for unique users
- The scheduled_changes table supports time-based rollouts and automated operations
- Redis keys are documented for SDK and edge evaluator reference
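The conditions expression tree stored in `flag_rules.conditions` (nested AND/OR/NOT, see the schema comment's example) can be evaluated with a small recursive walker. This sketch covers only the operators shown in that example plus an assumed `eq`; the missing-attribute semantics (never match) are also an assumption.

```python
def eval_conditions(node: dict, user: dict) -> bool:
    """Recursively evaluate a flag_rules.conditions expression tree."""
    op = node["op"]
    if op == "AND":
        return all(eval_conditions(c, user) for c in node["clauses"])
    if op == "OR":
        return any(eval_conditions(c, user) for c in node["clauses"])
    if op == "NOT":
        return not eval_conditions(node["clause"], user)
    # Leaf clause: compare a user attribute against the rule.
    value = user.get(node["attribute"])
    if value is None:
        return False  # assumed semantics: missing attributes never match
    if op == "in":
        return value in node["values"]
    if op == "eq":
        return value == node["value"]
    if op == "sem_ver_lt":
        return _semver(value) < _semver(node["value"])
    raise ValueError(f"unknown operator: {op}")

def _semver(v: str) -> tuple[int, ...]:
    """Naive version compare key; a real SDK needs pre-release handling."""
    return tuple(int(p) for p in v.split("."))
```

Because the tree is plain JSON, the same rule definitions can be shipped unchanged to edge evaluators and server SDKs, each of which implements this walker natively.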
Fetch all flags for a project (SDK bootstrap)
SELECT f.*, array_agg(r.* ORDER BY r.priority) AS rules
FROM flags f LEFT JOIN flag_rules r ON f.flag_id = r.flag_id
WHERE f.project_id = $1 AND NOT f.archived
GROUP BY f.flag_id;
Get flag with rules by key
SELECT f.*, json_agg(r ORDER BY r.priority) AS rules
FROM flags f LEFT JOIN flag_rules r ON f.flag_id = r.flag_id
WHERE f.project_id = $1 AND f.key = $2
GROUP BY f.flag_id;
Find stale flags (no evals in 30 days)
SELECT f.* FROM flags f
WHERE f.enabled = true AND NOT f.archived
  AND NOT EXISTS (
    SELECT 1 FROM flag_evaluation_stats s
    WHERE s.flag_id = f.flag_id
      AND s.date > CURRENT_DATE - INTERVAL '30 days'
      AND s.evaluation_count > 0);
Recent audit history for a flag
SELECT * FROM flag_audit_log WHERE flag_id = $1 ORDER BY created_at DESC LIMIT 50;
Redis: Get flag config
GET flag:{project_key}:{flag_key}
Redis: Check kill switch
EXISTS kill_switch:{project_key}:{flag_key}
Redis: Invalidate project config
DEL flag_version:{project_key}
Index usage:
- idx_flags_project_key: fast flag lookup by project and key (most common query)
- idx_flags_updated: find recently changed flags for cache invalidation
- idx_flags_stale: query for the stale flag detection job
- idx_flag_rules_flag: load rules in priority order for evaluation
- idx_audit_flag: show recent changes for a specific flag
- idx_scheduled_pending: find due scheduled changes for the executor job
- idx_eval_stats_flag: fetch evaluation history for stale detection
Scaling & Bottlenecks
Primary Bottleneck
Cache miss rate on edge evaluators causing origin database load spikes
Mitigation
Implement cache warming and increase TTL for stable flags
What You'd Change
- Preload hot flags (top 1,000 by eval count) on edge startup
- Stagger cache TTL expiration to prevent thundering herd (10-20% jitter)
- Add an L2 cache tier in Redis for warm flags (30s in-process, 5m in Redis)
- Implement background refresh: update the cache before the TTL expires
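The jitter and background-refresh mitigations above can be combined in one cache structure. This is a sketch under stated assumptions: the 10-20% jitter and ~80% refresh threshold come from the bullets above, while the class shape and the caller-supplied clock are illustrative.

```python
import random

class JitteredCache:
    """In-process flag cache with jittered TTLs and early refresh signal.

    Expiration is staggered by a random 10-20% so edge nodes don't all
    hit the origin at once; `get` reports when an entry has crossed 80%
    of its lifetime so a background task can refresh it before expiry.
    """
    def __init__(self, base_ttl: float = 30.0):
        self.base_ttl = base_ttl
        self._entries: dict[str, tuple[object, float, float]] = {}

    def put(self, key: str, value: object, now: float) -> None:
        ttl = self.base_ttl * (1 + random.uniform(0.10, 0.20))
        self._entries[key] = (value, now + ttl, now + 0.8 * ttl)

    def get(self, key: str, now: float):
        """Return (value, needs_refresh); (None, True) if missing/expired."""
        entry = self._entries.get(key)
        if entry is None or now >= entry[1]:
            return None, True
        value, _, refresh_at = entry
        return value, now >= refresh_at
```

Refreshing at 80% of lifetime means steady-state traffic almost never sees a hard miss: the entry is replaced while the old copy is still being served.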
Primary Bottleneck
Config update fanout causing Kafka consumer lag >5s during burst updates
Mitigation
Partition update stream by project and add regional consumers
What You'd Change
- Increase Kafka partitions from 16 to 64, keyed by project_id
- Deploy consumer groups per region for parallel processing
- Implement debouncing: batch rapid updates (wait 100ms) before publishing
- Add a priority queue for kill switch updates that bypasses normal processing
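The debouncing and priority-bypass bullets above can be sketched together. This is an illustrative shape, not the actual pipeline: the `publish` callback, the merge-by-dict semantics, and the `kill_switch` marker field are all assumptions; the 100ms window comes from the mitigation above.

```python
class UpdateDebouncer:
    """Coalesce rapid config updates per flag before publishing.

    Updates arriving within `window` seconds of the first pending update
    for a flag are merged into one message; kill switch changes skip the
    window entirely (the priority path described above).
    """
    def __init__(self, publish, window: float = 0.1):
        self.publish = publish
        self.window = window
        self._pending: dict[str, tuple[float, dict]] = {}

    def offer(self, flag_key: str, update: dict, now: float) -> None:
        if update.get("kill_switch"):
            self.publish(flag_key, update)  # fast path, never batched
            return
        first_seen, merged = self._pending.get(flag_key, (now, {}))
        merged.update(update)
        self._pending[flag_key] = (first_seen, merged)

    def flush(self, now: float) -> None:
        """Call periodically; publishes batches older than the window."""
        for key in list(self._pending):
            first_seen, merged = self._pending[key]
            if now - first_seen >= self.window:
                self.publish(key, merged)
                del self._pending[key]
```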
Primary Bottleneck
Exposure event ingestion overwhelming analytics pipeline at 2M events/sec
Mitigation
Implement client-side sampling and async batching with backpressure
What You'd Change
- Add a configurable sampling rate per flag (100% for experiments, 1% for monitoring)
- Batch exposure events in the SDK (flush every 1s or 100 events)
- Implement backpressure: drop events if the queue exceeds a threshold, with a metric for drops
- Use columnar storage (ClickHouse/Druid) for exposure analytics
- Add a client-side deduplication window (don't re-emit the same user+flag pair within 1 hour)
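The batching and deduplication bullets above can be sketched as an SDK-side buffer. The thresholds (100 events, 1s flush, 1h dedup window) come from the bullets; the `send` callback and event field names are illustrative assumptions.

```python
class ExposureBuffer:
    """SDK-side exposure batching with a deduplication window.

    Events flush every `flush_interval` seconds or `max_batch` events,
    and the same (user, flag) pair is suppressed within `dedup_ttl`
    seconds of its last emission.
    """
    def __init__(self, send, max_batch=100, flush_interval=1.0, dedup_ttl=3600.0):
        self.send = send
        self.max_batch = max_batch
        self.flush_interval = flush_interval
        self.dedup_ttl = dedup_ttl
        self._batch: list[dict] = []
        self._last_flush = 0.0
        self._seen: dict[tuple[str, str], float] = {}

    def record(self, user_id: str, flag_key: str, variant: str, now: float) -> None:
        key = (user_id, flag_key)
        if now - self._seen.get(key, float("-inf")) < self.dedup_ttl:
            return  # already emitted for this user+flag within the window
        self._seen[key] = now
        self._batch.append({"user": user_id, "flag": flag_key,
                            "variant": variant, "ts": now})
        if len(self._batch) >= self.max_batch or now - self._last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now: float) -> None:
        if self._batch:
            self.send(self._batch)
            self._batch = []
        self._last_flush = now
```

A production buffer would also cap the queue and increment the drop counter under backpressure; that path is omitted here for brevity.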
Monitoring & Observability
Feature flag monitoring should simultaneously track serving SLOs, config propagation integrity, SDK health, and experiment data quality.
- flag_evaluation_duration_seconds (histogram): time to evaluate a single flag including rule matching
- flag_evaluations_total (counter): total evaluation throughput across all edge regions
- flag_config_propagation_lag_seconds (gauge): time between control plane write and edge cache update
- flag_kill_switch_propagation_seconds (histogram): time for a kill switch to reach all edge nodes
- kafka_consumer_group_lag (gauge): pending update events in propagation consumer groups
- flag_cache_hit_rate (gauge): percentage of evaluations served from cache
- flag_exposure_events_dropped (counter): exposure events dropped due to backpressure
- flag_variant_distribution_skew (gauge): divergence between observed and expected rollout distribution
- sdk_bootstrap_duration_seconds (histogram): time for an SDK to fetch initial config and become ready
- sdk_stream_connection_success_rate (gauge): percentage of SDK streaming connections established successfully
- stale_flag_backlog_total (gauge): count of stale flags awaiting the cleanup workflow
Flag evaluation p99 latency exceeds 5ms target
histogram_quantile(0.99, rate(flag_evaluation_duration_seconds_bucket[5m])) > 0.005
Runbook: check edge cache hit rate, rule complexity, segment sizes
Kill switch took more than 10 seconds to propagate
histogram_quantile(0.99, rate(flag_kill_switch_propagation_seconds_bucket[5m])) > 10
Runbook: check Kafka consumer lag, SSE connections, edge node health
Edge config is more than 30 seconds behind the control plane
flag_config_propagation_lag_seconds > 30 for 5m
Runbook: check the Kafka consumer group, network connectivity, edge node logs
Config update backlog risks a kill switch SLA breach
kafka_consumer_group_lag > 100000 for 10m
Runbook: scale consumers, rebalance partitions, and prioritize the kill-switch topic
More than 1% of exposure events are being dropped
rate(flag_exposure_events_dropped[5m]) / rate(flag_exposure_events_total[5m]) > 0.01
Runbook: check event queue size and Kafka producer health; increase batch capacity
Observed rollout split diverges from the expected distribution
flag_variant_distribution_skew > 0.03 for 15m
Runbook: validate bucketing consistency across SDK versions and hash inputs
SDK streaming connectivity is degraded
sdk_stream_connection_success_rate < 0.99 for 10m
Runbook: inspect SSE gateway health, connection limits, and token validation errors
Summary & Takeaways
1. Edge evaluation with an in-memory cache is essential for sub-5ms latency; central evaluation adds 50-200ms of network latency
2. Deterministic bucketing using MurmurHash3 ensures a consistent user experience across SDKs, sessions, and devices
3. Kill switches need a dedicated fast path: normal config propagation may suffer Kafka lag, so implement a direct database fallback
4. Server SDKs evaluate locally with full rules; client SDKs receive pre-computed results to avoid exposing targeting logic
5. Exposure event deduplication and sampling are critical: 2M evals/sec would generate 170B events/day without controls
6. Stale flag detection requires multiple signals: no evaluations, 100% rollout for 90+ days, no code references
7. Cache TTL jitter prevents thundering herd: stagger expiration with 10-20% random variance
8. Streaming updates (SSE/Kafka) reduce propagation latency from 30s (polling) to <5s for config changes
- Implement automated rollback based on error-rate correlation: detect a 5xx spike after a flag change
- Build a flag dependency graph: show which flags depend on others and prevent circular references
- Add canary analysis integration: automatic percentage ramp based on metric health
- Implement approval workflows with required reviewers for production flag changes
- Build a real-time A/B test dashboard with statistical significance calculations
- Add cross-team flag impact analysis: notify teams when a shared flag changes
- Implement gradual rollout automation: automatically increase the percentage over time if metrics stay healthy
- Build an SDK conformance test suite that validates bucketing consistency across all languages