Feature Flag Evaluation Engine at Scale
Design a low-latency feature flag evaluation system with targeting rules, percentage rollouts, A/B testing integration, kill switches, and multi-region consistency.
What You'll Learn
- Deterministic bucketing using MurmurHash3 for a consistent user experience across SDKs and sessions
- Edge evaluation architecture with streaming config updates for sub-5ms latency at global scale
- Targeting rule engine design with segments, attributes, and boolean logic (AND/OR/NOT)
- Kill switch implementation with sub-5s global propagation using Kafka and SSE
- A/B testing integration with exposure event tracking and experiment analysis
- SDK design patterns: server-side full evaluation vs. client-side pre-computed flags
- Stale flag detection using evaluation statistics and automated cleanup workflows
- Multi-region consistency vs. availability trade-offs in feature flag systems
Quick Context
Feature flags have become critical infrastructure enabling trunk-based development, gradual rollouts, instant rollbacks, and experimentation without code deployments. The system must evaluate millions of flags per second with sub-5ms latency, maintain consistent user bucketing across all SDKs (server, mobile, web), propagate kill switches globally in under 5 seconds, and remain resilient during regional failures. Key challenges include: designing a targeting rule engine that supports complex boolean conditions, implementing deterministic bucketing that produces identical results across all SDK implementations, building an edge evaluation layer that scales horizontally while maintaining cache consistency, and integrating with A/B testing platforms for proper exposure tracking. Success metrics: p99 evaluation latency <5ms, kill switch propagation <5s, zero inconsistent bucket assignments across SDKs, and 99.99% evaluation availability.
- Support 2,000,000 evaluations/sec peak with p99 <5ms local evaluation latency
- Propagate kill switches to all edge nodes and SDKs globally within 5 seconds
- Guarantee identical bucketing results across all SDK implementations (Java, Go, Python, Node, iOS, Android, JS)
- Handle 500,000 active flags with 50,000 config updates per day
Requirements
Evaluate boolean, multivariate (string/number), and JSON flags with configurable default values
Why it matters: Core capability - flags control feature access for all users
Support user segments, attribute-based targeting (country, plan, version), and complex boolean conditions (AND/OR/NOT)
Why it matters: Enables precise control over which users see which features
Deterministic percentage-based rollouts using stable hashing - users always see the same variant
Why it matters: Gradual rollouts require consistent user experience across sessions
Immediate global disable that overrides all targeting rules with propagation under 5s
Why it matters: Critical for incident response - must be able to turn off broken features instantly
Emit exposure events with user, flag, variant, and timestamp for experiment analysis
Why it matters: Product teams need accurate data to measure feature impact
Server SDKs (Java, Go, Python, Node), mobile SDKs (iOS, Android), and client-side JS SDK with offline support
Why it matters: Must work across all platforms with consistent behavior
Identify flags with no evaluations in 30+ days or 100% rollout for 90+ days
Why it matters: Technical debt from unused flags creates maintenance burden and confusion
Track all changes with actor, timestamp, previous/new state, and approval workflow integration
Why it matters: Compliance and debugging require knowing who changed what and when
Support time-based activation/deactivation and gradual percentage increases
Why it matters: Coordinate releases with marketing launches and reduce manual operations
Architecture Evolution
Single-region flag API with in-memory cache and polling SDKs. Handles 100-1,000 users, 10K evals/sec, at $100-300/month. Good for startups validating product-market fit.
What Changed & Why
- Single flag service with a PostgreSQL backend for flag storage
- In-memory cache (Caffeine/Guava) with a 30-second TTL for flag configs
- SDKs poll the flag service every 30 seconds for config updates
- Simple percentage rollout using user ID hash modulo 100
- Basic boolean and string flag types only
- Synchronous audit log writes to the same PostgreSQL database
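The v1 percentage rollout (user ID hash modulo 100) can be sketched as below. One subtlety worth calling out: the hash must be stable across processes, so Python's built-in `hash()` (randomized per process) cannot be used. SHA-256 here is illustrative; any process-stable digest works at this stage.

```python
import hashlib

def in_rollout(user_id: str, flag_key: str, percentage: int) -> bool:
    """v1 rollout check: stable hash of flag+user, modulo 100.

    A 25% rollout serves users whose bucket lands in [0, 25).
    SHA-256 is an illustrative stand-in for the service's hash.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percentage
```

Because the digest depends only on the inputs, a given user gets the same answer for a given flag across sessions and servers. The later architecture replaces this with MurmurHash3 and a finer-grained bucket space.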
Key Decisions
- Edge evaluation with local rule cache at regional POPs
- Streaming updates via Kafka with SSE to SDKs
- MurmurHash3 (32-bit) with modulo 100,000 for 0.001% precision
- JSON rules in PostgreSQL JSONB with in-memory rule engine
- Async append-only log with Kafka + data warehouse
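The streaming-update decision (Kafka fan-out, SSE to SDKs) can be sketched on the SDK side as a small frame parser plus an apply step. This is a sketch, not the wire protocol: the event names mirror the webhook payloads in the API section, and the assumption that `flag.updated` carries the full flag config is illustrative.

```python
import json

def parse_sse(stream_text: str) -> list[tuple[str, dict]]:
    """Parse Server-Sent Events frames into (event, payload) pairs.

    Frames are separated by a blank line; 'event:' names the event
    type and 'data:' lines carry the JSON payload.
    """
    events = []
    for frame in stream_text.split("\n\n"):
        event_type, data_lines = "message", []
        for line in frame.split("\n"):
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            events.append((event_type, json.loads("\n".join(data_lines))))
    return events

def apply_update(store: dict, event_type: str, payload: dict) -> None:
    """Apply a streamed update to the SDK's in-memory flag store."""
    key = payload["flag"]["key"]
    if event_type == "flag.kill_switch":
        store.setdefault(key, {})["kill_switch_enabled"] = payload["killSwitchEnabled"]
    elif event_type == "flag.updated":
        # Assumes the stream carries the full flag config, not a delta.
        store[key] = payload["flag"]
```

Because kill switch events only touch one boolean, they can be applied even for flags the SDK has not yet bootstrapped, which is what makes the sub-5s propagation path viable.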
API Design
The Feature Flag API provides management endpoints, SDK bootstrap with ETag support, streaming config updates, and server-side evaluation for untrusted clients.
Base URL
https://api.featureflags.io/v1
Authentication
Server SDKs use SDK keys (sdk-server-xxx). Client SDKs use client-side keys (sdk-client-xxx) which have limited permissions. Management API uses OAuth2 Bearer tokens.
Endpoints
/sdk/flags/{projectKey}
/sdk/stream/{projectKey}
/sdk/evaluate/{projectKey}/{flagKey}
Webhooks
Fired when any flag configuration changes
Payload
{
"event": "flag.updated",
"flag": {"id": "flag-abc", "key": "new-feature"},
"changes": ["enabled", "rules"],
"actor": {"id": "user-123", "email": "dev@example.com"},
"timestamp": "2024-01-15T14:30:00Z"
}
Fired when a kill switch is activated or deactivated
Payload
{
"event": "flag.kill_switch",
"flag": {"id": "flag-abc", "key": "broken-feature"},
"killSwitchEnabled": true,
"reason": "Production incident",
"actor": {"id": "user-123", "email": "oncall@example.com"},
"timestamp": "2024-01-15T14:30:00Z"
}
Code Samples
MurmurHash3-based bucketing algorithm that produces consistent results across all SDK implementations.
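A pure-Python sketch of that algorithm: MurmurHash3 (x86, 32-bit) followed by modulo 100,000, matching the bucketing decision above. Production SDKs would use a native MurmurHash3 port, and the exact `user.flag.salt` input format is an assumption; the point is that every SDK must hash byte-identical input.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit. Every SDK must ship a bit-identical port."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data) & ~3
    for i in range(0, n, 4):  # body: 4-byte little-endian chunks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl 13
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    k = 0
    tail = data[n:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)  # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def bucket(user_key: str, flag_key: str, salt: str) -> int:
    """Deterministic bucket in [0, 100000); modulo 100,000 gives the
    0.001% rollout precision named in the key decisions."""
    return murmur3_32(f"{user_key}.{flag_key}.{salt}".encode("utf-8")) % 100_000

def in_rollout(user_key: str, flag_key: str, salt: str, percent: float) -> bool:
    """A 25% rollout serves users whose bucket falls below 25,000."""
    return bucket(user_key, flag_key, salt) < percent * 1_000
```

Including the per-flag salt in the hash input means changing the salt re-randomizes assignments for that flag alone, which is why the schema stores it per flag.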
Identify flags that are candidates for cleanup based on evaluation patterns.
Data Model & Queries
-- Core schema for feature flag engine (PostgreSQL)
-- Projects organize flags by team/application
CREATE TABLE projects (
project_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
key VARCHAR(50) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
description TEXT,
settings JSONB DEFAULT '{}',
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);
-- Flag definitions with versioning
CREATE TABLE flags (
flag_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID NOT NULL REFERENCES projects(project_id),
key VARCHAR(100) NOT NULL,
name VARCHAR(200) NOT NULL,
description TEXT,
flag_type VARCHAR(20) NOT NULL CHECK (flag_type IN ('boolean', 'string', 'number', 'json')),
-- State
enabled BOOLEAN NOT NULL DEFAULT false,
kill_switch_enabled BOOLEAN NOT NULL DEFAULT false,
archived BOOLEAN NOT NULL DEFAULT false,
-- Variants (e.g., [{key: "on", value: true}, {key: "off", value: false}])
variants JSONB NOT NULL,
off_variant VARCHAR(50) NOT NULL,
-- Default rule when no targeting rules match
default_rule JSONB NOT NULL,
-- Bucketing salt (change to re-randomize experiment assignments)
salt VARCHAR(50) NOT NULL DEFAULT gen_random_uuid()::text,
-- Versioning
version BIGINT NOT NULL DEFAULT 1,
config_version VARCHAR(50) NOT NULL DEFAULT '1',
-- Metadata
tags TEXT[] DEFAULT '{}',
created_by UUID NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE(project_id, key)
);
-- Targeting rules evaluated in priority order
CREATE TABLE flag_rules (
rule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
flag_id UUID NOT NULL REFERENCES flags(flag_id) ON DELETE CASCADE,
priority INT NOT NULL,
name VARCHAR(100),
description TEXT,
-- Conditions expression tree (supports nested AND/OR/NOT)
-- Example:
-- {"op":"AND","clauses":[{"attribute":"country","op":"in","values":["US","CA"]},{"op":"NOT","clause":{"attribute":"app_version","op":"sem_ver_lt","value":"2.3.0"}}]}
conditions JSONB NOT NULL,
-- Result when matched
variation_key VARCHAR(50), -- Serve specific variant
rollout JSONB, -- Or percentage rollout
enabled BOOLEAN NOT NULL DEFAULT true,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE(flag_id, priority)
);
-- Reusable user segments
CREATE TABLE segments (
segment_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID NOT NULL REFERENCES projects(project_id),
key VARCHAR(100) NOT NULL,
name VARCHAR(200) NOT NULL,
description TEXT,
-- Rule-based membership
rules JSONB NOT NULL DEFAULT '[]',
-- Explicit user lists (for small segments, testing)
included_users TEXT[] DEFAULT '{}',
excluded_users TEXT[] DEFAULT '{}',
-- For large user lists, use external reference
user_list_url TEXT, -- S3 URL for large lists
user_count_approx INT DEFAULT 0,
version BIGINT NOT NULL DEFAULT 1,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE(project_id, key)
);
-- Audit log (append-only, partitioned by month)
CREATE TABLE flag_audit_log (
audit_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
flag_id UUID NOT NULL,
project_id UUID NOT NULL,
-- Actor
actor_id UUID NOT NULL,
actor_type VARCHAR(20) NOT NULL CHECK (actor_type IN ('user', 'api_key', 'system', 'scheduled')),
actor_email VARCHAR(255),
-- Action
action VARCHAR(50) NOT NULL,
-- Change details
previous_state JSONB,
new_state JSONB,
-- Context
ip_address INET,
user_agent TEXT,
approval_id UUID,
comment TEXT,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE flag_audit_log_2024_01 PARTITION OF flag_audit_log
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- Scheduled flag changes
CREATE TABLE scheduled_changes (
schedule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
flag_id UUID NOT NULL REFERENCES flags(flag_id),
project_id UUID NOT NULL,
change_type VARCHAR(30) NOT NULL,
scheduled_at TIMESTAMP NOT NULL,
change_payload JSONB NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending', 'executed', 'cancelled', 'failed')),
created_by UUID NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
executed_at TIMESTAMP,
execution_error TEXT
);
-- Flag evaluation statistics for stale detection
CREATE TABLE flag_evaluation_stats (
flag_id UUID NOT NULL REFERENCES flags(flag_id),
date DATE NOT NULL,
evaluation_count BIGINT NOT NULL DEFAULT 0,
unique_users_hll BYTEA, -- HyperLogLog for unique user approximation
-- Variant distribution
variant_counts JSONB DEFAULT '{}',
-- Timing
first_eval_at TIMESTAMP,
last_eval_at TIMESTAMP,
PRIMARY KEY (flag_id, date)
);
-- Indexes for common queries
CREATE INDEX idx_flags_project_key ON flags(project_id, key);
CREATE INDEX idx_flags_updated ON flags(updated_at) WHERE NOT archived;
CREATE INDEX idx_flags_stale ON flags(updated_at, enabled) WHERE NOT archived;
CREATE INDEX idx_flag_rules_flag ON flag_rules(flag_id, priority);
CREATE INDEX idx_segments_project ON segments(project_id);
CREATE INDEX idx_audit_flag ON flag_audit_log(flag_id, created_at DESC);
CREATE INDEX idx_audit_project ON flag_audit_log(project_id, created_at DESC);
CREATE INDEX idx_scheduled_pending ON scheduled_changes(scheduled_at)
WHERE status = 'pending';
CREATE INDEX idx_eval_stats_flag ON flag_evaluation_stats(flag_id, date DESC);
-- Redis key schema (documented for SDK/edge reference)
-- flag:{project_key}:{flag_key} => JSON flag config
-- segment:{project_key}:{segment_key} => JSON segment definition
-- flag_version:{project_key} => latest config version hash
-- kill_switch:{project_key}:{flag_key} => "1" if active
-- user_segment:{segment_id}:{user_id} => "1" if user in segment (bloom filter)
Design notes:
- The flags table stores the complete flag configuration, with versioning for optimistic locking
- JSONB for variants and rules enables flexible schema evolution without migrations
- A separate flag_rules table allows complex rule ordering via the priority column
- Segments as first-class entities enable reuse across multiple flags
- The audit log is partitioned by month for efficient retention management and fast recent queries
- The evaluation stats table enables stale flag detection, with HyperLogLog for unique users
- The scheduled_changes table supports time-based rollouts and automated operations
- Redis keys are documented for SDK and edge evaluator reference
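The conditions expression tree stored in `flag_rules.conditions` (nested AND/OR/NOT, see the schema comment's example) can be evaluated with a small recursive walker. This sketch covers only the operators shown in that example plus an assumed `eq`; the missing-attribute semantics (never match) are also an assumption.

```python
def eval_conditions(node: dict, user: dict) -> bool:
    """Recursively evaluate a flag_rules.conditions expression tree."""
    op = node["op"]
    if op == "AND":
        return all(eval_conditions(c, user) for c in node["clauses"])
    if op == "OR":
        return any(eval_conditions(c, user) for c in node["clauses"])
    if op == "NOT":
        return not eval_conditions(node["clause"], user)
    # Leaf clause: compare a user attribute against the rule.
    value = user.get(node["attribute"])
    if value is None:
        return False  # assumed semantics: missing attributes never match
    if op == "in":
        return value in node["values"]
    if op == "eq":
        return value == node["value"]
    if op == "sem_ver_lt":
        return _semver(value) < _semver(node["value"])
    raise ValueError(f"unknown operator: {op}")

def _semver(v: str) -> tuple[int, ...]:
    """Naive version compare key; a real SDK needs pre-release handling."""
    return tuple(int(p) for p in v.split("."))
```

Because the tree is plain JSON, the same rule definitions can be shipped unchanged to edge evaluators and server SDKs, each of which implements this walker natively.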
Fetch all flags for a project (SDK bootstrap)
SELECT f.*, array_agg(r.* ORDER BY r.priority) AS rules
FROM flags f LEFT JOIN flag_rules r ON f.flag_id = r.flag_id
WHERE f.project_id = $1 AND NOT f.archived
GROUP BY f.flag_id;
Get flag with rules by key
SELECT f.*, json_agg(r ORDER BY r.priority) AS rules
FROM flags f LEFT JOIN flag_rules r ON f.flag_id = r.flag_id
WHERE f.project_id = $1 AND f.key = $2
GROUP BY f.flag_id;
Find stale flags (no evals in 30 days)
SELECT f.* FROM flags f
WHERE f.enabled = true AND NOT f.archived
  AND NOT EXISTS (
    SELECT 1 FROM flag_evaluation_stats s
    WHERE s.flag_id = f.flag_id
      AND s.date > CURRENT_DATE - INTERVAL '30 days'
      AND s.evaluation_count > 0);
Recent audit history for a flag
SELECT * FROM flag_audit_log WHERE flag_id = $1 ORDER BY created_at DESC LIMIT 50;
Redis: Get flag config
GET flag:{project_key}:{flag_key}
Redis: Check kill switch
EXISTS kill_switch:{project_key}:{flag_key}
Redis: Invalidate project config
DEL flag_version:{project_key}
Index usage:
- idx_flags_project_key: fast flag lookup by project and key (most common query)
- idx_flags_updated: find recently changed flags for cache invalidation
- idx_flags_stale: query for the stale flag detection job
- idx_flag_rules_flag: load rules in priority order for evaluation
- idx_audit_flag: show recent changes for a specific flag
- idx_scheduled_pending: find due scheduled changes for the executor job
- idx_eval_stats_flag: fetch evaluation history for stale detection
Scaling & Bottlenecks
Primary Bottleneck
Cache miss rate on edge evaluators causing origin database load spikes
Mitigation
Implement cache warming and increase TTL for stable flags
What You'd Change
- Preload hot flags (top 1,000 by eval count) on edge startup
- Stagger cache TTL expiration to prevent thundering herd (10-20% jitter)
- Add an L2 cache tier in Redis for warm flags (30s in-process, 5m in Redis)
- Implement background refresh: update the cache before the TTL expires
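The jitter and background-refresh mitigations above can be combined in one cache structure. This is a sketch under stated assumptions: the 10-20% jitter and ~80% refresh threshold come from the bullets above, while the class shape and the caller-supplied clock are illustrative.

```python
import random

class JitteredCache:
    """In-process flag cache with jittered TTLs and early refresh signal.

    Expiration is staggered by a random 10-20% so edge nodes don't all
    hit the origin at once; `get` reports when an entry has crossed 80%
    of its lifetime so a background task can refresh it before expiry.
    """
    def __init__(self, base_ttl: float = 30.0):
        self.base_ttl = base_ttl
        self._entries: dict[str, tuple[object, float, float]] = {}

    def put(self, key: str, value: object, now: float) -> None:
        ttl = self.base_ttl * (1 + random.uniform(0.10, 0.20))
        self._entries[key] = (value, now + ttl, now + 0.8 * ttl)

    def get(self, key: str, now: float):
        """Return (value, needs_refresh); (None, True) if missing/expired."""
        entry = self._entries.get(key)
        if entry is None or now >= entry[1]:
            return None, True
        value, _, refresh_at = entry
        return value, now >= refresh_at
```

Refreshing at 80% of lifetime means steady-state traffic almost never sees a hard miss: the entry is replaced while the old copy is still being served.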
Primary Bottleneck
Config update fanout causing Kafka consumer lag >5s during burst updates
Mitigation
Partition update stream by project and add regional consumers
What You'd Change
- Increase Kafka partitions from 16 to 64, keyed by project_id
- Deploy consumer groups per region for parallel processing
- Implement debouncing: batch rapid updates (wait 100ms) before publishing
- Add a priority queue for kill switch updates that bypasses normal processing
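The debouncing and priority-bypass bullets above can be sketched together. This is an illustrative shape, not the actual pipeline: the `publish` callback, the merge-by-dict semantics, and the `kill_switch` marker field are all assumptions; the 100ms window comes from the mitigation above.

```python
class UpdateDebouncer:
    """Coalesce rapid config updates per flag before publishing.

    Updates arriving within `window` seconds of the first pending update
    for a flag are merged into one message; kill switch changes skip the
    window entirely (the priority path described above).
    """
    def __init__(self, publish, window: float = 0.1):
        self.publish = publish
        self.window = window
        self._pending: dict[str, tuple[float, dict]] = {}

    def offer(self, flag_key: str, update: dict, now: float) -> None:
        if update.get("kill_switch"):
            self.publish(flag_key, update)  # fast path, never batched
            return
        first_seen, merged = self._pending.get(flag_key, (now, {}))
        merged.update(update)
        self._pending[flag_key] = (first_seen, merged)

    def flush(self, now: float) -> None:
        """Call periodically; publishes batches older than the window."""
        for key in list(self._pending):
            first_seen, merged = self._pending[key]
            if now - first_seen >= self.window:
                self.publish(key, merged)
                del self._pending[key]
```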
Primary Bottleneck
Exposure event ingestion overwhelming analytics pipeline at 2M events/sec
Mitigation
Implement client-side sampling and async batching with backpressure
What You'd Change
- Add a configurable sampling rate per flag (100% for experiments, 1% for monitoring)
- Batch exposure events in the SDK (flush every 1s or 100 events)
- Implement backpressure: drop events if the queue exceeds a threshold, with a metric for drops
- Use columnar storage (ClickHouse/Druid) for exposure analytics
- Add a client-side deduplication window (don't re-emit the same user+flag pair within 1 hour)
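The batching and deduplication bullets above can be sketched as an SDK-side buffer. The thresholds (100 events, 1s flush, 1h dedup window) come from the bullets; the `send` callback and event field names are illustrative assumptions.

```python
class ExposureBuffer:
    """SDK-side exposure batching with a deduplication window.

    Events flush every `flush_interval` seconds or `max_batch` events,
    and the same (user, flag) pair is suppressed within `dedup_ttl`
    seconds of its last emission.
    """
    def __init__(self, send, max_batch=100, flush_interval=1.0, dedup_ttl=3600.0):
        self.send = send
        self.max_batch = max_batch
        self.flush_interval = flush_interval
        self.dedup_ttl = dedup_ttl
        self._batch: list[dict] = []
        self._last_flush = 0.0
        self._seen: dict[tuple[str, str], float] = {}

    def record(self, user_id: str, flag_key: str, variant: str, now: float) -> None:
        key = (user_id, flag_key)
        if now - self._seen.get(key, float("-inf")) < self.dedup_ttl:
            return  # already emitted for this user+flag within the window
        self._seen[key] = now
        self._batch.append({"user": user_id, "flag": flag_key,
                            "variant": variant, "ts": now})
        if len(self._batch) >= self.max_batch or now - self._last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now: float) -> None:
        if self._batch:
            self.send(self._batch)
            self._batch = []
        self._last_flush = now
```

A production buffer would also cap the queue and increment the drop counter under backpressure; that path is omitted here for brevity.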
Monitoring & Observability
Feature flag monitoring should simultaneously track serving SLOs, config propagation integrity, SDK health, and experiment data quality.
- flag_evaluation_duration_seconds (histogram): time to evaluate a single flag including rule matching
- flag_evaluations_total (counter): total evaluation throughput across all edge regions
- flag_config_propagation_lag_seconds (gauge): time between control plane write and edge cache update
- flag_kill_switch_propagation_seconds (histogram): time for a kill switch to reach all edge nodes
- kafka_consumer_group_lag (gauge): pending update events in propagation consumer groups
- flag_cache_hit_rate (gauge): percentage of evaluations served from cache
- flag_exposure_events_dropped (counter): exposure events dropped due to backpressure
- flag_variant_distribution_skew (gauge): divergence between observed and expected rollout distribution
- sdk_bootstrap_duration_seconds (histogram): time for an SDK to fetch initial config and become ready
- sdk_stream_connection_success_rate (gauge): percentage of SDK streaming connections established successfully
- stale_flag_backlog_total (gauge): count of stale flags awaiting the cleanup workflow
Flag evaluation p99 latency exceeds 5ms target
histogram_quantile(0.99, rate(flag_evaluation_duration_seconds_bucket[5m])) > 0.005
Runbook: check edge cache hit rate, rule complexity, segment sizes
Kill switch took more than 10 seconds to propagate
histogram_quantile(0.99, rate(flag_kill_switch_propagation_seconds_bucket[5m])) > 10
Runbook: check Kafka consumer lag, SSE connections, edge node health
Edge config is more than 30 seconds behind the control plane
flag_config_propagation_lag_seconds > 30 for 5m
Runbook: check the Kafka consumer group, network connectivity, edge node logs
Config update backlog risks a kill switch SLA breach
kafka_consumer_group_lag > 100000 for 10m
Runbook: scale consumers, rebalance partitions, and prioritize the kill-switch topic
More than 1% of exposure events are being dropped
rate(flag_exposure_events_dropped[5m]) / rate(flag_exposure_events_total[5m]) > 0.01
Runbook: check event queue size and Kafka producer health; increase batch capacity
Observed rollout split diverges from the expected distribution
flag_variant_distribution_skew > 0.03 for 15m
Runbook: validate bucketing consistency across SDK versions and hash inputs
SDK streaming connectivity is degraded
sdk_stream_connection_success_rate < 0.99 for 10m
Runbook: inspect SSE gateway health, connection limits, and token validation errors
Summary & Takeaways
1. Edge evaluation with an in-memory cache is essential for sub-5ms latency; central evaluation adds 50-200ms of network latency
2. Deterministic bucketing using MurmurHash3 ensures a consistent user experience across SDKs, sessions, and devices
3. Kill switches need a dedicated fast path: normal config propagation may suffer Kafka lag, so implement a direct database fallback
4. Server SDKs evaluate locally with full rules; client SDKs receive pre-computed results to avoid exposing targeting logic
5. Exposure event deduplication and sampling are critical: 2M evals/sec would generate 170B events/day without controls
6. Stale flag detection requires multiple signals: no evaluations, 100% rollout for 90+ days, no code references
7. Cache TTL jitter prevents thundering herd: stagger expiration with 10-20% random variance
8. Streaming updates (SSE/Kafka) reduce propagation latency from 30s (polling) to <5s for config changes
- Implement automated rollback based on error-rate correlation: detect a 5xx spike after a flag change
- Build a flag dependency graph: show which flags depend on others and prevent circular references
- Add canary analysis integration: automatic percentage ramp based on metric health
- Implement approval workflows with required reviewers for production flag changes
- Build a real-time A/B test dashboard with statistical significance calculations
- Add cross-team flag impact analysis: notify teams when a shared flag changes
- Implement gradual rollout automation: automatically increase the percentage over time if metrics stay healthy
- Build an SDK conformance test suite that validates bucketing consistency across all languages