Vivek Kaushik

System Design

Created: Feb 22, 2026 08:18 AM
Tags: azure, system design, interview

1️⃣ Azure Messaging

Event Grid

Purpose: Event routing (reactive integration)
  • Push-based
  • At-least-once delivery
  • ~24h retry window
  • No message completion concept
  • No long-term durable storage
  • No ordering guarantees
Use When:
  • Blob created events
  • Azure resource changes
  • Lightweight domain events
  • Logic App / Function triggers
Avoid When:
  • Long outages possible
  • Need DLQ control
  • Need transactional guarantees
  • Need back-pressure handling

Service Bus (Queue / Topic)

Purpose: Durable enterprise messaging
  • Pull-based
  • Explicit Complete() required (PeekLock receive mode)
  • At-least-once delivery
  • Durable storage
  • DLQ support
  • TTL configurable
  • Supports sessions (ordering)
Topics: Fan-out with independent subscription state
Each subscription behaves like its own queue.
Use When:
  • Business-critical workflows
  • Notifications
  • Background jobs
  • Controlled retry required

Event Hub

Purpose: High-throughput streaming ingestion
  • Millions of events/sec
  • Partition-based
  • Used for telemetry / clickstreams

2️⃣ Service Bus Lock Expiry & Duplicate Processing

Problem

Long-running function > lock duration
→ Lock expires
→ Second instance processes message
→ Duplicate side effects

Mitigations

  1. Increase lock duration (max 5 min)
  2. Enable Auto Lock Renewal (preferred)
  3. Redesign to complete message quickly and process asynchronously
⚠️ Lock renewal does NOT guarantee exactly-once.
Idempotency is mandatory.
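
As a minimal sketch of why idempotency covers the lock-expiry case: a redelivered message carries the same message ID, so the handler can skip work it has already done. The names here (Message, IdempotentHandler, the in-memory processed_ids set) are illustrative assumptions, not Service Bus SDK types. In production the "seen" record would live in a durable store and be written atomically with the side effect.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: str
    body: str


@dataclass
class IdempotentHandler:
    """Applies the side effect at most once per message ID."""
    processed_ids: set = field(default_factory=set)   # stand-in for a durable store
    side_effects: list = field(default_factory=list)  # records what was actually done

    def handle(self, msg: Message) -> bool:
        # A redelivery after lock expiry arrives with the same message ID.
        if msg.message_id in self.processed_ids:
            return False  # duplicate: skip the side effect
        self.side_effects.append(f"processed:{msg.body}")
        self.processed_ids.add(msg.message_id)
        return True
```

Calling handle() twice with the same message returns True, then False, and leaves exactly one recorded side effect.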

3️⃣ Idempotency Patterns

Database-Level Idempotency

  • Unique constraint on business key
  • Conditional update (atomic state transition)
  • Status table (Pending → InProgress → Completed)
Example (conditional update):
UPDATE Orders SET Status = 'InProgress' WHERE OrderId = @id AND Status = 'Pending';
If affected rows = 0 → another instance already processing.
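
The conditional update above can be exercised end to end with a small sketch. This one uses Python's built-in sqlite3 module purely as a stand-in database; the try_claim helper and the table layout are assumptions for illustration.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, order_id: str) -> bool:
    """Atomically move an order from Pending to InProgress.

    Returns True only for the caller whose UPDATE actually changed the row.
    """
    cur = conn.execute(
        "UPDATE Orders SET Status = 'InProgress' "
        "WHERE OrderId = ? AND Status = 'Pending'",
        (order_id,),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderId TEXT PRIMARY KEY, Status TEXT NOT NULL)")
conn.execute("INSERT INTO Orders VALUES ('order123', 'Pending')")
conn.commit()

first = try_claim(conn, "order123")   # first delivery wins the claim
second = try_claim(conn, "order123")  # redelivery finds no Pending row
```

Here first is True and second is False, which is the "affected rows = 0" signal that another instance already owns the order.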

External API Risk

If the API:
  • Is non-idempotent
  • Has no status endpoint
  • Offers no compensating action
Then exactly-once cannot be guaranteed.
Mitigate via:
  • Idempotency key (if supported)
  • Saga pattern
  • Reconciliation process
  • Manual intervention for high-risk cases
Key Principle:
Exactly-once is an application-level illusion built on idempotency.
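
A rough sketch of the idempotency-key mitigation, assuming the provider deduplicates on a client-supplied key. FakePaymentApi is a hypothetical in-memory stand-in, not a real provider SDK, and idempotency_key_for is an illustrative helper that derives a stable key from the order ID so that every retry sends the same key.

```python
import uuid


class FakePaymentApi:
    """Hypothetical provider that deduplicates on an idempotency key."""

    def __init__(self):
        self._results = {}      # idempotency key -> stored receipt
        self.charges_made = 0   # how many real charges happened

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self._results:
            # Replay of an earlier request: return the stored result, charge nothing.
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}"
        self._results[idempotency_key] = receipt
        return receipt


def idempotency_key_for(order_id: str) -> str:
    # Deterministic: the same order always maps to the same key across retries.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"orders/{order_id}"))
```

Retrying charge() with the same key returns the same receipt and leaves charges_made at 1. If the provider offers no such key, this sketch does not apply and the saga or reconciliation options remain.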

4️⃣ Circuit Breaker Pattern (Cache / External APIs)

Purpose:
Prevent cascading failure when dependency is down.
Without CB:
  • Each request waits for timeout
  • Latency explosion
  • Thread exhaustion
With CB:
  • Fail fast after threshold
  • Route to fallback
Distributed CB options:
  • Redis key flag
  • Shared DB flag
  • Azure App Configuration feature flag
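
The fail-fast behaviour can be sketched as a small in-process breaker. This is a simplified illustration under assumed defaults (failure_threshold, reset_timeout, and the injectable clock are all hypothetical parameters); it keeps state in one process, so it does not cover the distributed options listed above.

```python
import time


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> one trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without touching the dependency
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately instead of waiting on a timeout, which is what prevents the latency explosion and thread exhaustion described above.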

5️⃣ Redis Strategy

Cache-Aside Pattern

  1. Check Redis
  2. If miss → DB
  3. Populate cache
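
The three steps above can be sketched as follows. A plain dict stands in for Redis and load_from_db is a hypothetical loader callback; both are assumptions made to keep the example self-contained.

```python
class CacheAside:
    """Read path: cache first, database on a miss, then populate the cache."""

    def __init__(self, cache: dict, load_from_db):
        self.cache = cache                # stand-in for Redis
        self.load_from_db = load_from_db  # hypothetical DB loader: key -> value or None

    def get(self, key):
        value = self.cache.get(key)       # 1. check the cache
        if value is not None:
            return value
        value = self.load_from_db(key)    # 2. miss: go to the database
        if value is not None:
            self.cache[key] = value       # 3. populate the cache for later reads
        return value
```

A second get() for the same key is served from the cache and never reaches the loader.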

TTL Strategy

Balance freshness vs DB protection.
  • Static mapping → Long TTL
  • Volatile data → Short TTL

Redis Failure Handling

If Redis down:
  • Trip circuit breaker
  • Use DB fallback
  • Possibly use in-memory cache
  • Apply rate limiting to protect DB
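
One possible shape for the DB fallback plus in-memory cache, sketched under assumptions: redis_get and db_get are hypothetical callables, a ConnectionError stands in for whatever error the real Redis client raises, and LocalTtlCache is an illustrative per-process cache. The circuit breaker and rate limiting steps are left out here.

```python
import time


class LocalTtlCache:
    """Tiny per-process cache with expiry, used only while Redis is unavailable."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)


def get_with_fallback(key, redis_get, db_get, local):
    try:
        value = redis_get(key)
        if value is not None:
            return value
    except ConnectionError:
        # Redis is down: try the local cache before hitting the database.
        cached = local.get(key)
        if cached is not None:
            return cached
    value = db_get(key)
    if value is not None:
        local.put(key, value)  # shields the DB from repeated reads of the same key
    return value
```

A short local TTL keeps staleness bounded while still absorbing repeated reads of the same key during the outage.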

6️⃣ Hot Key Problem

Scenario: Single key receives 50k RPS
Risk:
  • Redis shard overload
  • App tier bottleneck
  • DB spike if cache miss
Best mitigation:
  • Edge caching (Azure Front Door)
  • Cache 302 redirect
  • Long TTL
Push traffic outward toward CDN layer.

7️⃣ Expiration Strategy (Avoid Big Deletes)

Bad:
  • Monthly full-table delete
  • Lock escalation
  • Log growth
Better:
  • ExpiresAt column (indexed)
  • Enforce at query time
  • Small batch incremental cleanup
  • Partition by date (advanced)
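
The "better" approach can be sketched with sqlite3 as a stand-in database. The Links table, its column names, and the batch size are assumptions for illustration; the same idea (filter on ExpiresAt at read time, delete in small batches) carries over to other engines with their own batching syntax.

```python
import sqlite3


def create_links_table(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE Links (ShortCode TEXT PRIMARY KEY, ExpiresAt INTEGER NOT NULL)")
    conn.execute("CREATE INDEX IX_Links_ExpiresAt ON Links (ExpiresAt)")


def is_live(conn: sqlite3.Connection, short_code: str, now: int) -> bool:
    """Enforce expiry at query time, regardless of whether cleanup has run yet."""
    row = conn.execute(
        "SELECT 1 FROM Links WHERE ShortCode = ? AND ExpiresAt > ?",
        (short_code, now),
    ).fetchone()
    return row is not None


def delete_expired_batch(conn: sqlite3.Connection, now: int, batch_size: int) -> int:
    """Delete at most batch_size expired rows and return how many were removed."""
    cur = conn.execute(
        "DELETE FROM Links WHERE rowid IN ("
        "  SELECT rowid FROM Links WHERE ExpiresAt <= ? LIMIT ?)",
        (now, batch_size),
    )
    conn.commit()  # one short transaction per batch keeps locks and log growth small
    return cur.rowcount


def cleanup(conn: sqlite3.Connection, now: int, batch_size: int = 1000) -> int:
    total = 0
    while True:
        deleted = delete_expired_batch(conn, now, batch_size)
        total += deleted
        if deleted < batch_size:  # a short batch means nothing expired is left
            return total
```

Because reads already filter on ExpiresAt, the cleanup job can run slowly in the background without affecting correctness.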

8️⃣ Traffic Spike Handling (10x Surge)

Immediate:
  • Return 429 (Too Many Requests)
  • Rate limit in APIM
  • Protect DB
Short Term:
  • Increase cache TTL
  • Scale out app tier cautiously
  • Scale DB tier
Long Term:
  • Add read replicas
  • Improve partitioning
  • Increase cache hit ratio
Key principle:
Protect the database first.
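
To make the "return 429" step concrete, here is a minimal token-bucket sketch. It is an in-process illustration only (APIM enforces its own limits through policy); the rate, capacity, and handle_request wrapper are assumed names and values.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_request(bucket: TokenBucket, handler):
    # Shed load before any database work happens.
    if not bucket.allow():
        return 429, "Too Many Requests"
    return 200, handler()
```

Rejected requests never reach the handler, so the database only sees traffic up to the configured rate.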

9️⃣ APIM vs Application Gateway

Application Gateway

  • Layer 7 load balancer
  • Path routing
  • WAF
  • VNet internal routing

APIM

  • API governance
  • Rate limiting
  • JWT validation
  • Quotas
  • Versioning

🔟 APIM Rate Limiting – Code Examples

Static Rate Limit

<rate-limit calls="100" renewal-period="60" />

Dynamic Rate Limit Using Named Value

  1. Create Named Value: systemOverloaded (Named Values are stored as strings, so use "true" / "false")
  2. Policy Example (Named Values are referenced as {{name}}; they are not exposed through context.Variables):
<choose>
  <when condition="@(bool.Parse("{{systemOverloaded}}"))">
    <return-response>
      <set-status code="429" reason="System Overloaded" />
    </return-response>
  </when>
</choose>
  3. Azure Monitor Alert → triggers Azure Function → updates Named Value via REST API.
Pattern:
Metrics → Alert → Automation → APIM policy toggle.

🔟 Cosmos DB Partition Key Strategy

A good partition key should:
  • Have high cardinality
  • Distribute traffic evenly
  • Align with the dominant query pattern
  • Avoid hot partitions
Cosmos limits:
  • 10,000 RU/s per physical partition
  • 50 GB per physical partition
  • 20 GB per logical partition (all items sharing one partition key value)

Example – Orders API

Option A: Partition by UserId

{ "id": "order123", "userId": "user45", "amount": 500 }
Partition key: /userId
Efficient for:
GET /users/{userId}/orders

Problem

GET /orders/{orderId} carries no userId, so it cannot be a point read. Without the partition key it becomes a cross-partition query.

Solution

Create lookup container (partitioned by /orderId):
{ "orderId": "order123", "userId": "user45" }
Step 1: Lookup orderId → get userId
Step 2: Point read using partition key + id
Cosmos = Query-first modeling.
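
The two-step lookup can be illustrated with an in-memory model. Container below is a hypothetical stand-in, not the Cosmos SDK: it only counts how many partitions each read touches. The lookup item is keyed by the order ID as its "id" here, a simplification of the orderId field shown above.

```python
class Container:
    """In-memory stand-in for a partitioned container (not the Cosmos SDK)."""

    def __init__(self):
        self.partitions = {}         # partition key -> {item id -> item}
        self.partitions_touched = 0  # crude proxy for read cost

    def upsert(self, partition_key, item):
        self.partitions.setdefault(partition_key, {})[item["id"]] = item

    def point_read(self, partition_key, item_id):
        self.partitions_touched += 1  # exactly one partition is consulted
        return self.partitions.get(partition_key, {}).get(item_id)

    def cross_partition_find(self, item_id):
        for items in self.partitions.values():  # fan-out across partitions
            self.partitions_touched += 1
            if item_id in items:
                return items[item_id]
        return None


def get_order(orders: Container, lookup: Container, order_id: str):
    # Step 1: resolve orderId -> userId from the lookup container.
    ref = lookup.point_read(order_id, order_id)
    if ref is None:
        return None
    # Step 2: point read in the orders container using partition key + id.
    return orders.point_read(ref["userId"], order_id)
```

The lookup costs two single-partition reads no matter how many users exist, whereas cross_partition_find may touch every partition in the worst case.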

1️⃣1️⃣ Cosmos Consistency Levels

  1. Strong – Linearizability; reads always see latest committed write (higher latency, limited regions).
  2. Bounded Staleness – Reads lag by configured time or versions.
  3. Session – Read-your-own-writes consistency per session.
  4. Consistent Prefix – Reads never see out-of-order writes.
  5. Eventual – Fastest, no ordering guarantee.
Tradeoff:
Consistency vs Latency vs Availability (CAP theorem applies).

1️⃣2️⃣ Distributed Tracing

Track end-to-end request flow:
  • Correlation ID
  • Cache latency
  • DB latency
  • External API latency
Tools:
  • Application Insights
  • OpenTelemetry
Purpose:
Identify bottlenecks and cascading failures quickly.
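
As a toy illustration of the idea (not the Application Insights or OpenTelemetry API), the sketch below tags one request with a correlation ID and records how long each named step takes. Trace, span, and slowest are hypothetical names.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects timed spans for one request under a single correlation ID."""

    def __init__(self, correlation_id=None):
        self.correlation_id = correlation_id or str(uuid.uuid4())
        self.spans = []  # list of (name, duration_seconds)

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped step raised.
            self.spans.append((name, time.perf_counter() - start))

    def slowest(self):
        """Name of the longest span, i.e. the first place to look for a bottleneck."""
        return max(self.spans, key=lambda s: s[1])[0]
```

Wrapping the cache, DB, and external API calls in separate spans and passing the same correlation ID downstream makes it possible to see which hop dominates the request latency.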

1️⃣3️⃣ System Design Scenarios (Essence)

1️⃣ File Processing System

  • Blob → Event Grid → Service Bus → Function
  • Queue-based load leveling
  • Idempotent processing
  • DLQ handling
  • Auto lock renewal
  • External API saga handling
Key lesson: Separate ingestion from long-running work.

2️⃣ URL Shortener

  • Base62 encoding
  • Cache-aside with Redis
  • Edge caching for hot keys
  • Sharding by shortCode hash (if needed)
  • Expiration via ExpiresAt + incremental cleanup
Key lesson: Push load outward (Edge > Cache > DB).
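
For the Base62 step, a short sketch of turning a numeric ID into a short code and back. The alphabet order (digits, lowercase, uppercase) is an arbitrary choice made here; any fixed ordering of the 62 characters works as long as encode and decode agree.

```python
import string

# 62 characters: 0-9, a-z, A-Z (ordering is an assumption of this sketch).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n: int) -> str:
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))  # most significant digit first


def decode_base62(code: str) -> int:
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)  # raises ValueError on characters outside the alphabet
    return n
```

With this alphabet, encode_base62(61) gives "Z" and encode_base62(62) gives "10"; seven characters cover 62^7 (about 3.5 trillion) distinct IDs.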

3️⃣ Notification System

  • Service Bus Topics for durable fan-out
  • Channel isolation (Email/SMS/Push)
  • Persist delivery state
  • Circuit breaker per provider
  • Scheduled retries
  • Saga for cross-channel coordination
Key lesson: Decouple ingestion from delivery.

4️⃣ Scalable API Service

  • Front Door + WAF
  • APIM for governance
  • App Service autoscale
  • Redis cache
  • SQL or Cosmos based on access pattern
  • Rate limiting + overload protection
Key lesson: Design for failure, not just success.

1️⃣4️⃣ Core Engineering Principles

  1. Idempotency > Hope
  2. Protect DB under stress
  3. Push load outward (Edge → Cache → App → DB)
  4. Separate ingestion from processing
  5. Accept distributed system limits
  6. Query pattern drives data modeling
  7. Exactly-once requires cooperation
  8. Design for observability

 
Copyright 2026 Vivek Kaushik