Vivek Kaushik

System Design Case Studies

Created
Feb 22, 2026 05:16 PM

1️⃣ Scalable File Processing System (Azure-Focused)

🎯 Requirements

  • Users upload files (possibly large)
  • Files must be processed asynchronously
  • System must scale automatically
  • Handle retries and failures safely
  • No duplicate processing
  • Observability required
  • External API call may be involved

📌 Assumptions

  • High upload variability (spikes possible)
  • At-least-once delivery acceptable (duplicate deliveries are neutralized by idempotent processing)
  • Strong reliability required
  • Processing may take minutes

🏗️ Architecture Overview

Client → Blob Storage (SAS upload)
Blob Created Event → Event Grid
Event Grid → Service Bus Queue
Service Bus → Azure Function (Processor)
Processor → Redis / SQL / External API

🔹 Key Components

Azure Blob Storage

  • Stores raw files
  • SAS tokens for secure direct upload
  • Keeps large uploads off the API tier, avoiding a bottleneck

Event Grid

  • Reacts to blob creation
  • Lightweight routing

Service Bus

  • Durable buffering
  • Queue-based load leveling
  • DLQ support
  • Automatic message lock renewal (handled by the client SDK / Functions host) for long-running tasks

Azure Functions

  • Competing consumers
  • Scales based on queue length
  • Must be idempotent

🔁 Idempotency Strategy

  • Maintain ProcessingStatus table
  • Status: Pending → InProgress → Completed
  • Unique constraint on FileId
  • Conditional update (Pending → InProgress) ensures only one consumer owns a file
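
The claim step above can be sketched as follows. This is a minimal illustration, not the author's implementation: it uses SQLite as a stand-in for the SQL database, and the function names (`init_schema`, `claim_file`, `mark_completed`) and column names are hypothetical.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    # PRIMARY KEY on file_id plays the role of the unique constraint on FileId.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ProcessingStatus ("
        "  file_id TEXT PRIMARY KEY,"
        "  status  TEXT NOT NULL)"
    )
    conn.commit()


def claim_file(conn: sqlite3.Connection, file_id: str) -> bool:
    """Return True only for the single caller that wins ownership of file_id."""
    # Create the row if it is new; duplicates are ignored thanks to the key.
    conn.execute(
        "INSERT OR IGNORE INTO ProcessingStatus (file_id, status) "
        "VALUES (?, 'Pending')",
        (file_id,),
    )
    # Conditional update: succeeds for exactly one caller per file.
    cur = conn.execute(
        "UPDATE ProcessingStatus SET status = 'InProgress' "
        "WHERE file_id = ? AND status = 'Pending'",
        (file_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def mark_completed(conn: sqlite3.Connection, file_id: str) -> None:
    conn.execute(
        "UPDATE ProcessingStatus SET status = 'Completed' "
        "WHERE file_id = ? AND status = 'InProgress'",
        (file_id,),
    )
    conn.commit()
```

A consumer would call `claim_file` first and skip the message when it returns False. Note that this sketch leaves a file stuck in InProgress if the owner crashes; a real system would likely add a lease timestamp so a redelivered message can reclaim stale work.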

⚠️ Failure & Edge Cases

Lock Expiry

Mitigation:
  • Auto lock renewal
  • Or persist the work item first, then complete the message early so processing no longer depends on the lock

Function Crash Mid-Processing

  • Lock expires
  • Message reprocessed
  • Idempotency prevents duplication

External API Non-Idempotent

  • Cannot guarantee exactly-once
  • Use the Saga pattern (compensating actions instead of a distributed transaction)
  • Store external call state
  • Reconciliation/manual review
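
One way to picture "store external call state" is the sketch below. It is an assumption-laden illustration: the in-memory dict stands in for a durable store, and `CallState`, `call_external_once`, and the `needs_review` list are hypothetical names. The intent is recorded before the call, so a redelivered message can tell "never attempted" apart from "attempted, outcome unknown".

```python
from enum import Enum
from typing import Callable, Dict, List


class CallState(Enum):
    STARTED = "Started"      # intent recorded, outcome unknown
    SUCCEEDED = "Succeeded"  # external API confirmed the call


def call_external_once(
    file_id: str,
    state_store: Dict[str, CallState],
    call_api: Callable[[str], None],
    needs_review: List[str],
) -> str:
    state = state_store.get(file_id)
    if state is CallState.SUCCEEDED:
        return "skipped"  # already done, safe to ignore the redelivery
    if state is CallState.STARTED:
        # A previous attempt may or may not have reached the API.
        # Do not retry blindly; hand over to reconciliation / manual review.
        needs_review.append(file_id)
        return "review"
    state_store[file_id] = CallState.STARTED  # persist intent before calling
    call_api(file_id)  # may raise, leaving the state as STARTED
    state_store[file_id] = CallState.SUCCEEDED
    return "called"
```

The ambiguous STARTED state is exactly the case where exactly-once cannot be guaranteed, which is why it is routed to review rather than retried.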

Redis Down

  • Circuit breaker
  • Fallback to DB
  • Rate limit to protect DB
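
A minimal sketch of the circuit breaker with DB fallback, assuming a simple consecutive-failure threshold and a fixed cool-down. The class and function names, the thresholds, and the injected `read_cache` / `read_db` callables are all hypothetical; rate limiting of the fallback path is not shown.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # Open: allow a single trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open the breaker


def read_with_fallback(key: str, breaker: CircuitBreaker,
                       read_cache: Callable[[str], Optional[str]],
                       read_db: Callable[[str], Optional[str]]) -> Optional[str]:
    if breaker.allow():
        try:
            value = read_cache(key)
            breaker.record_success()
            if value is not None:
                return value
        except ConnectionError:
            breaker.record_failure()
    # Cache miss, cache error, or breaker open: go to the database.
    return read_db(key)
```

While the breaker is open the cache is not contacted at all, so a Redis outage costs one failed call per cool-down window instead of one timeout per request.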

📈 Scalability

  • Horizontal scaling of functions
  • Service Bus absorbs spikes
  • Blob auto-scales
  • Partition DB if needed

2️⃣ URL Shortener

🎯 Requirements

  • Create short URL
  • Redirect quickly (< 50 ms)
  • High read volume
  • Moderate write volume
  • Expiration (90 days)
  • Analytics optional

🏗️ Architecture

Client → Front Door → APIM → App Service
App → Redis Cache
App → SQL DB (on cache miss)

🔢 Short Code Generation

  • Auto-increment ID
  • Base62 encoding (URL safe)
  • Avoid Base64 (+, / and = are not URL-safe)
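
The ID-to-code step can be sketched like this. The alphabet order (digits, lowercase, uppercase) is an assumption; any fixed ordering of the 62 characters works as long as encode and decode agree.

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe characters (ordering is an arbitrary choice).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def base62_encode(n: int) -> str:
    """Encode a non-negative auto-increment ID as a Base62 short code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def base62_decode(code: str) -> int:
    """Invert base62_encode; raises ValueError on characters outside ALPHABET."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

With this alphabet, ID 62 becomes "10", and seven characters cover 62^7 (about 3.5 trillion) IDs. Codes derived from sequential IDs are guessable, which may or may not matter for a given product.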

📦 Data Storage

SQL Schema:
  • id (PK)
  • shortCode (unique index)
  • longUrl
  • createdAt
  • expiresAt

⚡ Caching Strategy

  • Cache-aside
  • Long TTL for static mapping
  • Edge caching (Front Door) for hot URLs
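
A minimal cache-aside sketch for the redirect lookup. The in-memory `TTLCache` stands in for Redis, and `resolve`, `load_from_db`, and the one-hour default TTL are hypothetical choices for illustration only.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for Redis GET / SET with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


def resolve(short_code: str, cache: TTLCache,
            load_from_db: Callable[[str], Optional[str]],
            ttl_seconds: float = 3600.0) -> Optional[str]:
    # 1. Try the cache first.
    cached = cache.get(short_code)
    if cached is not None:
        return cached
    # 2. On a miss, the application (not the cache) reads the database.
    long_url = load_from_db(short_code)
    # 3. Populate the cache so later reads skip the database.
    if long_url is not None:
        cache.set(short_code, long_url, ttl_seconds)
    return long_url
```

Because a short code's target does not change, a long TTL is safe here; in practice the TTL should not outlive the row's expiresAt.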

🔥 Hot Key Scenario

Single URL → 50k RPS
Mitigation:
  • Cache the redirect response at the CDN / edge (Front Door)
  • Hot traffic is then served at the edge, reducing Redis & DB pressure

🗑 Expiration Handling

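  • Use expiresAt column
  • Index it so expired rows are found without a full scan
  • Delete expired rows incrementally, in small batches
  • Avoid one large monthly delete (long locks, transaction log growth)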
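
Incremental batch deletion might look like the sketch below. It uses SQLite as a stand-in for the SQL database and assumes a `urls` table with the schema listed earlier, with expiresAt stored as an integer timestamp; `purge_expired` and the batch size are hypothetical.

```python
import sqlite3


def purge_expired(conn: sqlite3.Connection, now: int, batch_size: int = 1000) -> int:
    """Delete expired rows in small batches; return the total number removed."""
    total = 0
    while True:
        # Each batch is its own short transaction, so locks are held briefly.
        cur = conn.execute(
            "DELETE FROM urls WHERE id IN ("
            "  SELECT id FROM urls WHERE expiresAt < ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
    return total
```

A scheduled job could run this every few minutes. A production version would probably also pause between batches so the purge never competes with redirect traffic.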

📈 Scaling

  • Stateless app servers
  • Autoscale rules (CPU + RPS)
  • Read replicas for SQL

3️⃣ Notification System

🎯 Requirements

  • Send Email, SMS, Push
  • Triggered by events
  • Retry on failure
  • No duplicate spam
  • Independent channel scaling

🏗️ Architecture

Producer → Service Bus Topic
Subscriptions:
  • Email Processor
  • SMS Processor
  • Push Processor
Each subscription is isolated, so one slow channel does not block the others.

🔁 Delivery State

Notification table:
  • NotificationId
  • Channel
  • Status
  • RetryCount
  • LastAttempt

⚠️ SMS Provider Down Scenario

  • Complete the Service Bus message early
  • Persist delivery intent
  • Circuit breaker on provider
  • Exponential backoff
  • Scheduled retry
Goal: avoid a retry storm against a provider that is already struggling.
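
The backoff and scheduled-retry steps can be sketched as follows. Jitter is an addition not mentioned above (the "full jitter" variant, which spreads retries so they do not all fire at once); the base delay, cap, retry limit, and function names are assumptions.

```python
import random
from typing import Callable, Optional


def backoff_delay(retry_count: int, base: float = 2.0, cap: float = 300.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter backoff: a random delay up to min(cap, base * 2**retry_count)."""
    ceiling = min(cap, base * (2 ** retry_count))
    return rng() * ceiling


def next_attempt_at(last_attempt: float, retry_count: int, max_retries: int = 5,
                    rng: Callable[[], float] = random.random) -> Optional[float]:
    """Return the timestamp of the next scheduled retry, or None to give up."""
    if retry_count >= max_retries:
        return None  # stop retrying: dead-letter or flag for manual review
    return last_attempt + backoff_delay(retry_count, rng=rng)
```

The RetryCount and LastAttempt columns of the Notification table map directly onto `retry_count` and `last_attempt`; a scheduler would pick up rows whose next attempt time has passed.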

🔄 "Both or None" Channel Requirement

  • True atomicity across independent providers is impossible (a sent message cannot be rolled back)
  • Use Saga pattern
  • Compensating message
  • Manual reconciliation if needed
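
A minimal saga sketch for the "both or none" case, assuming each channel exposes a send action and a compensating action (for example, a follow-up "please disregard" message). `run_saga` and the step tuple layout are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo what already happened, newest first.
            for _, undo in reversed(completed):
                undo()
            return False
        completed.append((name, compensate))
    return True
```

Compensation is best-effort: if a compensating message itself fails, the case falls through to manual reconciliation, as noted above.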

📈 Scalability

  • Competing consumers per channel
  • Independent scaling
  • DLQ per subscription

4️⃣ Scalable API Service

🎯 Requirements

  • 5k–50k RPS
  • Global users
  • Secure
  • Low latency
  • High availability

🏗️ Architecture

Client → Front Door (WAF)
→ APIM
→ App Service
→ Redis
→ SQL / Cosmos DB

🔐 Security

  • Microsoft Entra ID (formerly Azure AD)
  • Managed Identity
  • No secrets in code

📊 Rate Limiting

  • Static rate-limit policy
  • Dynamic throttle via Named Value + Azure Monitor automation
  • Return 429 when overloaded
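
APIM's rate-limit policy is configured declaratively at the gateway, so the sketch below is not APIM configuration. It only illustrates the behavior being asked for, using a token bucket (one common algorithm, chosen here as an assumption); `TokenBucket` and `handle_request` are hypothetical names.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full to allow an initial burst
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_request(bucket: TokenBucket) -> int:
    """Return the HTTP status the caller would see."""
    return 200 if bucket.try_acquire() else 429  # 429 Too Many Requests
```

Changing `rate_per_sec` at runtime corresponds loosely to the dynamic throttle idea: the limit becomes a value that automation can adjust rather than a constant.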

📦 Database Choice

SQL

  • Strong relational model
  • ACID transactions
  • Read replicas

Cosmos DB

  • Global distribution
  • Multi-region writes
  • Configurable consistency
  • Partition key critical

🌍 Multi-Region

  • Front Door for routing
  • Geo-replication for DB
  • Cache per region

⚠️ Traffic Spike Handling

  1. Apply rate limiting
  2. Increase cache TTL
  3. Scale app tier
  4. Scale DB
  5. Long-term architectural improvements

🧠 Final Takeaways

  • Idempotency is mandatory
  • Lock renewal reduces but does not eliminate duplication
  • Push load outward
  • Separate ingestion from processing
  • Query pattern drives data modeling
  • Exactly-once requires cooperation from the downstream system (e.g. idempotency keys)
  • Design for failure first
Copyright 2026 Vivek Kaushik