1️⃣ Azure Messaging
Event Grid
Purpose: Event routing (reactive integration)
- Push-based
- At-least-once delivery
- ~24h retry window
- No message completion concept
- No long-term durable storage
- No ordering guarantees
Use When:
- Blob created events
- Azure resource changes
- Lightweight domain events
- Logic App / Function triggers
Avoid When:
- Long outages possible
- Need DLQ control
- Need transactional guarantees
- Need back-pressure handling
Service Bus (Queue / Topic)
Purpose: Durable enterprise messaging
- Pull-based
- Explicit Complete() required
- At-least-once delivery
- Durable storage
- DLQ support
- TTL configurable
- Supports sessions (ordering)
Topics: Fan-out with independent subscription state
Each subscription behaves like its own queue.
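The fan-out behaviour above can be sketched in memory. This is not the Service Bus SDK: `Topic`, `subscribe`, `publish`, `receive` and `complete` are hypothetical names used only to show that each subscription keeps its own state.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a topic: each subscription gets its own queue."""

    def __init__(self):
        self._subscriptions = {}

    def subscribe(self, name):
        self._subscriptions[name] = deque()

    def publish(self, message):
        # Fan-out: every subscription receives its own copy of the message.
        for queue in self._subscriptions.values():
            queue.append(message)

    def receive(self, name):
        # Peek-lock style: the message stays queued until complete() is called.
        queue = self._subscriptions[name]
        return queue[0] if queue else None

    def complete(self, name):
        # Completing in one subscription does not touch the others.
        self._subscriptions[name].popleft()


topic = Topic()
topic.subscribe("email")
topic.subscribe("sms")
topic.publish({"orderId": "order123"})
topic.complete("email")  # the "sms" subscription still holds its copy
```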
Use When:
- Business-critical workflows
- Notifications
- Background jobs
- Controlled retry required
Event Hubs
Purpose: High-throughput streaming ingestion
- Millions of events/sec
- Partition-based
- Used for telemetry / clickstreams
2️⃣ Service Bus Lock Expiry & Duplicate Processing
Problem
Long-running function > lock duration
→ Lock expires
→ Second instance processes message
→ Duplicate side effects
Mitigations
- Increase lock duration (max 5 min)
- Enable Auto Lock Renewal (preferred)
- Redesign to complete message quickly and process asynchronously
⚠️ Lock renewal does NOT guarantee exactly-once.
Idempotency is mandatory.
3️⃣ Idempotency Patterns
Database-Level Idempotency
- Unique constraint on business key
- Conditional update (atomic state transition)
- Status table (Pending → InProgress → Completed)
Example (conditional update):
UPDATE Orders SET Status = 'InProgress' WHERE OrderId = @id AND Status = 'Pending';
If affected rows = 0 → another instance has already claimed (or finished) the order; skip processing.
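A minimal sketch of the conditional update, using SQLite in place of the real database. The `Orders` schema and the `try_claim_order` helper are assumptions made for illustration.

```python
import sqlite3


def try_claim_order(conn, order_id):
    """Atomically move an order from Pending to InProgress.

    Returns True only for the caller whose UPDATE changed a row.
    """
    cur = conn.execute(
        "UPDATE Orders SET Status = 'InProgress' "
        "WHERE OrderId = ? AND Status = 'Pending'",
        (order_id,),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderId TEXT PRIMARY KEY, Status TEXT NOT NULL)")
conn.execute("INSERT INTO Orders VALUES ('order123', 'Pending')")
conn.commit()

first = try_claim_order(conn, "order123")   # first delivery wins the claim
second = try_claim_order(conn, "order123")  # duplicate delivery is rejected
```

The handler would run its side effects only when the claim returns True.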
External API Risk
If API is:
- Non-idempotent
- No status endpoint
- No compensating action
Then exactly-once cannot be guaranteed.
Mitigate via:
- Idempotency key (if supported)
- Saga pattern
- Reconciliation process
- Manual intervention for high-risk cases
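Where the API does support idempotency keys, the idea can be sketched as follows. `FakePaymentApi` is a hypothetical stand-in for an external service, and deriving the key from the message ID is one possible choice, not the only one.

```python
import uuid


class FakePaymentApi:
    """Hypothetical external API that honours an idempotency key."""

    def __init__(self):
        self._results = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"chargeId": f"ch_{self.charges_made}", "amount": amount}
        self._results[idempotency_key] = result
        return result


def idempotency_key_for(message_id):
    # Derive the key from the message ID so every redelivery reuses the same key.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"payments/{message_id}"))


api = FakePaymentApi()
key = idempotency_key_for("msg-001")
first = api.charge(key, 500)
retry = api.charge(key, 500)  # simulated redelivery of the same message
```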
Key Principle:
Exactly-once is an application-level illusion built on idempotency.
4️⃣ Circuit Breaker Pattern (Cache / External APIs)
Purpose:
Prevent cascading failure when dependency is down.
Without CB:
- Each request waits for timeout
- Latency explosion
- Thread exhaustion
With CB:
- Fail fast after threshold
- Route to fallback
Distributed CB options:
- Redis key flag
- Shared DB flag
- Azure App Configuration feature flag
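A minimal in-process sketch of the fail-fast behaviour. The class and parameter names are illustrative; a distributed variant would keep the open/closed state in one of the shared stores listed above instead of in `_opened_at`.

```python
import time


class CircuitBreaker:
    """Minimal in-process circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without calling the dependency
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            return fallback()
        self._failures = 0
        self._opened_at = None
        return result


calls = []


def flaky_cache_read():
    calls.append(1)
    raise ConnectionError("cache unavailable")


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
results = [breaker.call(flaky_cache_read, lambda: "db-fallback") for _ in range(5)]
```

In this run the dependency is attempted three times; the fourth and fifth requests go straight to the fallback.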
5️⃣ Redis Strategy
Cache-Aside Pattern
- Check Redis
- If miss → DB
- Populate cache
TTL Strategy
Balance freshness vs DB protection.
- Static mapping → Long TTL
- Volatile data → Short TTL
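The cache-aside steps and the TTL choice can be sketched with a dict-backed cache standing in for Redis. `TtlCache`, `get_with_cache_aside` and `load_from_db` are hypothetical names, and the 3600-second TTL is an arbitrary example value.

```python
import time


class TtlCache:
    """Dict-backed stand-in for Redis with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)


def get_with_cache_aside(cache, key, load_from_db, ttl_seconds):
    value = cache.get(key)                      # 1. check the cache
    if value is None:
        value = load_from_db(key)               # 2. on a miss, read from the database
        if value is not None:
            cache.set(key, value, ttl_seconds)  # 3. populate the cache
    return value


db_reads = []


def load_from_db(key):
    db_reads.append(key)
    return f"value-for-{key}"


cache = TtlCache()
first = get_with_cache_aside(cache, "abc123", load_from_db, ttl_seconds=3600)
second = get_with_cache_aside(cache, "abc123", load_from_db, ttl_seconds=3600)
```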
Redis Failure Handling
If Redis down:
- Trip circuit breaker
- Use DB fallback
- Possibly use in-memory cache
- Apply rate limiting to protect DB
6️⃣ Hot Key Problem
Scenario: Single key receives 50k RPS
Risk:
- Redis shard overload
- App tier bottleneck
- DB spike if cache miss
Best mitigation:
- Edge caching (Azure Front Door)
- Cache 302 redirect
- Long TTL
Push traffic outward toward CDN layer.
7️⃣ Expiration Strategy (Avoid Big Deletes)
Bad:
- Monthly full-table delete
- Lock escalation
- Log growth
Better:
- ExpiresAt column (indexed)
- Enforce at query time
- Small batch incremental cleanup
- Partition by date (advanced)
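A sketch of query-time enforcement plus small-batch cleanup, using SQLite so it runs anywhere. The `Links` table and helper names are assumptions; on SQL Server the batch delete would more likely use `DELETE TOP (n)`.

```python
import sqlite3


def find_active(conn, short_code, now):
    # Enforce expiry at query time, so expired rows are invisible before cleanup runs.
    row = conn.execute(
        "SELECT Url FROM Links WHERE ShortCode = ? AND ExpiresAt > ?",
        (short_code, now),
    ).fetchone()
    return row[0] if row else None


def delete_expired_batch(conn, now, batch_size):
    # Delete at most batch_size expired rows per call to keep transactions short.
    cur = conn.execute(
        "DELETE FROM Links WHERE rowid IN ("
        "SELECT rowid FROM Links WHERE ExpiresAt <= ? LIMIT ?)",
        (now, batch_size),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Links (ShortCode TEXT PRIMARY KEY, Url TEXT, ExpiresAt INTEGER)")
conn.execute("CREATE INDEX IX_Links_ExpiresAt ON Links (ExpiresAt)")
conn.executemany(
    "INSERT INTO Links VALUES (?, ?, ?)",
    [(f"code{i}", f"https://example.com/{i}", 100) for i in range(5)]
    + [("live", "https://example.com/live", 1000)],
)
conn.commit()

now = 500
batches = []
while True:
    deleted = delete_expired_batch(conn, now, batch_size=2)
    if deleted == 0:
        break
    batches.append(deleted)
```

Five expired rows are removed in batches of at most two, and the live row is untouched.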
8️⃣ Traffic Spike Handling (10x Surge)
Immediate:
- Return 429 (Too Many Requests)
- Rate limit in APIM
- Protect DB
Short Term:
- Increase cache TTL
- Scale out app tier cautiously
- Scale DB tier
Long Term:
- Add read replicas
- Improve partitioning
- Increase cache hit ratio
Key principle:
Protect the database first.
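One common way to implement the 429 response is a token bucket. The sketch below is an in-process illustration with made-up limits (5 requests/sec, burst of 10); in the setup described here the equivalent limit would normally live in APIM rather than in application code.

```python
import time


class TokenBucket:
    """Token-bucket limiter: allows short bursts, then a steady refill rate."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate_per_sec)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def handle_request(bucket):
    # Shed load before any database work happens.
    if not bucket.try_acquire():
        return 429, {"Retry-After": "1"}
    return 200, {}


fake_now = [0.0]  # injected clock so the example is deterministic
bucket = TokenBucket(rate_per_sec=5, capacity=10, clock=lambda: fake_now[0])
statuses = [handle_request(bucket)[0] for _ in range(12)]  # burst of 12 at t=0
fake_now[0] = 1.0                                          # one second later
after_refill = handle_request(bucket)[0]
```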
9️⃣ APIM vs Application Gateway
Application Gateway
- Layer 7 load balancer
- Path routing
- WAF
- VNet internal routing
APIM
- API governance
- Rate limiting
- JWT validation
- Quotas
- Versioning
🔟 APIM Rate Limiting – Code Examples
Static Rate Limit
<rate-limit calls="100" renewal-period="60" />
Dynamic Rate Limit Using Named Value
- Create Named Value:
systemOverloaded (string value "true" or "false"; Named Values are stored as strings)
- Policy Example (Named Values are referenced with {{name}}, not through context.Variables):
<choose>
  <when condition="@(bool.Parse("{{systemOverloaded}}"))">
    <return-response>
      <set-status code="429" reason="System Overloaded" />
    </return-response>
  </when>
</choose>
- Azure Monitor Alert → triggers Azure Function → updates Named Value via REST API.
Pattern:
Metrics → Alert → Automation → APIM policy toggle.
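The automation step might build its management call roughly as below. This is a sketch under assumptions: the API version, the resource names and the `build_named_value_update` helper are illustrative, and the request is only constructed, not sent. A real call also needs an Authorization bearer token and, most likely, an `If-Match` header; check the Azure Resource Manager reference before relying on the exact shape.

```python
import json

API_VERSION = "2022-08-01"  # assumption: verify the current Microsoft.ApiManagement API version


def build_named_value_update(subscription_id, resource_group, service_name, overloaded):
    """Build (method, url, body) for updating the systemOverloaded Named Value."""
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ApiManagement/service/{service_name}"
        f"/namedValues/systemOverloaded?api-version={API_VERSION}"
    )
    # Named Values are stored as strings, so the flag is sent as "true" / "false".
    body = json.dumps({"properties": {"value": "true" if overloaded else "false"}})
    return "PATCH", url, body


# Placeholder identifiers, not real resources.
method, url, body = build_named_value_update("sub-id", "rg-demo", "apim-demo", overloaded=True)
```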
🔟 Cosmos DB Partition Key Strategy
A good partition key should:
- Have high cardinality
- Distribute traffic evenly
- Align with the dominant query pattern
- Avoid hot partitions
Cosmos limits:
- 10k RU/s per physical partition
- 50 GB per physical partition
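These two limits give a quick lower bound on the number of physical partitions. The helper names and the 30,000 RU/s / 400 GB workload are made-up example values; the actual partition count is decided by the service.

```python
import math

MAX_RU_PER_PARTITION = 10_000  # RU/s per physical partition (limit quoted above)
MAX_GB_PER_PARTITION = 50      # GB per physical partition (limit quoted above)


def min_physical_partitions(provisioned_ru, data_gb):
    """Lower bound on physical partitions implied by throughput and storage."""
    by_throughput = math.ceil(provisioned_ru / MAX_RU_PER_PARTITION)
    by_storage = math.ceil(data_gb / MAX_GB_PER_PARTITION)
    return max(1, by_throughput, by_storage)


def ru_per_partition(provisioned_ru, partitions):
    # Provisioned throughput is divided evenly across physical partitions.
    return provisioned_ru / partitions


partitions = min_physical_partitions(provisioned_ru=30_000, data_gb=400)
share = ru_per_partition(30_000, partitions)
```

Here storage dominates (8 partitions), so each partition gets 3,750 RU/s. A single hot key is capped by that share, not by the full 30,000 RU/s, which is why even distribution matters.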
Example – Orders API
Option A: Partition by UserId
{ "id": "order123", "userId": "user45", "amount": 500 }
Partition key:
/userId
Efficient for:
GET /users/{userId}/orders
Problem
GET /orders/{orderId} carries no userId, so it cannot be a point read; without the partition key it becomes a cross-partition query.
Solution
Create lookup container:
{ "orderId": "order123", "userId": "user45" }
Step 1: Lookup orderId → get userId
Step 2: Point read using partition key + id
Cosmos = Query-first modeling.
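The two-step lookup can be sketched with in-memory containers. `Container` is a hypothetical stand-in, not the Cosmos SDK, and partitioning the lookup container by orderId is an assumption.

```python
class Container:
    """In-memory stand-in for a Cosmos container keyed by (partition key, id)."""

    def __init__(self):
        self._items = {}

    def upsert(self, partition_key, item_id, item):
        self._items[(partition_key, item_id)] = item

    def point_read(self, partition_key, item_id):
        return self._items.get((partition_key, item_id))


orders = Container()        # partitioned by /userId
order_lookup = Container()  # hypothetical lookup container, partitioned by /orderId


def save_order(order):
    orders.upsert(order["userId"], order["id"], order)
    order_lookup.upsert(
        order["id"], order["id"], {"orderId": order["id"], "userId": order["userId"]}
    )


def get_order_by_id(order_id):
    ref = order_lookup.point_read(order_id, order_id)  # Step 1: orderId -> userId
    if ref is None:
        return None
    return orders.point_read(ref["userId"], order_id)  # Step 2: point read with key + id


save_order({"id": "order123", "userId": "user45", "amount": 500})
found = get_order_by_id("order123")
```

The two writes in `save_order` are not atomic across containers, so a real system needs a way to keep the lookup container in sync.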
1️⃣1️⃣ Cosmos Consistency Levels
- Strong – Linearizability; reads always see latest committed write (higher latency, limited regions).
- Bounded Staleness – Reads lag by configured time or versions.
- Session – Read-your-own-writes consistency per session.
- Consistent Prefix – Reads never see out-of-order writes.
- Eventual – Fastest, no ordering guarantee.
Tradeoff:
Consistency vs Latency vs Availability (CAP theorem applies).
1️⃣2️⃣ Distributed Tracing
Track end-to-end request flow:
- Correlation ID
- Cache latency
- DB latency
- External API latency
Tools:
- Application Insights
- OpenTelemetry
Purpose:
Identify bottlenecks and cascading failures quickly.
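A stdlib-only sketch of correlation ID propagation with per-dependency timings. It illustrates the idea rather than the Application Insights or OpenTelemetry APIs; `span`, `spans` and `handle_request` are hypothetical names.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

correlation_id = contextvars.ContextVar("correlation_id", default=None)
spans = []  # collected (correlation_id, name, duration_ms) records


@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append((correlation_id.get(), name, duration_ms))


def handle_request(incoming_id=None):
    # Reuse the caller's ID when present so the trace spans service boundaries.
    correlation_id.set(incoming_id or str(uuid.uuid4()))
    with span("cache"):
        pass  # cache lookup would go here
    with span("db"):
        pass  # database query would go here
    with span("external-api"):
        pass  # outbound call would forward correlation_id as a header
    return correlation_id.get()


request_id = handle_request("req-42")
```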
1️⃣3️⃣ System Design Scenarios (Essence)
1️⃣ File Processing System
- Blob → Event Grid → Service Bus → Function
- Queue-based load leveling
- Idempotent processing
- DLQ handling
- Auto lock renewal
- External API saga handling
Key lesson: Separate ingestion from long-running work.
2️⃣ URL Shortener
- Base62 encoding
- Cache-aside with Redis
- Edge caching for hot keys
- Sharding by shortCode hash (if needed)
- Expiration via ExpiresAt + incremental cleanup
Key lesson: Push load outward (Edge > Cache > DB).
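The Base62 step can be sketched as below, assuming short codes are derived from a numeric ID. The alphabet order (digits, lowercase, uppercase) is one arbitrary choice.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols


def base62_encode(number):
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def base62_decode(code):
    number = 0
    for char in code:
        number = number * 62 + ALPHABET.index(char)
    return number
```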
3️⃣ Notification System
- Service Bus Topics for durable fan-out
- Channel isolation (Email/SMS/Push)
- Persist delivery state
- Circuit breaker per provider
- Scheduled retries
- Saga for cross-channel coordination
Key lesson: Decouple ingestion from delivery.
4️⃣ Scalable API Service
- Front Door + WAF
- APIM for governance
- App Service autoscale
- Redis cache
- SQL or Cosmos based on access pattern
- Rate limiting + overload protection
Key lesson: Design for failure, not just success.
1️⃣4️⃣ Core Engineering Principles
- Idempotency > Hope
- Protect DB under stress
- Push load outward (Edge → Cache → App → DB)
- Separate ingestion from processing
- Accept distributed system limits
- Query pattern drives data modeling
- Exactly-once requires cooperation
- Design for observability
