Vivek Kaushik

1️⃣ Azure Messaging

Event Grid

Purpose: Event routing (reactive integration)

Push-based

At-least-once delivery

~24h retry window by default (exponential backoff); after that the event is dropped or dead-lettered

No message completion concept

No long-term durable storage

No ordering guarantees

Use When:

Blob created events

Azure resource changes

Lightweight domain events

Logic App / Function triggers

Avoid When:

Long outages possible

Need full DLQ control (Event Grid dead-letters only to a storage account)

Need transactional guarantees

Need back-pressure handling

Service Bus (Queue / Topic)

Purpose: Durable enterprise messaging

Pull-based

Explicit Complete() required (PeekLock mode)

At-least-once delivery

Durable storage

DLQ support

TTL configurable

Supports sessions (FIFO ordering within a session)

Topics: Fan-out with independent subscription state

Each subscription behaves like its own queue.
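
To make the fan-out behaviour concrete, here is a minimal in-memory sketch. It is not the Service Bus SDK: the `Topic` class and the subscription names are hypothetical, and locks, DLQ and TTL are left out. It only shows that every subscription gets its own copy of a message and drains it independently.

```python
from collections import deque
from typing import Deque, Dict, Optional


class Topic:
    """Toy model of a topic: one private queue per subscription."""

    def __init__(self) -> None:
        self.subscriptions: Dict[str, Deque[str]] = {}

    def subscribe(self, name: str) -> None:
        # Each subscription starts with its own empty queue.
        self.subscriptions.setdefault(name, deque())

    def publish(self, message: str) -> None:
        # Fan-out: every subscription receives its own copy.
        for queue in self.subscriptions.values():
            queue.append(message)

    def receive(self, name: str) -> Optional[str]:
        # Consuming from one subscription does not affect the others.
        queue = self.subscriptions[name]
        return queue.popleft() if queue else None


topic = Topic()
topic.subscribe("email")
topic.subscribe("sms")
topic.publish("order-shipped")

email_message = topic.receive("email")          # email consumer drains its copy
sms_pending = len(topic.subscriptions["sms"])   # sms copy is still waiting
```

A slow or failing consumer on one subscription therefore does not block the others.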

Use When:

Business-critical workflows

Notifications

Background jobs

Controlled retry required

Event Hub

Purpose: High-throughput streaming ingestion

Millions of events/sec

Partition-based

Used for telemetry / clickstreams

2️⃣ Service Bus Lock Expiry & Duplicate Processing

Problem

Long-running function > lock duration

→ Lock expires

→ Second instance processes message

→ Duplicate side effects

Mitigations

Increase lock duration (max 5 min)

Enable Auto Lock Renewal (preferred)

Redesign to complete message quickly and process asynchronously

⚠️ Lock renewal does NOT guarantee exactly-once.

Idempotency is mandatory.

3️⃣ Idempotency Patterns

Database-Level Idempotency

Unique constraint on business key

Conditional update (atomic state transition)

Status table (Pending → InProgress → Completed)

Example (conditional update):


UPDATE Orders
SET Status = 'InProgress'
WHERE OrderId = @id AND Status = 'Pending';

If affected rows = 0 → another instance has already claimed (or finished) the order; skip processing.
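
The same conditional update can be exercised end to end. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for the real database; the `try_claim` helper and the table layout are assumptions for illustration.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, order_id: str) -> bool:
    """Atomically move an order from Pending to InProgress.

    Returns True only for the caller whose UPDATE actually changed a row.
    """
    cur = conn.execute(
        "UPDATE Orders SET Status = 'InProgress' "
        "WHERE OrderId = ? AND Status = 'Pending'",
        (order_id,),
    )
    conn.commit()
    return cur.rowcount == 1


# In-memory database standing in for the real Orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderId TEXT PRIMARY KEY, Status TEXT NOT NULL)")
conn.execute("INSERT INTO Orders VALUES ('order123', 'Pending')")
conn.commit()

first = try_claim(conn, "order123")   # first delivery claims the order
second = try_claim(conn, "order123")  # redelivery finds nothing left to claim
```

Only the caller that gets `True` should run the side effects; a duplicate delivery gets `False` and can complete the message without doing the work again.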

External API Risk

If the API:

Is non-idempotent

Has no status endpoint

Has no compensating action

Then exactly-once cannot be guaranteed.

Mitigate via:

Idempotency key (if supported)

Saga pattern

Reconciliation process

Manual intervention for high-risk cases

Key Principle:

Exactly-once is an application-level illusion built on idempotency.

4️⃣ Circuit Breaker Pattern (Cache / External APIs)

Purpose:

Prevent cascading failure when dependency is down.

Without CB:

Each request waits for timeout

Latency explosion

Thread exhaustion

With CB:

Fail fast after threshold

Route to fallback

Distributed CB options:

Redis key flag

Shared DB flag

Azure App Configuration feature flag
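
A minimal in-process sketch of the fail-fast behaviour, assuming a simple consecutive-failure threshold and a fixed reset timeout (both hypothetical parameters). It keeps state in local memory; a distributed breaker would store the open flag in one of the shared options listed above instead.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the reset timeout has passed, allow a trial call (half-open).
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast, do not touch the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
attempts = []


def flaky() -> str:
    attempts.append(1)
    raise TimeoutError("dependency down")


results = [breaker.call(flaky, lambda: "fallback") for _ in range(5)]
```

Only the first two calls reach the dependency; the remaining three return the fallback immediately, which is what prevents the timeout pile-up described above.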

5️⃣ Redis Strategy

Cache-Aside Pattern

Check Redis

If miss → DB

Populate cache

TTL Strategy

Balance freshness vs DB protection.

Static mapping → Long TTL

Volatile data → Short TTL
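
The cache-aside steps and the TTL choice can be sketched together. The `TtlCache` class below is an in-memory stand-in for Redis (an assumption for illustration, not a Redis client), and `load_from_db` is a hypothetical loader.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TtlCache:
    """In-memory stand-in for a cache with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_with_cache_aside(
    cache: TtlCache,
    key: str,
    load_from_db: Callable[[str], str],
    ttl_seconds: float,
) -> str:
    cached = cache.get(key)            # 1. check the cache
    if cached is not None:
        return cached
    value = load_from_db(key)          # 2. on a miss, go to the DB
    cache.set(key, value, ttl_seconds) # 3. populate the cache
    return value


# Demo with a fake clock and a loader that records DB reads.
now = [0.0]
db_reads: List[str] = []


def load_from_db(key: str) -> str:
    db_reads.append(key)
    return f"https://example.com/{key}"


cache = TtlCache(clock=lambda: now[0])
first = get_with_cache_aside(cache, "abc", load_from_db, ttl_seconds=60)   # miss
second = get_with_cache_aside(cache, "abc", load_from_db, ttl_seconds=60)  # hit
now[0] = 61.0
third = get_with_cache_aside(cache, "abc", load_from_db, ttl_seconds=60)   # expired
```

A longer TTL means fewer entries in `db_reads` at the cost of staler values, which is the freshness vs DB protection tradeoff.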

Redis Failure Handling

If Redis down:

Trip circuit breaker

Use DB fallback

Possibly use in-memory cache

Apply rate limiting to protect DB

6️⃣ Hot Key Problem

Scenario: Single key receives 50k RPS

Risk:

Redis shard overload

App tier bottleneck

DB spike if cache miss

Best mitigation:

Edge caching (Azure Front Door)

Cache 302 redirect

Long TTL

Push traffic outward toward CDN layer.

7️⃣ Expiration Strategy (Avoid Big Deletes)

Bad:

Monthly full-table delete

Lock escalation

Log growth

Better:

ExpiresAt column (indexed)

Enforce at query time

Small batch incremental cleanup

Partition by date (advanced)
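
A sketch of the small-batch cleanup, again using `sqlite3` as a stand-in. The `Links` table, its columns and the batch size are hypothetical; the `rowid` subquery is a SQLite idiom, and other engines have their own batching syntax (for example `DELETE TOP (n)` in SQL Server).

```python
import sqlite3


def purge_expired(conn: sqlite3.Connection, now: int, batch_size: int = 1000) -> int:
    """Delete expired rows in small batches; returns the total removed."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM Links WHERE rowid IN ("
            "  SELECT rowid FROM Links WHERE ExpiresAt <= ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()  # short transactions keep locks and log growth small
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Links (ShortCode TEXT PRIMARY KEY, ExpiresAt INTEGER NOT NULL)")
conn.execute("CREATE INDEX IX_Links_ExpiresAt ON Links (ExpiresAt)")
conn.executemany(
    "INSERT INTO Links VALUES (?, ?)",
    [(f"code{i}", i) for i in range(25)],
)
conn.commit()

# Query-time enforcement: expired rows are invisible even before cleanup runs.
visible_before = conn.execute(
    "SELECT COUNT(*) FROM Links WHERE ExpiresAt > ?", (9,)
).fetchone()[0]

deleted = purge_expired(conn, now=9, batch_size=4)
remaining = conn.execute("SELECT COUNT(*) FROM Links").fetchone()[0]
```

Because reads already filter on `ExpiresAt`, the cleanup job is only housekeeping and can run slowly in the background.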

8️⃣ Traffic Spike Handling (10x Surge)

Immediate:

Return 429 (Too Many Requests)

Rate limit in APIM

Protect DB

Short Term:

Increase cache TTL

Scale out app tier cautiously

Scale DB tier

Long Term:

Add read replicas

Improve partitioning

Increase cache hit ratio

Key principle:

Protect the database first.
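
One common way to decide when to return 429 is a token bucket. The sketch below is an application-level illustration with hypothetical rate and capacity values; it is not a description of how APIM implements its policies.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(
        self,
        rate_per_sec: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_request(bucket: TokenBucket) -> int:
    # Shed load before the request can reach the database.
    return 200 if bucket.allow() else 429


# Demo with a fake clock: 1 token per second, burst of 3.
now = [0.0]
bucket = TokenBucket(rate_per_sec=1.0, capacity=3, clock=lambda: now[0])
burst = [handle_request(bucket) for _ in range(5)]
now[0] = 2.0
after_wait = [handle_request(bucket) for _ in range(3)]
```

Rejected requests cost almost nothing, so the database only ever sees the rate the bucket admits.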

9️⃣ APIM vs Application Gateway

Application Gateway

Layer 7 load balancer

Path routing

VNet internal routing

APIM

API governance

Rate limiting

JWT validation

Quotas

Versioning

🔟 APIM Rate Limiting – Code Examples

Static Rate Limit


<rate-limit calls="100" renewal-period="60" />

Dynamic Rate Limit Using Named Value

Create Named Value: systemOverloaded (string holding "true" or "false")

Policy Example:


<choose>
  <!-- Named values are referenced with {{name}}, not through context.Variables -->
  <when condition="@(bool.Parse("{{systemOverloaded}}"))">
    <return-response>
      <set-status code="429" reason="System Overloaded" />
    </return-response>
  </when>
</choose>

Azure Monitor Alert → triggers Azure Function → updates Named Value via REST API.

Pattern:

Metrics → Alert → Automation → APIM policy toggle.

🔟 Cosmos DB Partition Key Strategy

Partition key must:

Have high cardinality

Distribute traffic evenly

Align with the dominant query pattern

Avoid hot partitions

Cosmos limits:

10k RU/s per physical partition

50GB per physical partition

Example – Orders API

Option A: Partition by UserId


{
  "id": "order123",
  "userId": "user45",
  "amount": 500
}

Partition key: /userId

Efficient for:

GET /users/{userId}/orders

Problem

GET /orders/{orderId} does not carry the partition key (userId), so it becomes a cross-partition query.

Solution

Create lookup container keyed by orderId (partition key: /orderId):


{
  "orderId": "order123",
  "userId": "user45"
}

Step 1: Lookup orderId → get userId

Step 2: Point read using partition key + id
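
The two steps can be sketched with plain dictionaries standing in for the two containers. This is not the Cosmos SDK; `orders`, `order_lookup` and `get_order` are hypothetical names used only to show the access path.

```python
from typing import Dict, Optional, Tuple

# Orders container: addressed by (partition key, id) = (userId, order id).
orders: Dict[Tuple[str, str], dict] = {
    ("user45", "order123"): {"id": "order123", "userId": "user45", "amount": 500},
}

# Lookup container: addressed by orderId alone.
order_lookup: Dict[str, dict] = {
    "order123": {"orderId": "order123", "userId": "user45"},
}


def get_order(order_id: str) -> Optional[dict]:
    # Step 1: resolve orderId -> userId with a cheap keyed lookup.
    ref = order_lookup.get(order_id)
    if ref is None:
        return None
    # Step 2: point read with partition key + id (no cross-partition scan).
    return orders.get((ref["userId"], order_id))


found = get_order("order123")
missing = get_order("order999")
```

The cost is one extra keyed read plus keeping the lookup container in sync whenever an order is created.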

Cosmos = Query-first modeling.

1️⃣1️⃣ Cosmos Consistency Levels

Strong – Linearizability; reads always see latest committed write (higher write latency, not available with multi-region writes).

Bounded Staleness – Reads lag by configured time or versions.

Session – Read-your-own-writes consistency per session.

Consistent Prefix – Reads never see out-of-order writes.

Eventual – Fastest, no ordering guarantee.

Tradeoff:

Consistency vs Latency vs Availability (CAP theorem applies).

1️⃣2️⃣ Distributed Tracing

Track end-to-end request flow:

Correlation ID

Cache latency

DB latency

External API latency

Tools:

Application Insights

OpenTelemetry

Purpose:

Identify bottlenecks and cascading failures quickly.
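
A hand-rolled sketch of the idea, assuming nothing beyond the standard library: one correlation ID per request and a timed span per dependency. In practice Application Insights or OpenTelemetry would collect this; the `span` helper and the `spans` list here are hypothetical stand-ins for an exporter.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager
from typing import Iterator, List, Tuple

# Correlation ID travels implicitly with the current request context.
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default=""
)

# (correlation id, span name, elapsed milliseconds)
spans: List[Tuple[str, str, float]] = []


@contextmanager
def span(name: str) -> Iterator[None]:
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        spans.append((correlation_id.get(), name, elapsed_ms))


def handle_request() -> str:
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    with span("cache"):
        pass  # cache lookup would go here
    with span("db"):
        pass  # database query would go here
    with span("external_api"):
        pass  # outbound HTTP call would go here
    return cid


request_id = handle_request()
```

Grouping `spans` by correlation ID gives the per-request latency breakdown (cache vs DB vs external API) listed above.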

1️⃣3️⃣ System Design Scenarios (Essence)

1️⃣ File Processing System

Blob → Event Grid → Service Bus → Function

Queue-based load leveling

Idempotent processing

DLQ handling

Auto lock renewal

External API saga handling

Key lesson: Separate ingestion from long-running work.

2️⃣ URL Shortener

Base62 encoding

Cache-aside with Redis

Edge caching for hot keys

Sharding by shortCode hash (if needed)

Expiration via ExpiresAt + incremental cleanup

Key lesson: Push load outward (Edge > Cache > DB).
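
For the Base62 step in the URL shortener, a minimal sketch that turns a numeric ID into a short code and back. The alphabet order (digits, lowercase, uppercase) is an assumption; any fixed 62-character ordering works as long as encode and decode agree.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def base62_encode(n: int) -> str:
    """Encode a non-negative integer ID as a Base62 short code."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most significant digit first


def base62_decode(code: str) -> int:
    """Decode a Base62 short code back to its integer ID."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)  # raises ValueError on invalid characters
    return n


short_code = base62_encode(125)         # 125 = 2 * 62 + 1
round_trip = base62_decode(short_code)
```

Seven Base62 characters cover 62^7 (about 3.5 trillion) IDs, which is why short codes stay short.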

3️⃣ Notification System

Service Bus Topics for durable fan-out

Channel isolation (Email/SMS/Push)

Persist delivery state

Circuit breaker per provider

Scheduled retries

Saga for cross-channel coordination

Key lesson: Decouple ingestion from delivery.

4️⃣ Scalable API Service

Front Door + WAF

APIM for governance

App Service autoscale

Redis cache

SQL or Cosmos based on access pattern

Rate limiting + overload protection

Key lesson: Design for failure, not just success.

1️⃣4️⃣ Core Engineering Principles

Idempotency > Hope

Protect DB under stress

Push load outward (Edge → Cache → App → DB)

Separate ingestion from processing

Accept distributed system limits

Query pattern drives data modeling

Exactly-once requires cooperation

Design for observability
