Enterprise Performance & Caching at Scale
Led performance testing and caching optimisation for a national restaurant chain whose APIs were buckling under peak traffic.
Challenge
National restaurant chain APIs couldn't handle peak load
Solution
Performance testing with k6, caching layer optimisation (Valkey), repeatable benchmarks
Result
APIs scaled for peak traffic, measurable latency improvements
The Problem
A national restaurant chain with thousands of locations ran their digital ordering, menu management, and loyalty systems through a suite of APIs. During peak hours — lunch rush, dinner rush, promotional events — the APIs buckled. Menu loads timed out. Order submissions failed. The loyalty system returned stale data.
The engineering team knew performance was bad. What they didn't have was a clear picture of where the bottlenecks were, what the actual limits were, and how the caching layer was (or wasn't) helping.
What I Led
I owned the performance engineering initiative end to end:
1. Baseline Performance Testing with k6
I stood up a comprehensive k6 test suite targeting the critical API paths — menu retrieval, order submission, payment processing, and loyalty point queries. We ran staged load tests: baseline, peak simulation, and stress tests at 2x expected peak. The data told us exactly where the system broke and at what thresholds.
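A staged k6 test like the ones described above can be sketched as follows. This is an illustrative scenario, not the engagement's actual script: the URL, stage sizes, and threshold values are placeholders, and running it requires the k6 runtime (`k6 run script.js`).

```javascript
// Illustrative k6 staged load test: baseline ramp, peak simulation,
// then stress at 2x expected peak. All numbers are placeholders.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },  // baseline ramp
    { duration: '5m', target: 500 },  // simulated peak
    { duration: '5m', target: 1000 }, // stress: 2x expected peak
    { duration: '2m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<400'], // fail the run if p95 exceeds 400 ms
    http_req_failed: ['rate<0.01'],   // fail if more than 1% of requests error
  },
};

export default function () {
  // Hypothetical menu endpoint standing in for the real critical path.
  const res = http.get('https://api.example.com/v1/menu/locations/1234');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

The `thresholds` block is what makes the run pass/fail automatically, which is also what makes the same script reusable as a CI gate.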
2. Caching Layer Audit and Optimisation (Valkey)
The existing caching strategy was inconsistent — some endpoints cached aggressively, others not at all. Cache invalidation logic had bugs that caused stale data in production. I led a systematic audit of the caching layer using Valkey (the Redis fork), redesigned TTL strategies per endpoint, and implemented cache warming for predictable high-traffic windows.
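The per-endpoint TTL idea can be sketched as a cache-aside wrapper. The endpoint names and TTL values below are illustrative, and an in-memory `Map` stands in for the Valkey client; with a real client, the read and write would be `GET` and `SET` with an `EX` expiry.

```javascript
// Illustrative per-endpoint TTLs (seconds); real values were tuned per endpoint.
const TTL_BY_ENDPOINT = {
  menu: 300,    // menus change rarely; cache aggressively
  loyalty: 30,  // loyalty balances must stay fresh
  orders: 0,    // never cache order state
};

// Cache-aside with per-endpoint TTLs. A Map stands in for Valkey here.
class CacheAside {
  constructor(clock = () => Date.now() / 1000) {
    this.store = new Map(); // key -> { value, expiresAt }
    this.clock = clock;
  }

  getOrLoad(endpoint, key, loader) {
    const ttl = TTL_BY_ENDPOINT[endpoint] ?? 0;
    if (ttl === 0) return loader(); // uncacheable endpoint: always hit the backend
    const now = this.clock();
    const k = `${endpoint}:${key}`;
    const hit = this.store.get(k);
    if (hit && hit.expiresAt > now) return hit.value; // fresh cache hit
    const value = loader();
    this.store.set(k, { value, expiresAt: now + ttl });
    return value;
  }

  invalidate(endpoint, key) {
    // Explicit invalidation on writes avoids serving stale data.
    this.store.delete(`${endpoint}:${key}`);
  }
}
```

Making TTLs a single per-endpoint table (rather than scattering them across call sites) is what makes an audit like this repeatable: one place to review, one place to tune.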
3. Benchmark Framework
I established a repeatable benchmarking framework so the team could measure performance regressions in CI. Every PR that touched a critical path got an automated performance gate.
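The core of a performance gate like this is a comparison of measured latency against a stored baseline with a tolerance band. A minimal sketch, with hypothetical function names and an illustrative 10% regression tolerance:

```javascript
// Nearest-rank percentile; sufficient precision for a CI gate.
function percentile(samples, p) {
  const ordered = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.round((p / 100) * ordered.length));
  return ordered[rank - 1];
}

// Fail the PR if p95 latency regresses more than `tolerance` (default 10%)
// against the stored baseline. Names and thresholds are illustrative.
function performanceGate(latenciesMs, baselineP95Ms, tolerance = 0.1) {
  const p95 = percentile(latenciesMs, 95);
  const limit = baselineP95Ms * (1 + tolerance);
  return { pass: p95 <= limit, p95, limit };
}
```

In CI this would run after the benchmark job, with the baseline checked in (or fetched from the last release) so every critical-path PR is judged against the same reference.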
The Outcome
- Menu APIs that previously timed out at 500 concurrent users now handled peak traffic comfortably
- P95 latency dropped measurably across all critical endpoints
- Cache hit rates improved significantly, reducing database load during peak windows
- The benchmark framework caught three performance regressions before they reached production
- The ops team stopped getting paged during lunch rush