Lessons from Building Distributed Systems at TCS
Real-world insights from designing scalable backend systems handling millions of requests daily at Tata Consultancy Services.
Since joining TCS in June 2025, I’ve been working on backend systems that handle millions of requests daily. Here are the hard-won lessons.
1. Event-Driven > Request-Response
Switching from synchronous REST calls to Apache Kafka for inter-service communication reduced our P99 latency by 60%.
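# Before: synchronous call; the caller blocks until service_b finishes
response = service_b.process(data)
# After: event-driven; publish the event and move on, a consumer does the work
# (passing a dict assumes the producer is configured with a JSON value serializer)
producer.send('events', {'type': 'process', 'data': data})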
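To make the decoupling concrete without a running broker, here is a minimal sketch that uses an in-process `queue.Queue` as a stand-in for a Kafka topic. The names `publish`, `consume`, and `processed` are hypothetical and not from our codebase; the point is only that the publisher returns as soon as the event is enqueued.

```python
import queue
import threading

# Stand-in for a Kafka topic: an in-process queue (assumption for illustration).
events = queue.Queue()
processed = []


def publish(event):
    """Producer side: enqueue the event and return immediately."""
    events.put(event)


def consume():
    """Consumer side: handle events until a None sentinel arrives."""
    while True:
        event = events.get()
        if event is None:
            break
        # Placeholder for the real work a downstream service would do.
        processed.append(event["data"].upper())


worker = threading.Thread(target=consume)
worker.start()

publish({"type": "process", "data": "order-42"})  # caller does not wait on the work
publish(None)  # sentinel that stops the consumer in this demo
worker.join()
```

With a real broker the queue becomes a topic and the consumer lives in another service, so a slow or restarting consumer no longer adds to the caller's latency.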
2. Redis is Not Just Caching
We use Redis for:
- Rate limiting (sliding window counters)
- Session management
- Real-time leaderboards
- Distributed locks (Redlock algorithm)
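As an illustration of the rate-limiting case, here is a minimal in-memory sketch of a sliding-window limiter. It is the "log" variant (it stores one timestamp per request) rather than the counter approximation, and it runs in a single process; the class name and the `clock` parameter are hypothetical. In Redis, the same idea is commonly built on a sorted set per key using `ZADD`, `ZREMRANGEBYSCORE`, and `ZCARD`.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within the last `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key):
        now = self.clock()
        hits = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A caller would check `limiter.allow(user_id)` before handling a request and return an HTTP 429 when it comes back `False`.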
3. Circuit Breakers Save Lives
When downstream services fail, circuit breakers stop callers from piling more requests onto them, which prevents cascading failures:
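const breaker = new CircuitBreaker(apiCall, {
  timeout: 3000,                 // treat calls slower than 3 s as failures
  errorThresholdPercentage: 50,  // open the circuit once 50% of calls fail
  resetTimeout: 30000,           // after 30 s, let a trial call through
});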
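To show what a breaker does internally, here is a simplified Python sketch of the closed, open, and half-open states. It is not the library used above: it trips on a count of consecutive failures instead of an error percentage, and it does not enforce a per-call timeout. The names `failure_threshold`, `CircuitOpenError`, and `clock` are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, func, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.func = func
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the downstream service.
                raise CircuitOpenError("circuit open; call skipped")
            # Half-open: the reset timeout has passed, so allow one trial call.
        try:
            result = self.func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is already struggling, which gives it room to recover.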
4. Observability is Non-Negotiable
You can’t fix what you can’t see. We use:
- Prometheus + Grafana for metrics
- Jaeger for distributed tracing
- ELK stack for centralized logging
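Centralized logging tends to work best when every line is machine-parseable and carries an ID that ties it to a trace. Here is a minimal sketch using only the standard library; the JSON field names and the `trace_id` attribute are assumptions for illustration, not our actual log schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None when the caller did not supply one.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout or a log shipper's input
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("orders")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the demo output out of the root logger

log.info("payment accepted", extra={"trace_id": "abc123"})
```

Searching the log store for one `trace_id` then returns every line a request produced across services, alongside its trace.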
5. Design for Failure
Every service assumes its dependencies will fail. Graceful degradation > complete outage.
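One simple form of graceful degradation is to fall back to the last known good response when a dependency is down. The sketch below assumes a hypothetical recommendations endpoint; `recommendations`, `fetch`, and the module-level cache are illustrative names only.

```python
from typing import Callable, Dict, List

# Last successful result per user, kept in memory for this sketch.
_last_good: Dict[str, List[str]] = {}


def recommendations(user_id: str, fetch: Callable[[str], List[str]]) -> List[str]:
    """Return fresh recommendations, or the last good ones if the dependency fails."""
    try:
        result = fetch(user_id)
    except Exception:
        # Degrade: serve stale data, or an empty list if nothing was cached.
        return _last_good.get(user_id, [])
    _last_good[user_id] = result
    return result
```

The page still renders with stale or empty recommendations, which is usually a better outcome than returning an error for the whole request.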