This paper presents a practitioner case study of the distributed system architecture, observability discipline, and release engineering practices behind a major food delivery platform, the largest in Latin America, operating at a sustained scale of over 200 million orders per year with 99.9%+ availability. We document the seven-layer architectural stack: (1) a client gateway; (2) an API edge; (3) microservices on Java 17 / Spring Boot; (4) asynchronous messaging via Apache Kafka and AWS SQS/SNS; (5) polyglot persistence across PostgreSQL, DynamoDB, Redis, and a data lake on S3; (6) infrastructure managed through Kubernetes on AWS EKS and Terraform; and (7) a multi-pillar observability stack combining New Relic APM, Prometheus, Grafana, and centralised logging via LogZ. We describe the canary deployment methodology, which shifts traffic progressively from 1% to 100% behind automated rollback gates triggered by error rate, latency, and resource thresholds, and the six Kubernetes production controls (HPA auto-scaling, resource quotas, liveness probes, rolling updates, pod disruption budgets, and namespace isolation) that guarantee zero-downtime updates. We also document the alerting pipeline that connects technical signals to engineering response, and the use of Git as the single source of truth for code, infrastructure, and configuration. Operational outcomes include 200M+ annual transactions sustained at 99.9%+ availability, p99 API latency under 200 ms, a 40% latency reduction from the microservices migration, a 15% infrastructure cost reduction through FinOps governance, 90% backoffice automation, and the capacity to load-test the platform at three times peak production traffic on a weekly schedule.
We distill five engineering principles transferable to practitioners building or operating comparable platforms: polyglot persistence as a requirement at scale, observability as a precondition for high availability, automated rollback as the enabler of continuous deployment, infrastructure-as-code as the only sustainable approach at scale, and domain alignment as the organising principle that makes distributed systems operationally viable.
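To make the canary methodology summarised above more concrete, the following is a minimal sketch of a rollback gate, not the platform's actual implementation. The intermediate traffic steps, the threshold values, and all names (`Metrics`, `Gate`, `run_canary`, `observe`) are assumptions introduced for illustration; the abstract states only that traffic moves from 1% to 100% and that the gates watch error rate, latency, and resource usage.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Metrics:
    """Canary health signals sampled at one traffic step (hypothetical shape)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile request latency in milliseconds
    cpu_utilisation: float  # fraction of requested CPU in use, 0.0 to 1.0


@dataclass(frozen=True)
class Gate:
    """Rollback thresholds. The default values are illustrative assumptions."""
    max_error_rate: float = 0.01
    max_p99_latency_ms: float = 200.0
    max_cpu_utilisation: float = 0.80

    def violations(self, m: Metrics) -> List[str]:
        """Return the names of every signal that exceeds its threshold."""
        failed = []
        if m.error_rate > self.max_error_rate:
            failed.append("error_rate")
        if m.p99_latency_ms > self.max_p99_latency_ms:
            failed.append("p99_latency_ms")
        if m.cpu_utilisation > self.max_cpu_utilisation:
            failed.append("cpu_utilisation")
        return failed


# Assumed schedule: only the 1% start and the 100% end come from the text.
TRAFFIC_STEPS: Tuple[int, ...] = (1, 5, 25, 50, 100)


def run_canary(
    observe: Callable[[int], Metrics],
    gate: Gate = Gate(),
    steps: Sequence[int] = TRAFFIC_STEPS,
) -> Tuple[int, List[str]]:
    """Walk the traffic schedule, rolling back at the first gate violation.

    `observe(percent)` is a caller-supplied function that shifts `percent`
    of traffic to the canary and returns the metrics measured at that step.
    Returns (final_percent, violations); final_percent is 0 after a rollback.
    """
    for percent in steps:
        failed = gate.violations(observe(percent))
        if failed:
            return 0, failed  # automated rollback: canary receives no traffic
    return steps[-1], []
```

In this sketch a release that stays healthy at every step ends at 100% with no violations, while one whose p99 latency degrades at the 25% step is rolled back to 0% with `["p99_latency_ms"]` reported as the cause. Keeping the gate a pure function of sampled metrics makes the rollback decision easy to test independently of the deployment tooling.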
Harison Pereira Bila de Carvalho studied this question.