Revolutionizing Weekly Statements: Paytm Money's Journey

From EC2 to Kubernetes: Scaling for a Million Users

Slide 1: Welcome to the Transformation

Jibu N Chacko's Insightful Journey at Paytm Money

Introducing the Speaker: Jibu N Chacko, Senior SRE/DevOps Manager, presents the evolution of Paytm Money's weekly statement system.
The Talk's Focus: The journey from a Java application hosted on EC2 to a Kubernetes-native pipeline.
Technical Deep Dive: The challenges, decisions, and lessons learned during a tenfold surge in users.
Setting the Stage: The context behind the project and the urgent need for a scalable solution.

Jibu N Chacko's Professional Background

15 Years of Experience: Jibu has over 15 years in SRE/DevOps roles at Oracle, Walmart Labs, and Booking.com.
Core Strengths: Expertise in cloud architecture, automation, CI/CD, container orchestration, and monitoring.
Mission at Paytm Money: Brought in to modernize and scale infrastructure under tight timelines.
Passion for Platforms: Focused on building resilient, scalable, and secure platforms.

Leading the Charge in Bangalore

DevOps Manager: Served as DevOps Manager from April 2018 to August 2019 in Bangalore.
Infrastructure Leadership: Directed AWS infrastructure and DevOps practices for Paytm Money.
Automation and Kubernetes: Automated deployments and configuration management using Kubernetes running on EC2.
Monitoring and High Availability: Built monitoring and alerting frameworks to maintain high availability.

From 100k to 1M Users Overnight

Initial Scale: The system originally generated weekly statements for about 100,000 users.
Zero-Fee Model Impact: The launch of a zero-fee model grew the user base to over 1 million almost overnight.
Pipeline Failure: The existing pipeline could not scale to that volume, and SLAs were breached.
Urgent Need for Scalability: A horizontally scalable, highly parallel solution became essential.

The Monolithic Struggle

Monolithic Java App: Statements were generated with Apache PDFBox inside a monolithic Java application on a single EC2 VM.
GC Pauses and Bottlenecks: Under load, the application suffered long GC pauses, and PDF generation ran on a single thread.
Batch Time Challenges: At scale, batches took multiple hours while most of the VM's resources sat underutilized.
Predictability Issues: Even larger VMs could not deliver predictable performance or cost control.

The First Modernization Attempt

Python Rewrite: The generator was rewritten in Python and containerized for faster startup and better concurrency.
GIL and CPU Quotas: Python's GIL and Docker CPU quotas limited true parallelism inside each container.
Container Overload: Reaching the required throughput took many containers, which drove costs up significantly.
Throughput Struggles: Even at that higher cost, running enough containers to meet throughput targets risked resource contention.
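
The interaction between the GIL and a container CPU quota can be made concrete with a rough model. The sketch below is illustrative only: it assumes CPU-bound rendering in standard CPython, where the threads of one process run Python bytecode one at a time, so each process contributes at most one core and the container quota caps the total. The function names and all numbers (70 statements per second, 0.5 statements per core per second, a quota of 2 CPUs) are hypothetical and do not come from the talk.

```python
import math


def effective_cores(processes: int, cpu_quota: float) -> float:
    """Cores one container can actually use for CPU-bound Python work.

    Assumption: each process is limited to about one core by its GIL,
    and the container's CPU quota caps the sum across processes.
    """
    return min(float(processes), cpu_quota)


def containers_needed(target_per_sec: float, per_core_per_sec: float,
                      processes: int, cpu_quota: float) -> int:
    """Containers required to sustain the target rate under this model."""
    per_container = effective_cores(processes, cpu_quota) * per_core_per_sec
    return math.ceil(target_per_sec / per_container)


if __name__ == "__main__":
    # Hypothetical workload: 70 statements/s, 0.5 statements/s per core.
    # One multi-threaded process per container: the GIL leaves half of a
    # 2-CPU quota idle.
    print(containers_needed(70, 0.5, processes=1, cpu_quota=2.0))
    # Four processes per container: the quota, not the GIL, is now the cap.
    print(containers_needed(70, 0.5, processes=4, cpu_quota=2.0))
```

Under these assumptions, adding processes helps only until the quota is reached; beyond that point the only remaining lever is more containers, which matches the cost growth described above.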

The Serverless Experiment

Serverless Architecture: A prototype used an SQS queue feeding Lambda functions.
Cold Starts and Limits: Cold starts and account-level concurrency limits caused latency spikes.
Unpredictable Throughput: Delays in the SQS-to-Lambda event source mapping made throughput unpredictable.
SLA Challenges: Queue-based scaling could not guarantee the hard SLA deadline of Friday midnight.
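
A simplified simulation shows why gradual scale-up plus a concurrency ceiling makes the finish time of a large batch hard to guarantee. This is a sketch under stated assumptions, not a model of the team's actual setup: the function name, the starting concurrency, the per-minute ramp, the concurrency limit, and the 2-second processing time are all hypothetical parameters, and cold starts are ignored.

```python
def minutes_to_drain(messages: int, seconds_per_message: float,
                     start_concurrency: int, ramp_per_minute: int,
                     concurrency_limit: int) -> int:
    """Whole minutes needed to process a queue of `messages`.

    Each minute, `concurrency` workers each handle
    60 / seconds_per_message messages; concurrency then grows by
    `ramp_per_minute` until it reaches `concurrency_limit`.
    """
    remaining = float(messages)
    concurrency = start_concurrency
    minutes = 0
    while remaining > 0:
        concurrency = min(concurrency, concurrency_limit)
        remaining -= concurrency * 60.0 / seconds_per_message
        minutes += 1
        concurrency += ramp_per_minute
    return minutes


if __name__ == "__main__":
    # Hypothetical: 1,000,000 statements at 2 s each, limit of 1000.
    with_ramp = minutes_to_drain(1_000_000, 2.0, 5, 60, 1000)
    no_ramp = minutes_to_drain(1_000_000, 2.0, 1000, 0, 1000)
    print(with_ramp, no_ramp)
```

In this toy model the ramp alone adds several minutes to the batch, and any other workload sharing the same account limit lowers the ceiling further, so the completion time depends on factors outside the pipeline's control.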

CRD/Operator for Predictable Scaling

Kubernetes-Native Approach: Adopted Kubernetes with CustomResourceDefinitions (CRDs) and a custom Operator.
Declarative Infrastructure: CRDs describe each batch declaratively, and the Operator manages the custom workflow.
Self-Healing and Observability: Kubernetes provides self-healing and observability out of the box.
Manual Kubernetes Setup: Kubernetes was installed and configured manually on EC2 nodes because EKS was not available.
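
The core idea behind an Operator is a reconcile step that compares the declared state with the observed state and emits only the actions needed to close the gap. The sketch below shows that pattern in plain Python with no Kubernetes client. The `StatementBatchSpec` fields (`total_users`, `shard_size`), the phase strings, and the action names are assumptions made for illustration; the talk names a StatementBatch CRD but does not describe its schema.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StatementBatchSpec:
    """Hypothetical spec of a StatementBatch custom resource."""
    total_users: int
    shard_size: int


def desired_shards(spec: StatementBatchSpec) -> List[int]:
    """Shard indexes needed to cover every user in the batch."""
    return list(range(math.ceil(spec.total_users / spec.shard_size)))


def reconcile(spec: StatementBatchSpec,
              observed: Dict[int, str]) -> List[Tuple[str, int]]:
    """Return the actions that move observed worker state toward the spec.

    `observed` maps shard index -> phase ("Running", "Succeeded", "Failed").
    The function is idempotent: once nothing is missing or failed,
    it returns an empty list.
    """
    actions: List[Tuple[str, int]] = []
    wanted = desired_shards(spec)
    for shard in wanted:
        phase = observed.get(shard)
        if phase is None:
            actions.append(("create", shard))      # no worker yet
        elif phase == "Failed":
            actions.append(("recreate", shard))    # self-healing path
        # "Running" and "Succeeded" need no action.
    for shard in sorted(observed):
        if shard not in wanted:
            actions.append(("delete", shard))      # worker no longer declared
    return actions


if __name__ == "__main__":
    spec = StatementBatchSpec(total_users=1000, shard_size=250)
    print(reconcile(spec, {0: "Succeeded", 1: "Failed", 2: "Running"}))
```

Because the function looks only at current state, it can be rerun after a crash or a missed event and still converge, which is what gives this approach its predictable scaling compared with queue-driven triggers.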

A Flowchart Overview

Components and Flow: The pipeline includes a CronJob, the StatementBatch CRD, a Batch Controller, a Redis Cluster, and more.
Redis for State Tracking: Redis tracks per-batch state and propagates failures between components.
Dev-Team API Integration: User and Statement APIs owned by the development team decouple data access from operator logic.
Workflow and Retries: The workflow runs from batch creation through PDF and email generation, with a two-tier retry mechanism.
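
The slides do not define the two tiers, so the following is one plausible reading, sketched under that assumption: tier one retries a single statement immediately inside the worker, and tier two re-runs whatever is still marked failed in a later batch-level pass. A small in-memory class stands in for Redis here; `StateStore`, `process_batch`, and the status strings are hypothetical names, not the project's actual code.

```python
from typing import Callable, Dict, Iterable, List


class StateStore:
    """In-memory stand-in for the Redis state used to track each user."""

    def __init__(self) -> None:
        self._state: Dict[int, str] = {}

    def set(self, user_id: int, status: str) -> None:
        self._state[user_id] = status

    def get(self, user_id: int) -> str:
        return self._state.get(user_id, "pending")


def run_with_local_retries(task: Callable[[int], None], user_id: int,
                           attempts: int) -> bool:
    """Tier 1: retry one statement immediately, up to `attempts` times."""
    for _ in range(attempts):
        try:
            task(user_id)
            return True
        except Exception:  # a real worker would catch narrower error types
            continue
    return False


def process_batch(user_ids: Iterable[int], task: Callable[[int], None],
                  store: StateStore, local_attempts: int = 3,
                  batch_passes: int = 2) -> List[int]:
    """Tier 2: re-run users still failing, for up to `batch_passes` passes.

    Returns the user ids that remain failed after all passes.
    """
    pending = list(user_ids)
    for _ in range(batch_passes):
        still_failing: List[int] = []
        for user_id in pending:
            ok = run_with_local_retries(task, user_id, local_attempts)
            store.set(user_id, "done" if ok else "failed")
            if not ok:
                still_failing.append(user_id)
        pending = still_failing
        if not pending:
            break
    return pending
```

Recording the outcome after every attempt cycle means a controller can read the store to decide which users to requeue and which failures to surface, without rescanning the whole batch.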

Thank You for Joining Us on This Journey

Your Attention Appreciated: Thank you for your time and attention during this presentation.
Questions Welcome: Feel free to ask questions about the architecture, the implementation, or the lessons learned.
Sharing Resources: Technical artifacts such as Helm charts and controller code will be available after the session.
A Rewarding Experience: It was a pleasure to share the transformation of Paytm Money's weekly statement system.