Revolutionizing Weekly Statements: Paytm Money's Journey

From EC2 to Kubernetes: Scaling for a Million Users

Slide 1: Welcome to the Transformation

Jibu N Chacko's Insightful Journey at Paytm Money

  • Introducing the Speaker: Jibu N Chacko, Senior SRE/DevOps Manager, welcomes you to the evolution of Paytm Money's statement system.
  • The Talk's Focus: This presentation will cover the journey from EC2-hosted Java to a Kubernetes-native pipeline.
  • Technical Deep Dive: Explore the challenges, decisions, and lessons learned during a tenfold surge in users.
  • Setting the Stage: Understand the context and the urgent need for a scalable solution.

Slide 2: A Career of Excellence

Jibu N Chacko's Professional Background

  • Over 15 Years of Experience: Jibu has held SRE/DevOps roles at Oracle, Walmart Labs, and Booking.com.
  • Core Strengths: Expertise in cloud architecture, automation, CI/CD, container orchestration, and monitoring.
  • Mission at Paytm Money: Brought in to modernize and scale infrastructure under tight timelines.
  • Passion for Platforms: Passionate about building resilient, scalable, and secure platforms.

Slide 3: Role at Paytm Money

Leading the Charge in Bangalore

  • DevOps Manager: Served as DevOps Manager from April 2018 to August 2019 in Bangalore.
  • Infrastructure Leadership: Directed AWS infrastructure and DevOps practices for Paytm Money.
  • Automation and Kubernetes: Automated deployments and config management with Kubernetes on EC2.
  • Monitoring and High Availability: Built monitoring/alerting frameworks to ensure high availability.

Slide 4: The Surge and Its Challenges

From 100k to 1M Users Overnight

  • Initial Statements: Started with weekly statements for ~100,000 users.
  • Zero-Fee Model Impact: Launching a zero-fee model led to over 1 million users almost overnight.
  • Pipeline Failure: The existing pipeline couldn't scale, resulting in SLA breaches.
  • Urgent Need for Scalability: A horizontally scalable, highly parallel solution became crucial.

Slide 5: Java on EC2 Limitations

The Monolithic Struggle

  • Monolithic Java App: Used Apache PDFBox in a monolithic Java app on a single EC2 VM.
  • GC Pauses and Bottlenecks: Long GC pauses and single-threaded PDF generation throttled throughput under load.
  • Batch Time Challenges: Batches took multiple hours at scale while most of the VM's resources sat underutilized.
  • Predictability Issues: Even larger VMs couldn't deliver predictable performance or cost control.
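
A rough estimate shows why a single-threaded generator leads to multi-hour batches. The sketch below is illustrative only: the 50 ms render cost per statement is an assumption (the talk gives no measured figure), and `batch_hours` is a hypothetical helper. Only the user counts come from the slides.

```python
def batch_hours(users: int, seconds_per_pdf: float, workers: int = 1) -> float:
    """Estimate wall-clock hours for a batch, assuming work divides evenly across workers."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return users * seconds_per_pdf / workers / 3600


# Assumption for illustration only, not a measured value from the talk.
ASSUMED_SECONDS_PER_PDF = 0.05

for users in (100_000, 1_000_000):
    hours = batch_hours(users, ASSUMED_SECONDS_PER_PDF)
    print(f"{users:>9,} users: {hours:.1f} h with one worker")
```

Under that assumed cost, the batch grows from about 1.4 hours to about 13.9 hours with a single worker, which is why a bigger VM alone does not help if generation stays single-threaded.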

Slide 6: Python Containers' Limitations

The First Modernization Attempt

  • Python Rewrite: The generator was rewritten in Python and containerized for faster startup and better concurrency.
  • GIL and CPU Quotas: Python's GIL and Docker CPU quotas limited true parallelism.
  • Container Overload: Reaching the required parallelism took many containers, driving up costs significantly.
  • Throughput Struggles: Even at that higher cost, packing in enough containers to hit throughput targets risked resource contention.
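
The container-count problem above can be sketched numerically. This is a simplified model under stated assumptions: one CPU-bound Python process per container, so the GIL caps useful CPU at one core per container regardless of quota. The helper `containers_needed` and all numbers in the usage lines are hypothetical.

```python
import math


def containers_needed(users: int, seconds_per_pdf: float,
                      window_seconds: float, cpu_quota: float) -> int:
    """Containers required to finish within the window.

    Assumes one GIL-bound Python process per container: it executes
    bytecode on at most one core at a time, so any quota above 1.0 CPU
    is treated as unusable for this workload.
    """
    effective_cores = min(1.0, cpu_quota)
    total_cpu_seconds = users * seconds_per_pdf
    return math.ceil(total_cpu_seconds / (window_seconds * effective_cores))


# Hypothetical inputs: 1M statements, 0.2 CPU-seconds each, 2-hour window.
for quota in (0.5, 1.0, 2.0):
    n = containers_needed(1_000_000, 0.2, 2 * 3600, quota)
    print(f"cpu quota {quota}: {n} containers")
```

In this model, raising the quota from 1.0 to 2.0 CPUs does not reduce the container count, while halving it to 0.5 doubles the count. The only lever left is adding containers.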

Slide 7: AWS Lambda and SQS Prototype

The Serverless Experiment

  • Serverless Architecture: Prototyped the system with an SQS queue feeding Lambda functions.
  • Cold Starts and Limits: Cold starts and account concurrency limits caused latency spikes.
  • Unpredictable Throughput: SQS-to-Lambda mapping delays led to unpredictable throughput.
  • SLA Challenges: Queue-based scaling couldn't guarantee the hard Friday-midnight SLA deadline.
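
A hard deadline turns the concurrency limit into a simple capacity check. The sketch below is an assumption-laden illustration, not the team's tooling: it reads "Friday midnight" as 00:00 at the end of Friday, uses naive local datetimes, and applies Little's law (concurrency = arrival rate × average duration). The function names and the 2-second invocation time are hypothetical.

```python
from datetime import datetime, timedelta


def next_friday_midnight(now: datetime) -> datetime:
    """Next 00:00 that ends a Friday (i.e. Saturday 00:00), strictly after `now`."""
    days_ahead = (5 - now.weekday()) % 7  # Monday is 0, so Saturday is 5
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def required_concurrency(pending: int, now: datetime,
                         seconds_per_invocation: float) -> float:
    """Average concurrent invocations needed to clear `pending` by the deadline."""
    remaining = (next_friday_midnight(now) - now).total_seconds()
    rate_per_second = pending / remaining
    return rate_per_second * seconds_per_invocation  # Little's law


# Hypothetical scenario: 1M statements still pending at 18:00 on a Friday.
start = datetime(2019, 3, 1, 18, 0)
print(next_friday_midnight(start))
print(round(required_concurrency(1_000_000, start, 2.0), 1))
```

If the result exceeds the concurrency the account can actually sustain, or cold starts inflate the per-invocation time, the deadline is missed no matter how the queue scales.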

Slide 8: Embracing Kubernetes

CRD/Operator for Predictable Scaling

  • Kubernetes-Native Approach: Adopted Kubernetes with CustomResourceDefinitions and an Operator.
  • Declarative Infrastructure: Used CRDs for declarative infrastructure and custom workflow management.
  • Self-Healing and Observability: Leveraged Kubernetes for self-healing and observability benefits.
  • Manual Kubernetes Setup: Installed and configured Kubernetes manually on EC2 nodes because EKS was not available.
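
The declarative idea behind a CRD and Operator can be sketched without a cluster. In the illustration below, a custom resource states the desired outcome and a reconcile function computes the actions needed to converge on it. Everything here is hypothetical: the `example.com/v1alpha1` API group, the `totalUsers` and `shardSize` fields, and the sharding scheme are assumptions, and the real controller would talk to the Kubernetes API rather than plain dictionaries.

```python
import math

# Hypothetical custom resource, shown as the dict a controller would receive.
statement_batch = {
    "apiVersion": "example.com/v1alpha1",  # assumed group/version
    "kind": "StatementBatch",
    "metadata": {"name": "weekly-statements"},
    "spec": {"totalUsers": 1_000_000, "shardSize": 250_000},  # assumed fields
}


def desired_shards(spec: dict) -> int:
    """Number of worker shards implied by the spec."""
    return math.ceil(spec["totalUsers"] / spec["shardSize"])


def reconcile(batch: dict, existing_shards: list[int]) -> dict:
    """Compare desired state with observed state and return the actions to take.

    Calling this again after the actions are applied returns no further
    actions, which is the property that makes the loop self-healing.
    """
    wanted = set(range(desired_shards(batch["spec"])))
    observed = set(existing_shards)
    return {
        "create": sorted(wanted - observed),
        "delete": sorted(observed - wanted),
    }


# Shards 1 and 3 are missing (for example, their workers crashed).
print(reconcile(statement_batch, existing_shards=[0, 2]))
```

Because the function only looks at the gap between desired and observed state, it does not matter why a shard is missing. That is what gives predictable scaling compared with reacting to queue depth.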

Slide 9: The Final Architecture

A Flowchart Overview

  • Components and Flow: Includes CronJob, StatementBatch CRD, Batch Controller, Redis Cluster, and more.
  • Redis for State Tracking: Redis tracks per-batch state and propagates failures between components.
  • Dev-Team API Integration: User and Statement APIs decoupled data access from operator logic.
  • Workflow and Retries: The workflow runs from batch creation through PDF and email generation, guarded by a two-tier retry mechanism.
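
The slides do not spell out what the two retry tiers are, so the sketch below is one plausible reading, not the actual implementation: tier 1 retries a single statement in place, and tier 2 re-runs whatever still failed in a later pass. A plain dict stands in for Redis state tracking, and the `render` callable, status strings, and function names are all hypothetical.

```python
from typing import Callable, Optional


def generate_with_retries(user_id: str, render: Callable[[str], str],
                          attempts: int = 3) -> Optional[str]:
    """Tier 1 (assumed): retry one statement immediately, up to `attempts` times."""
    for _ in range(attempts):
        try:
            return render(user_id)
        except Exception:  # broad catch keeps the sketch short
            continue
    return None


def run_batch(user_ids: list[str], render: Callable[[str], str],
              state: dict[str, str], passes: int = 2,
              attempts: int = 3) -> list[str]:
    """Tier 2 (assumed): re-run the users that failed tier 1 in a later pass.

    `state` stands in for Redis: it records "done" or "failed" per user so
    another component could read it. Returns the users still failing.
    """
    pending = list(user_ids)
    for _ in range(passes):
        failed = []
        for uid in pending:
            if generate_with_retries(uid, render, attempts) is None:
                state[uid] = "failed"
                failed.append(uid)
            else:
                state[uid] = "done"
        pending = failed
        if not pending:
            break
    return pending
```

A production version would add backoff between attempts and a delay between passes. Both are omitted here to keep the control flow visible.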

Slide 10: Thank You

For Joining Us on This Journey

  • Your Attention Appreciated: Thank you for your time and attention during this presentation.
  • Questions Welcome: Feel free to ask questions on architecture, implementation, or lessons learned.
  • Sharing Resources: Technical artifacts such as Helm charts and controller code are available after the session.
  • A Rewarding Experience: It was a pleasure sharing the transformation journey of Paytm Money's weekly statement system.