Revolutionizing PDF Statements: A Kubernetes Journey

Scaling Paytm Money's Weekly Statements with CRD Innovation

Slide 1: Welcome to Our Story

A DevOps Manager's Journey at Paytm Money

  • Your Host: DevOps Manager at PaytmMoney.com, sharing insights on PDF generation.
  • The Talk: CRD-Based PDF Statement Generation, using Kubernetes CustomResourceDefinitions (CRDs) to scale statement delivery.
  • When: Presented in April 2025, reflecting on past innovations.
  • What's Coming: A deep dive into the journey from legacy to cutting-edge.
  • Engage: Join us as we explore the transformation of Paytm Money's infrastructure.

Slide 2: Our Roadmap

Navigating Through the Presentation

  • Introduction & STAR: Setting the scene with our situation, task, action, and result.
  • Business Surge: How rapid growth and constraints shaped our journey.
  • Legacy Challenges: The limitations of our initial architecture.
  • CRD Innovation: A detailed look at our CRD-based solution.
  • What's Next: Future steps and opportunities for further enhancement.

Slide 3: The Beginning

Launching Paytm Money's Zero-Fee Platform

  • The Situation: In mid-2018, Paytm Money launched with zero fees for mutual funds.
  • The Task: Generate and email over 1 million PDF statements weekly, within India.
  • Our Action: Developed a Kubernetes-native solution built on CRDs, a CronJob, and custom controllers.
  • The Result: Jobs completed in under 6 hours, with a 40% reduction in infra costs.
  • A Milestone: Achieved an error rate below 0.1%, setting a new standard.
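
A quick back-of-the-envelope check of what the result implies, using the 1 million statements and the 6-hour window from this slide and the default parallelism of 200 quoted on the StatementBatch slide. The function names are illustrative only.

```python
def required_rate(statements: int, window_hours: float) -> float:
    """Average statements per second needed to finish inside the window."""
    return statements / (window_hours * 3600)


def seconds_per_statement(rate: float, parallelism: int) -> float:
    """Time budget per statement for each of `parallelism` concurrent workers."""
    return parallelism / rate


rate = required_rate(1_000_000, 6)          # figures from this slide
budget = seconds_per_statement(rate, 200)   # 200 = default parallelism from the deck

print(f"{rate:.1f} statements/s, {budget:.2f} s per statement per worker")
```

That works out to roughly 46 statements per second on average, or a little over 4 seconds per statement for each of 200 parallel workers.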

Slide 4: Growth & Constraints

Navigating Rapid Expansion in India

  • Explosive Growth: 750,000+ early registrations in days, growing to 21 million users.
  • India-Only: Limited to the AWS Mumbai region (ap-south-1), running self-managed Kubernetes.
  • DevOps to the Rescue: Dev teams focused on product features, leaving statement delivery to DevOps.
  • Challenges: EKS was not available to us, so every solution had to work on the self-managed cluster.
  • Our Response: Used an open-source autoscaler to absorb the load within the India region.

Slide 5: The Legacy System

Understanding Our Starting Point

  • Java Monolith: Ran on large EC2 instances, struggling with scalability.
  • PDF Library Woes: Heavy library and poor parallelism led to Friday backlogs.
  • Email Issues: Multiple outbound IPs caused spam filtering and low deliverability.
  • Cost Inefficiency: High fixed EC2 costs with no elastic response to load spikes.
  • The Need for Change: A clear signal that a new approach was necessary for growth.

Slide 6: First Steps to Containers

Transitioning to a More Scalable Solution

  • Python Rewrite: Statement generator rewritten in Python for better performance.
  • Containerization: Moved to self-managed Kubernetes on EC2 for scalability.
  • GIL Limitations: Hit Python's Global Interpreter Lock, which kept each process from using more than one core for CPU-bound work.
  • Marginal Gains: Performance improved slightly, but costs remained high.
  • The Journey Continues: Recognized the need for a more innovative approach.

Slide 7: Our Final Architecture

A Comprehensive Overview

  • CronJob Trigger: Initiates batch creation every Friday at 3 PM.
  • StatementBatch CR: Defines the master job with declarative parameters.
  • Statement CRs: Represent per-user tasks for efficient processing.
  • Worker Pods: Handle PDF generation and email sending within pods.
  • SMTP Relay: Uses a fixed Elastic IP for high deliverability.

Slide 8: StatementBatch CRD

Defining the Master Job

  • CRD Structure: Defines the master job with declarative parameters.
  • Batch ID: Unique identifier for each batch of statements.
  • User Count: Specifies the number of users for the batch.
  • Parallelism: Controls the number of parallel tasks, defaulting to 200.
  • Timeout: Sets a timeout for batch processing, defaulting to 540 minutes.
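
The fields above can be pictured as a small typed spec. This is a minimal sketch, not the CRD from the talk: the API group `statements.example.com/v1`, the camelCase field names, and the resource name are assumptions; only the four fields and the two defaults come from the slide.

```python
from dataclasses import dataclass


@dataclass
class StatementBatchSpec:
    batch_id: str                 # unique identifier for the batch
    user_count: int               # number of users covered by the batch
    parallelism: int = 200        # default quoted on the slide
    timeout_minutes: int = 540    # default quoted on the slide


def to_manifest(name: str, spec: StatementBatchSpec) -> dict:
    """Render the spec as a custom resource body (field names are assumed)."""
    return {
        "apiVersion": "statements.example.com/v1",  # hypothetical group/version
        "kind": "StatementBatch",
        "metadata": {"name": name},
        "spec": {
            "batchId": spec.batch_id,
            "userCount": spec.user_count,
            "parallelism": spec.parallelism,
            "timeoutMinutes": spec.timeout_minutes,
        },
    }


batch = to_manifest("weekly-2025-04-04", StatementBatchSpec("2025-04-04", 1_000_000))
```

Keeping the defaults in one place means the CronJob only has to supply the batch ID and user count.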

Slide 9: Statement CRD

Tracking Per-User Tasks

  • CRD Structure: Represents per-user work items with status tracking.
  • User ID: Unique identifier for the user receiving the statement.
  • Email: Email address where the statement will be sent.
  • Data Key: Key to access user data for statement generation.
  • Retry Count: Tracks retries, defaulting to 0, up to a maximum.
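
A per-user work item with these fields might look like the sketch below. The API group, the field names, the naming scheme, and the choice to keep the retry count under `status` are all assumptions made for illustration; the slide only names the fields and the default of 0.

```python
from dataclasses import dataclass


@dataclass
class StatementItem:
    user_id: str
    email: str
    data_key: str          # key used to look up the user's statement data
    retry_count: int = 0   # default quoted on the slide
    phase: str = "Pending"

    def to_manifest(self, batch_id: str) -> dict:
        """Render the item as a custom resource body (layout is assumed)."""
        return {
            "apiVersion": "statements.example.com/v1",  # hypothetical group/version
            "kind": "Statement",
            "metadata": {"name": f"{batch_id}-{self.user_id}"},
            "spec": {
                "userId": self.user_id,
                "email": self.email,
                "dataKey": self.data_key,
            },
            # Assumed placement: progress and retries tracked under status.
            "status": {"phase": self.phase, "retryCount": self.retry_count},
        }


item = StatementItem("u123", "user@example.com", "statements/u123.json")
manifest = item.to_manifest("2025-04-04")
```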

Slide 10: CronJob Scheduling

Automating Weekly Statements

  • Trigger Time: Runs at 3 PM every Friday to start the statement process.
  • Job Creation: Creates a StatementBatch CR to manage the batch process.
  • Container Image: Uses paytm/statement-initiator:v1 to initiate the process.
  • Restart Policy: Set to OnFailure, so the initiator container is restarted if it exits with an error.
  • Date Argument: Passes the current date to the container for processing.
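
The settings above map onto a CronJob manifest roughly as follows, written here as a Python dict. This is a sketch under assumptions: the resource name, the container name, and the `statement-initiator --date=...` command line are hypothetical; the schedule, image, and restart policy are taken from the slide.

```python
import json


def build_cronjob() -> dict:
    """CronJob that starts the weekly statement run (a sketch, not the talk's manifest)."""
    return {
        "apiVersion": "batch/v1",  # older clusters served CronJob as batch/v1beta1
        "kind": "CronJob",
        "metadata": {"name": "statement-initiator"},  # hypothetical name
        "spec": {
            # minute hour day-of-month month day-of-week: 15:00 on Fridays.
            # Evaluated in the controller manager's time zone unless spec.timeZone is set.
            "schedule": "0 15 * * 5",
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "OnFailure",
                "containers": [{
                    "name": "initiator",
                    "image": "paytm/statement-initiator:v1",
                    # Hypothetical entrypoint and flag; the shell expands the date at run time.
                    "command": ["/bin/sh", "-c",
                                "statement-initiator --date=$(date +%F)"],
                }],
            }}}},
        },
    }


cronjob = build_cronjob()
print(json.dumps(cronjob, indent=2))
```

Expanding the date inside the container keeps the manifest static, so the same CronJob can run every week unchanged.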

Slide 11: Batch Controller

Managing Large Batches

  • Pending Phase: Creates Statement CRs to spawn per-user tasks.
  • Running Phase: Monitors progress and sets the batch to Completed when every statement has finished.
  • Progress Metrics: Updates metrics to track processed vs. failed counts.
  • Splitting Batches: Efficiently divides large batches into manageable Statement CRs.
  • Status Tracking: Ensures all parts of the batch are accounted for and processed.
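
The two phases can be sketched as a single reconcile function over in-memory data. This is an illustration only: a plain list stands in for the Statement CRs in the cluster, and the "Failed" phase name and item naming scheme are assumptions.

```python
from dataclasses import dataclass


@dataclass
class BatchStatus:
    phase: str = "Pending"
    processed: int = 0
    failed: int = 0


def reconcile_batch(status, batch_id, user_ids, statements):
    """One reconcile pass; `statements` stands in for the Statement CRs."""
    if status.phase == "Pending":
        # Split the batch: one Statement work item per user.
        for uid in user_ids:
            statements.append({"name": f"{batch_id}-{uid}", "phase": "Pending"})
        status.phase = "Running"
    elif status.phase == "Running":
        # Refresh progress counts, then complete once nothing is outstanding.
        status.processed = sum(s["phase"] == "Sent" for s in statements)
        status.failed = sum(s["phase"] == "Failed" for s in statements)
        if status.processed + status.failed == len(statements):
            status.phase = "Completed"
    return status


status, items = BatchStatus(), []
users = ["u1", "u2", "u3"]
reconcile_batch(status, "b1", users, items)      # Pending -> Running
items[0]["phase"] = items[1]["phase"] = "Sent"   # simulate worker results
items[2]["phase"] = "Failed"
reconcile_batch(status, "b1", users, items)      # Running -> Completed
```

Because each pass only reads current state and moves one step, it can be re-run safely after a controller restart.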

Slide 12: Statement Controller

Handling Individual Statements

  • Pending to Generating: Transitions to generating phase, creating a worker pod.
  • Generating to Generated: Updates status when the pod finishes PDF generation.
  • Generated to Sending: Initiates the email sending process.
  • Sending to Sent: Updates status upon successful email delivery.
  • Handling Failures: Retries failed statements up to a maximum retry count.
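
The phase transitions above form a small state machine. The sketch below assumes a terminal "Failed" phase, a maximum of 3 retries, and that a failed step sends the statement back to Pending; none of these details are given on the slide.

```python
from enum import Enum


class Phase(str, Enum):
    PENDING = "Pending"
    GENERATING = "Generating"
    GENERATED = "Generated"
    SENDING = "Sending"
    SENT = "Sent"
    FAILED = "Failed"  # assumed terminal phase once retries are exhausted


NEXT = {
    Phase.PENDING: Phase.GENERATING,
    Phase.GENERATING: Phase.GENERATED,
    Phase.GENERATED: Phase.SENDING,
    Phase.SENDING: Phase.SENT,
}


def advance(phase, retry_count, step_ok, max_retries=3):
    """Return (new_phase, new_retry_count) after one step succeeds or fails."""
    if phase in (Phase.SENT, Phase.FAILED):
        return phase, retry_count             # terminal: nothing to do
    if step_ok:
        return NEXT[phase], retry_count
    if retry_count < max_retries:
        return Phase.PENDING, retry_count + 1  # assumed: retry from the start
    return Phase.FAILED, retry_count


phase, retries = Phase.PENDING, 0
for ok in [True, False, True, True, True, True]:  # one failed generation, then success
    phase, retries = advance(phase, retries, ok)
```

In this run the statement ends in the Sent phase with one retry recorded.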

Slide 13: Worker Pod Specification

Efficient PDF Generation and Email Sending

  • PDF Generator: Uses paytm/pdf-generator:v1 with specified resource limits.
  • Email Sender: Utilizes paytm/email-sender:v1, configured for SMTP relay.
  • Volume for PDFs: Employs an emptyDir volume for PDF file exchange.
  • Elastic IP: All pods egress via the same Elastic IP for consistency.
  • Restart Policy: Set to Never, so containers are not restarted in place; failed statements are retried through the Statement controller's retry count instead.
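
A worker pod with this shape could be described as follows, again as a Python dict. The images, the emptyDir volume, and the restart policy come from the slide; the resource limit values, volume name, mount path, and SMTP environment variable are placeholders. The slide does not say how the sender waits for the generator, so that is left out, and the shared Elastic IP is typically arranged outside the pod spec (for example at a NAT gateway), so it does not appear here.

```python
def build_worker_pod(statement_name: str) -> dict:
    """Two containers sharing a scratch volume for the generated PDF (a sketch)."""
    shared = {"name": "pdf-out", "mountPath": "/out"}  # hypothetical name and path
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"worker-{statement_name}"},
        "spec": {
            "restartPolicy": "Never",
            "volumes": [{"name": "pdf-out", "emptyDir": {}}],
            "containers": [
                {
                    "name": "pdf-generator",
                    "image": "paytm/pdf-generator:v1",
                    # Placeholder limits; the slide does not give the actual values.
                    "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
                    "volumeMounts": [shared],
                },
                {
                    "name": "email-sender",
                    "image": "paytm/email-sender:v1",
                    # Hypothetical variable name and relay address.
                    "env": [{"name": "SMTP_RELAY_HOST", "value": "smtp-relay.internal"}],
                    "volumeMounts": [shared],
                },
            ],
        },
    }


pod = build_worker_pod("2025-04-04-u123")
```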

Slide 14: Security and Access

Ensuring Secure Operations

  • Service Account: Created as statement-controller-sa for secure access.
  • Cluster Role: Defines permissions for statementbatches and statements.
  • Cluster Role Binding: Binds the role to the service account for least privilege.
  • Pod Permissions: Grants permissions to create, delete, and get pods.
  • Security Principle: Operates on the principle of least privilege for safety.
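
The three RBAC objects can be sketched as plain dicts, with a small helper to check what the role permits. The service account name, the resource names, and the pod verbs come from the slide; the API group, namespace, role name, and the verbs on the custom resources are assumptions.

```python
GROUP = "statements.example.com"  # hypothetical API group for the two CRDs
NAMESPACE = "default"             # hypothetical namespace
RBAC = "rbac.authorization.k8s.io"

service_account = {
    "apiVersion": "v1", "kind": "ServiceAccount",
    "metadata": {"name": "statement-controller-sa", "namespace": NAMESPACE},
}

cluster_role = {
    "apiVersion": f"{RBAC}/v1", "kind": "ClusterRole",
    "metadata": {"name": "statement-controller"},  # hypothetical name
    "rules": [
        {   # verbs on the custom resources are assumed; the slide does not list them
            "apiGroups": [GROUP],
            "resources": ["statementbatches", "statements"],
            "verbs": ["get", "list", "watch", "create", "update", "patch"],
        },
        {   # pod verbs exactly as listed on the slide
            "apiGroups": [""],
            "resources": ["pods"],
            "verbs": ["create", "delete", "get"],
        },
    ],
}

cluster_role_binding = {
    "apiVersion": f"{RBAC}/v1", "kind": "ClusterRoleBinding",
    "metadata": {"name": "statement-controller"},
    "roleRef": {"apiGroup": RBAC, "kind": "ClusterRole", "name": "statement-controller"},
    "subjects": [{"kind": "ServiceAccount", "name": "statement-controller-sa",
                  "namespace": NAMESPACE}],
}


def allowed(rules, group, resource, verb):
    """True if any rule grants `verb` on `resource` in `group`."""
    return any(group in r["apiGroups"] and resource in r["resources"]
               and verb in r["verbs"] for r in rules)
```

For example, `allowed(cluster_role["rules"], "", "pods", "create")` is true, while a request for secrets is not granted by any rule.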

Slide 15: Thank You

Appreciating Your Engagement

  • Gratitude: Thank you for joining us on this innovative journey.
  • Your Insights: We value your questions and look forward to your feedback.
  • Future Collaboration: Let's keep the conversation going and explore new possibilities.
  • Stay Connected: Follow us for more updates on our technological advancements.
  • Farewell: Thank you again, and we hope to see you at our next event!