Revolutionizing PDF Statements: A Kubernetes Journey
Scaling Paytm Money's Weekly Statements with CRD Innovation
Slide 1: Welcome to Our Story
A DevOps Manager's Journey at Paytm Money
- Your Host: DevOps Manager at PaytmMoney.com, sharing insights on PDF generation.
- The Talk: CRD-Based PDF Statement Generation, a solution to scale.
- When: Presented in April 2025, reflecting on past innovations.
- What's Coming: A deep dive into the journey from legacy to cutting-edge.
- Engage: Join us as we explore the transformation of Paytm Money's infrastructure.
Slide 2: Our Roadmap
Navigating Through the Presentation
- Introduction & STAR: Setting the scene with our situation, task, action, and result.
- Business Surge: How rapid growth and constraints shaped our journey.
- Legacy Challenges: The limitations of our initial architecture.
- CRD Innovation: A detailed look at our CRD-based solution.
- What's Next: Future steps and opportunities for further enhancement.
Slide 3: The Beginning
Launching Paytm Money's Zero-Fee Platform
- The Situation: In mid-2018, Paytm Money launched with zero fees for mutual funds.
- The Task: Generate and email over 1 million PDF statements weekly, within India.
- Our Action: Developed a Kubernetes-native CRD solution with CronJob and custom controllers.
- The Result: Jobs completed in under 6 hours, with a 40% reduction in infra costs.
- A Milestone: Achieved an error rate below 0.1%, setting a new standard.
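As a rough sanity check on these figures, the arithmetic below estimates the average throughput the weekly run has to sustain. It is a back-of-the-envelope sketch, not a measurement: it treats "over 1 million" and "under 6 hours" as exact bounds and assumes the default parallelism of 200 quoted later in the deck means 200 concurrent workers.

```python
# Back-of-the-envelope throughput implied by the slide's figures.
STATEMENTS_PER_WEEK = 1_000_000  # "over 1 million" on the slide, taken as a lower bound
WINDOW_HOURS = 6                 # "under 6 hours" on the slide, taken as an upper bound
PARALLELISM = 200                # StatementBatch default; assumed to mean concurrent workers


def required_rate_per_second(statements: int, window_hours: float) -> float:
    """Average statements per second needed to finish inside the window."""
    return statements / (window_hours * 3600)


rate = required_rate_per_second(STATEMENTS_PER_WEEK, WINDOW_HOURS)

# Time budget per statement for each worker if PARALLELISM workers run at once.
seconds_per_statement = PARALLELISM / rate

print(f"{rate:.1f} statements/second overall")
print(f"{seconds_per_statement:.2f} seconds per statement per worker")
```

Under these assumptions the run needs roughly 46 statements per second overall, which leaves each of 200 workers a little over 4 seconds to generate and send one statement.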
Slide 4: Growth & Constraints
Navigating Rapid Expansion in India
- Explosive Growth: 750,000+ early registrations in days, growing to 21 million users.
- India-Only: Limited to AWS India's Mumbai region with self-managed Kubernetes.
- DevOps to the Rescue: Dev teams focused on product features, leaving statement delivery to DevOps.
- Challenges: No EKS available, requiring innovative solutions within constraints.
- Our Response: Used an open-source autoscaler to manage the load within India.
Slide 5: The Legacy System
Understanding Our Starting Point
- Java Monolith: Ran on large EC2 instances, struggling with scalability.
- PDF Library Woes: Heavy library and poor parallelism led to Friday backlogs.
- Email Issues: Multiple outbound IPs caused spam filtering and low deliverability.
- Cost Inefficiency: High fixed EC2 costs with no elastic response to load spikes.
- The Need for Change: A clear signal that a new approach was necessary for growth.
Slide 6: First Steps to Containers
Transitioning to a More Scalable Solution
- Python Rewrite: Statement generator rewritten in Python for better performance.
- Containerization: Moved to self-managed Kubernetes on EC2 for scalability.
- GIL Limitations: Hit Python's Global Interpreter Lock, limiting CPU utilization.
- Marginal Gains: Performance improved slightly, but costs remained high.
- The Journey Continues: Recognized the need for a more innovative approach.
Slide 7: Our Final Architecture
A Comprehensive Overview
- CronJob Trigger: Initiates batch creation every Friday at 3 PM.
- StatementBatch CR: Defines the master job with declarative parameters.
- Statement CRs: Represent per-user tasks for efficient processing.
- Worker Pods: Handle PDF generation and email sending within pods.
- SMTP Relay: Uses a fixed Elastic IP for high deliverability.
Slide 8: StatementBatch CRD
Defining the Master Job
- CRD Structure: Defines the master job with declarative parameters.
- Batch ID: Unique identifier for each batch of statements.
- User Count: Specifies the number of users for the batch.
- Parallelism: Controls the number of parallel tasks, defaulting to 200.
- Timeout: Sets a timeout for batch processing, defaulting to 540 minutes.
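The fields above could map onto a custom resource along the following lines. This is a minimal sketch, not the actual schema: the API group `statements.example.com/v1alpha1`, the object name format, and the camelCase field names are all assumptions. Only the four parameters and the two defaults (200 and 540) come from the slide.

```python
import json
from typing import Any, Dict

# Hypothetical API group/version; the slides do not name one.
API_VERSION = "statements.example.com/v1alpha1"


def statement_batch(batch_id: str, user_count: int,
                    parallelism: int = 200,
                    timeout_minutes: int = 540) -> Dict[str, Any]:
    """Build a StatementBatch manifest using the defaults quoted on the slide."""
    if user_count < 0 or parallelism < 1 or timeout_minutes < 1:
        raise ValueError("user_count must be >= 0; parallelism and timeout must be >= 1")
    return {
        "apiVersion": API_VERSION,
        "kind": "StatementBatch",
        "metadata": {"name": f"statement-batch-{batch_id}"},
        "spec": {  # field names are illustrative
            "batchId": batch_id,
            "userCount": user_count,
            "parallelism": parallelism,
            "timeoutMinutes": timeout_minutes,
        },
    }


batch = statement_batch("2025-04-04", user_count=1_000_000)
print(json.dumps(batch["spec"], indent=2))
```

Keeping the defaults in one builder means a caller only has to supply the batch ID and user count, which mirrors how a CRD schema default would behave.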
Slide 9: Statement CRD
Tracking Per-User Tasks
- CRD Structure: Represents per-user work items with status tracking.
- User ID: Unique identifier for the user receiving the statement.
- Email: Email address where the statement will be sent.
- Data Key: Key to access user data for statement generation.
- Retry Count: Tracks retries, starting at 0 and capped at a configured maximum.
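A per-user Statement object carrying these four fields might look like the sketch below. The API group, the `batch-id` label, the name format, and the example data key are hypothetical. The slide only names the fields and the retry default of 0.

```python
from dataclasses import dataclass
from typing import Any, Dict

API_VERSION = "statements.example.com/v1alpha1"  # hypothetical group/version


@dataclass
class StatementSpec:
    user_id: str
    email: str
    data_key: str         # key used to look up the user's statement data
    retry_count: int = 0  # default from the slide


def to_manifest(spec: StatementSpec, batch_id: str) -> Dict[str, Any]:
    """Render a Statement custom resource as a plain dict (field names are illustrative)."""
    return {
        "apiVersion": API_VERSION,
        "kind": "Statement",
        "metadata": {
            "name": f"statement-{batch_id}-{spec.user_id}",
            "labels": {"batch-id": batch_id},  # assumed link back to the parent batch
        },
        "spec": {
            "userId": spec.user_id,
            "email": spec.email,
            "dataKey": spec.data_key,
            "retryCount": spec.retry_count,
        },
        "status": {"phase": "Pending"},
    }


stmt = to_manifest(
    StatementSpec("u-1001", "user@example.com", "statements/2025-04-04/u-1001"),
    batch_id="2025-04-04",
)
print(stmt["metadata"]["name"], stmt["spec"]["retryCount"])
```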
Slide 10: CronJob Scheduling
Automating Weekly Statements
- Trigger Time: Runs at 3 PM every Friday to start the statement process.
- Job Creation: Creates a StatementBatch CR to manage the batch process.
- Container Image: Uses paytm/statement-initiator:v1 to initiate the process.
- Restart Policy: Set to OnFailure so a failed initiator container is restarted.
- Date Argument: Passes the current date to the container for processing.
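The settings above could be expressed as a CronJob manifest like the one sketched here as a Python dict. The schedule `0 15 * * 5` is standard cron syntax for 15:00 on Fridays. The object name, the `statement-initiator` entrypoint, the `--date` flag, and the use of a shell to compute the date are assumptions, as is the time zone the schedule is evaluated in.

```python
import json
from typing import Any, Dict


def statement_cronjob() -> Dict[str, Any]:
    """Sketch of a CronJob that starts the weekly statement run."""
    return {
        # batch/v1 is the current CronJob API; clusters older than 1.21 used batch/v1beta1.
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "weekly-statements"},  # hypothetical name
        "spec": {
            "schedule": "0 15 * * 5",  # minute 0, hour 15, any day, any month, Friday
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "OnFailure",
                "containers": [{
                    "name": "initiator",
                    "image": "paytm/statement-initiator:v1",
                    # Hypothetical entrypoint and flag; the shell expands the current date.
                    "command": ["/bin/sh", "-c"],
                    "args": ["exec statement-initiator --date=$(date +%Y-%m-%d)"],
                }],
            }}}},
        },
    }


cronjob = statement_cronjob()
pod_spec = cronjob["spec"]["jobTemplate"]["spec"]["template"]["spec"]
print(json.dumps(cronjob, indent=2))
```

Computing the date inside the container keeps the manifest static, so the same CronJob object can run every week without being edited.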
Slide 11: Batch Controller
Managing Large Batches
- Pending Phase: Creates Statement CRs to spawn per-user tasks.
- Running Phase: Monitors progress and sets the batch to Completed when finished.
- Progress Metrics: Updates metrics to track processed vs. failed counts.
- Splitting Batches: Efficiently divides large batches into manageable Statement CRs.
- Status Tracking: Ensures all parts of the batch are accounted for and processed.
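The phase handling described above can be sketched as a single reconcile function over in-memory dicts. This is an illustration of the control flow only, with no Kubernetes client involved. The `user_ids` argument, the dict layout, and the choice of `Sent` and `Failed` as terminal Statement phases are assumptions.

```python
from typing import Any, Dict, List

TERMINAL_PHASES = {"Sent", "Failed"}  # assumed terminal phases for a Statement


def reconcile_batch(batch: Dict[str, Any],
                    statements: List[Dict[str, Any]],
                    user_ids: List[str]) -> None:
    """Advance one StatementBatch by one step, mutating batch and statements in place."""
    status = batch.setdefault("status", {})
    phase = status.get("phase", "Pending")

    if phase == "Pending":
        # Split the batch into one Statement per user, then start monitoring.
        for uid in user_ids:
            statements.append({"batchId": batch["batchId"], "userId": uid, "phase": "Pending"})
        status.update(phase="Running", processed=0, failed=0)

    elif phase == "Running":
        mine = [s for s in statements if s["batchId"] == batch["batchId"]]
        status["processed"] = sum(s["phase"] == "Sent" for s in mine)
        status["failed"] = sum(s["phase"] == "Failed" for s in mine)
        if all(s["phase"] in TERMINAL_PHASES for s in mine):
            status["phase"] = "Completed"


batch = {"batchId": "b1"}
statements: List[Dict[str, Any]] = []
reconcile_batch(batch, statements, ["u1", "u2", "u3"])  # Pending -> Running
for s, outcome in zip(statements, ["Sent", "Sent", "Failed"]):
    s["phase"] = outcome                                 # simulate worker results
reconcile_batch(batch, statements, [])                   # Running -> Completed
print(batch["status"])
```

Because each call inspects the current state and takes at most one step, calling it repeatedly is safe, which is the property a controller loop relies on.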
Slide 12: Statement Controller
Handling Individual Statements
- Pending to Generating: Moves to the Generating phase and creates a worker pod.
- Generating to Generated: Updates status when the pod finishes PDF generation.
- Generated to Sending: Initiates the email sending process.
- Sending to Sent: Updates status upon successful email delivery.
- Handling Failures: Retries failed statements up to a maximum retry count.
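The transitions above form a small state machine, sketched here as a pure function. The phase names follow the slide. The `Failed` terminal phase, the limit of 3 retries, and the choice to restart a failed statement from `Pending` are assumptions made for illustration.

```python
from enum import Enum
from typing import Tuple


class Phase(str, Enum):
    PENDING = "Pending"
    GENERATING = "Generating"
    GENERATED = "Generated"
    SENDING = "Sending"
    SENT = "Sent"
    FAILED = "Failed"  # assumed terminal phase once retries are exhausted


NEXT_PHASE = {
    Phase.PENDING: Phase.GENERATING,
    Phase.GENERATING: Phase.GENERATED,
    Phase.GENERATED: Phase.SENDING,
    Phase.SENDING: Phase.SENT,
}


def advance(phase: Phase, retry_count: int, step_succeeded: bool,
            max_retries: int = 3) -> Tuple[Phase, int]:
    """Return the next (phase, retry_count) for one Statement."""
    if phase in (Phase.SENT, Phase.FAILED):
        return phase, retry_count             # terminal: nothing left to do
    if step_succeeded:
        return NEXT_PHASE[phase], retry_count
    if retry_count < max_retries:
        return Phase.PENDING, retry_count + 1  # assumed: a retry starts over
    return Phase.FAILED, retry_count


# Happy path: four successful steps take a statement from Pending to Sent.
state = (Phase.PENDING, 0)
for _ in range(4):
    state = advance(*state, step_succeeded=True)
print(state[0].value, state[1])
```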
Slide 13: Worker Pod Specification
Efficient PDF Generation and Email Sending
- PDF Generator: Uses paytm/pdf-generator:v1 with specified resource limits.
- Email Sender: Utilizes paytm/email-sender:v1, configured for SMTP relay.
- Volume for PDFs: Employs an emptyDir volume for PDF file exchange.
- Elastic IP: All pods egress via the same Elastic IP for consistency.
- Restart Policy: Set to Never, so a failed pod is not restarted in place and retries are left to the Statement controller.
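A pod with these properties could be described roughly as follows. The two images, the emptyDir volume, and the restart policy come from the slide. The resource limit values, mount path, volume name, and `SMTP_RELAY_HOST` variable are placeholders, and running both images as regular containers (rather than, say, using an init container for generation) is an assumption.

```python
from typing import Any, Dict


def worker_pod(statement_name: str) -> Dict[str, Any]:
    """Sketch of a worker pod: two containers sharing an emptyDir for the PDF."""
    shared_mount = {"name": "pdf-output", "mountPath": "/pdfs"}  # hypothetical path
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"{statement_name}-worker"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "pdf-generator",
                    "image": "paytm/pdf-generator:v1",
                    # Illustrative limits; the slide does not give the values.
                    "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
                    "volumeMounts": [shared_mount],
                },
                {
                    "name": "email-sender",
                    "image": "paytm/email-sender:v1",
                    # Hypothetical variable pointing at the SMTP relay.
                    "env": [{"name": "SMTP_RELAY_HOST", "value": "smtp-relay.internal"}],
                    "volumeMounts": [shared_mount],
                },
            ],
            # emptyDir lives only as long as the pod, so PDFs are not persisted.
            "volumes": [{"name": "pdf-output", "emptyDir": {}}],
        },
    }


pod = worker_pod("statement-2025-04-04-u-1001")
print([c["image"] for c in pod["spec"]["containers"]])
```

The shared Elastic IP does not appear in the pod spec because egress addressing is normally decided at the node or network layer rather than per pod.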
Slide 14: Security and Access
Ensuring Secure Operations
- Service Account: Created as statement-controller-sa for secure access.
- Cluster Role: Defines permissions for statementbatches and statements.
- Cluster Role Binding: Binds the cluster role to the statement-controller-sa service account.
- Pod Permissions: Grants permissions to create, delete, and get pods.
- Security Principle: Operates on the principle of least privilege for safety.
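The three RBAC objects above might be laid out as in this sketch. The service account name and the pod verbs (create, delete, get) come from the slide. The namespace, the role and binding names, the custom API group, and the verbs granted on the custom resources are assumptions.

```python
from typing import Any, Dict, List

NAMESPACE = "statements"                # hypothetical namespace
CRD_GROUP = "statements.example.com"    # hypothetical API group for the custom resources
ROLE_NAME = "statement-controller"      # hypothetical role name


def rbac_manifests() -> List[Dict[str, Any]]:
    """Return ServiceAccount, ClusterRole, and ClusterRoleBinding manifests."""
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": "statement-controller-sa", "namespace": NAMESPACE},
    }
    cluster_role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": ROLE_NAME},
        "rules": [
            {   # verbs on the custom resources are assumed, not taken from the slide
                "apiGroups": [CRD_GROUP],
                "resources": ["statementbatches", "statements"],
                "verbs": ["get", "list", "watch", "create", "update", "patch"],
            },
            {   # pods live in the core API group, written as an empty string
                "apiGroups": [""],
                "resources": ["pods"],
                "verbs": ["create", "delete", "get"],
            },
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": ROLE_NAME},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "ClusterRole", "name": ROLE_NAME},
        "subjects": [{"kind": "ServiceAccount",
                      "name": "statement-controller-sa", "namespace": NAMESPACE}],
    }
    return [service_account, cluster_role, binding]


sa, role, binding = rbac_manifests()
print([rule["resources"] for rule in role["rules"]])
```

Listing only the verbs the controller actually uses on each resource is what keeps the role narrow.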
Slide 15: Thank You
Appreciating Your Engagement
- Gratitude: Thank you for joining us on this innovative journey.
- Your Insights: We value your questions and look forward to your feedback.
- Future Collaboration: Let's keep the conversation going and explore new possibilities.
- Stay Connected: Follow us for more updates on our technological advancements.
- Farewell: Thank you again, and we hope to see you at our next event!