I build systems that stay calm
when traffic gets loud.
I’m Shivam Kumar, a Platform engineer and SRE. I turn chaotic distributed systems into predictable, observable, and cost-aware platforms.
Featured Work
Flink on Kubernetes Platform
A self-serve streaming platform with guardrails: sane defaults, paved paths, and operational clarity for developers.
- Lag-aware autoscaling + self-healing job orchestration
- Multi-tenant isolation and upgrade-safe deployments
- Opinionated observability: SLOs, golden signals, runbooks
Cost-First Metrics Migration
Re-architected metrics storage and query patterns to reduce spend without sacrificing incident-time fidelity.
- Prometheus → VictoriaMetrics migration with rollout playbook
- Dropped cardinality hotspots via relabeling + guidelines
- Cut infra costs by ~60% while improving query latency
GitOps Delivery System
A delivery pipeline that makes shipping boring: previews, policy checks, and safe progressive rollouts.
- ArgoCD-based sync strategy + standardized app templates
- Guardrails: policy-as-code, secrets hygiene, drift detection
- Reduced operational toil by ~50% through automation
Toolbox
Platforms
Languages
Systems
Reliability
Experience
I like roles where the job isn’t “keep it up” but “make it resilient.”
Flexera
Orchestrating the reliability of SaaS solutions. Focused on turning distributed chaos into actionable business insights through robust platform engineering.
MoEngage
The Scale Up
Built an in-house Flink on Kubernetes platform: self-healing, lag-aware scaling, and clear on-call ergonomics.
The Cost Killers
Migrated Prometheus → VictoriaMetrics, reducing infra spend by ~60%.
Automation Wins
Built platforms + GitOps pipelines, cutting operational toil by half.
Ingram Micro
Foundation years: multi-stage CI/CD on AWS & Azure, lift-and-shift migrations, and making “secure by default” non-negotiable.
Writing
I write to turn tribal knowledge into repeatable playbooks. Expect posts about Kubernetes, Terraform, platform patterns, and reliability thinking.
