Portfolio
Platform • SRE • Cloud-Native

I build systems that stay calm
when traffic gets loud.

I’m Shivam Kumar — a Platform & SRE engineer. I turn chaotic distributed systems into predictable, observable, and cost-aware platforms.

Kubernetes-native platforms • Streaming + event-driven systems • SLOs + incident response • FinOps-minded observability
Shivam Kumar

Platform & SRE @ Flexera

Bengaluru, India
Distributed systems: built for scale + clarity
50%: operational toil reduced
Certified: AWS + Azure Associate

Featured Work

Cloud Platforms
Case

Flink on Kubernetes Platform

A self-serve streaming platform with guardrails: sane defaults, paved paths, and operational clarity for developers.

  • Lag-aware autoscaling + self-healing job orchestration
  • Multi-tenant isolation and upgrade-safe deployments
  • Opinionated observability: SLOs, golden signals, runbooks
Kubernetes • Flink • GitOps • SRE
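
For readers unfamiliar with the term, lag-aware autoscaling can be sketched roughly as below. This is an illustration only, not the platform's actual code: the function name, its parameters, the 300-second drain window, and the parallelism bounds are all assumptions made for the example. The idea is to size a job so it keeps up with incoming traffic and also works off its backlog within a target window.

```python
import math


def desired_parallelism(consumer_lag, input_rate, records_per_sec_per_task,
                        drain_seconds=300, min_parallelism=1, max_parallelism=64):
    """Hypothetical sizing rule: tasks needed to absorb input and drain lag.

    consumer_lag             -- records waiting in the source (e.g. a Kafka topic)
    input_rate               -- records/sec currently arriving
    records_per_sec_per_task -- measured throughput of a single task
    drain_seconds            -- assumed target time to clear the backlog
    """
    if records_per_sec_per_task <= 0:
        raise ValueError("records_per_sec_per_task must be positive")
    # Required throughput = steady-state input + backlog spread over the drain window.
    required_rate = input_rate + consumer_lag / drain_seconds
    needed = math.ceil(required_rate / records_per_sec_per_task)
    # Clamp so a lag spike cannot request unbounded resources.
    return max(min_parallelism, min(max_parallelism, needed))


if __name__ == "__main__":
    print(desired_parallelism(consumer_lag=600_000, input_rate=4_000,
                              records_per_sec_per_task=1_000))
```

A real controller would typically add smoothing and cooldown periods so that short lag spikes do not cause repeated rescaling.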
Observability
Case

Cost-First Metrics Migration

Re-architected metrics storage and query patterns to reduce spend without sacrificing incident-time fidelity.

  • Prometheus → VictoriaMetrics migration with rollout playbook
  • Dropped cardinality hotspots via re-labeling + guidelines
  • Cut infra costs by ~60% while improving query latency
Prometheus • VictoriaMetrics • Grafana • FinOps
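
As a rough illustration of what finding a cardinality hotspot means, the sketch below counts distinct values per label across a set of series and flags labels above a threshold. It is a minimal example under stated assumptions, not the tooling used in the migration: the function names, the sample label sets, and the threshold are hypothetical.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label name across a list of label dicts."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def hotspots(series, threshold):
    """Return label names with more than `threshold` distinct values, worst first."""
    counts = label_cardinality(series)
    hot = [name for name, count in counts.items() if count > threshold]
    return sorted(hot, key=lambda name: counts[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical series: `request_id` is unique per series, a classic hotspot.
    sample = [
        {"job": "api", "status": str(200 + (i % 3)), "request_id": f"req-{i}"}
        for i in range(100)
    ]
    print(label_cardinality(sample))
    print(hotspots(sample, threshold=10))
```

Labels flagged this way are candidates for dropping or rewriting at ingestion time through re-labeling rules.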
Developer Experience
Case

GitOps Delivery System

A delivery pipeline that makes shipping boring: previews, policy checks, and safe progressive rollouts.

  • ArgoCD-based sync strategy + standardized app templates
  • Guardrails: policy-as-code, secrets hygiene, drift detection
  • Reduced operational toil by ~50% through automation
ArgoCD • Terraform • CI/CD • Security

Toolbox

Platforms

Kubernetes • AWS • Terraform • Kafka

Languages

Go • Python • Bash • HCL

Systems

Distributed Systems • Event-Driven • Performance • eBPF (curious)

Reliability

SLOs • Incident Response • Observability • Runbooks

Experience

I like roles where the job isn’t “keep it up”, it’s “make it resilient.”

Flexera

Member of Technical Staff — SRE
Apr 2024 — Present

Orchestrating the reliability of SaaS solutions. Focused on turning distributed chaos into actionable business insights through robust platform engineering.

Cloud-native • Platform • Observability

MoEngage

Site Reliability Engineer (I, II, III)
Dec 2020 — Mar 2024

The Scale Up

Built an in-house Flink on Kubernetes platform: self-healing, lag-aware scaling, and clear on-call ergonomics.

The Cost Killers

Migrated Prometheus → VictoriaMetrics, reducing infra spend by ~60%.

Automation Wins

Built platforms + GitOps pipelines, cutting operational toil by half.

Ingram Micro

Software Engineer — SRE
Jan 2019 — Nov 2020

Foundation years: multi-stage CI/CD on AWS & Azure, lift-and-shift migrations, and making “secure by default” non-negotiable.

Writing

I write to turn tribal knowledge into repeatable playbooks. Expect posts about Kubernetes, Terraform, platform patterns, and reliability thinking.

Let’s build something that survives reality.

If you’re hiring for Platform/SRE, or you want to talk cloud-native systems, observability, or streaming workloads — reach out.