Sudhanshu Singh
DevOps / Site Reliability Engineer

I'm a DevOps / SRE engineer who owns production reliability end-to-end. At RegisterKaro I single-handedly migrated our platform from DigitalOcean to AWS and now run it solo for 50,000+ users: Infrastructure as Code with Terraform and Ansible, CI/CD with blue/green deploys, and full observability via Prometheus, Grafana, and Loki. A lot of the job is keeping production fast, reliable, and cheap: live diagnostics and incident response, slow-query and pipeline optimization that let us downsize the database by a full tier, and a ~20% cut in infrastructure cost. Underneath it is a real engineering foundation (Node.js / TypeScript services, RabbitMQ, Redis, and distributed-systems correctness), which is what lets me fix production at the code level, not just the infra level.
Skills
- AWS
- Terraform
- Ansible
- Docker
- GitHub Actions
- Nginx
- Linux
- DigitalOcean
- Prometheus
- Grafana
- Loki
- MongoDB
- PostgreSQL
- Redis
- RabbitMQ
- Node.js
- TypeScript
- Express.js
- REST APIs
- Webhooks
- JWT/OAuth
- Socket.IO
- React.js
Experience
Mar 2025 - Present
RegisterKaro (Safe Ledger Pvt Ltd)
DevOps / Site Reliability Engineer
- Single-handedly migrated production from DigitalOcean to AWS (ap-south-1) — provisioned the full stack with Terraform and Ansible (multi-AZ EC2 Auto Scaling Group, ALB, ECR, ElastiCache/Valkey, Secrets Manager, all in private subnets), ran a zero-downtime DNS cutover for 50,000+ customers, and retired the legacy stack 4 days early.
- Drove a ~20% cut in infrastructure cost (~$11.6K/year) — relocated compute to the database's region, set up Atlas↔AWS VPC peering (eliminated ~$390/mo of NAT egress), and decommissioned DigitalOcean; authored the vendor-bill-backed cost brief prepared for the CEO.
- Downsized the production MongoDB Atlas cluster a full tier (M50 → M40, ~$7.6K/year) — hand-analyzed slow-query logs and ran a pipeline and index optimization campaign: inverted the heaviest aggregation pipeline (23.5s → ~80ms) and dropped ~244 redundant indexes across 16 collections, reclaiming ~43% of slow-query time and ~2 TB/month of disk reads.
- Cut 8,000+ redundant MongoDB ops/min (peak ~17K fleet-wide): built live production diagnostics (event-loop lag and per-endpoint counters across both PM2 workers), added Redis caching to the auth, notification, and dashboard hot paths (auth: 6–7 DB ops/request → ~0), and moved per-request writes to cron, eliminating multi-second event-loop freezes (~46s → <0.2s).
- Hardened reliability and security: enabled multi-AZ Auto Scaling, decoupled API health checks from RabbitMQ, enforced TLS and AUTH on ElastiCache, and replaced the database's 0.0.0.0/0 access with VPC peering and an explicit allowlist; validated a ~5-minute recovery objective in a restore drill.
- Fixed a long-standing webhook race condition under concurrent RabbitMQ ingestion (atomic findOneAndUpdate and prefetch tuning) and added a Redis distributed lock for idempotent deduplication that degrades gracefully when Redis is unavailable.
- Built GitHub Actions CI/CD across the 4-service stack with blue/green deploys (ALB target-group swap) plus rolling ASG/SSM deploys and automated rollback, and stood up the Grafana + Loki + Promtail observability stack for incident response.
May 2024 - Jan 2025
Segwitz
Software Engineer (Frontend)
- Built performant React dashboards using code splitting and lazy loading, collaborating with the design team to reduce initial load time by 10%.
- Implemented OAuth 2.0 authentication with token refresh, protected routes, and secure session handling for SPA dashboards, coordinating with the backend team on API contracts.
- Contributed to frontend architecture decisions and component library conventions, and conducted code reviews on shared UI modules.