100ms

Platform Engineer - Core Infrastructure

1w

Bengaluru, IN · Full-time · INR 2,500,000 – INR 4,500,000

About this role

100ms operates a real-time Live Video platform powering latency-sensitive, high-concurrency video experiences, and an AI Agents platform automating complex patient access workflows in U.S. healthcare. Both products run on a shared, robust infrastructure foundation. You'll join the central platform team responsible for keeping both running reliably, securely, and at scale.

Own and operate production infrastructure across multiple GKE clusters supporting real-time video workloads and AI agent pipelines, with HA, autoscaling, and full observability. Manage GitOps workflows using Argo CD for automated, version-controlled deployments. Maintain and optimize monitoring and alerting stacks built on open source tools, with product-specific SLOs for video jitter and AI task throughput.

Implement infrastructure as code using Terraform for GCP resources and Helm charts for Kubernetes manifests. Support the unique demands of real-time video, including media server scaling, WebRTC infrastructure, and low-latency networking. Support AI agent workloads, including LLM inference infrastructure, async task queues, and healthcare system integrations.

Own the security posture of the infrastructure by enforcing least-privilege access, maintaining secrets hygiene, and driving hardening across clusters. Implement compliance-aligned controls such as encryption and audit logging for healthcare data. Collaborate with product and engineering teams to embed security early through shift-left practices.

Requirements

  • Computer Science / Engineering degree or equivalent practical experience
  • Minimum 3 years of hands-on experience with Kubernetes in a production environment
  • Strong knowledge of CI/CD pipelines and GitOps workflows using Argo CD or similar tools
  • Proficient in infrastructure automation using Terraform and Helm
  • Experience managing open source monitoring and logging stacks (Prometheus, Loki, Grafana, Alertmanager, etc.)
  • Working knowledge of cloud security principles — IAM, network policies, pod security, RBAC, and secrets management
  • Comfortable with Linux systems, shell scripting, and basic networking including UDP/TCP behaviour relevant to real-time media

Responsibilities

  • Own and operate production infrastructure across multiple GKE clusters supporting real-time video workloads and AI agent pipelines, with HA, autoscaling, and full observability
  • Manage GitOps workflows using Argo CD for automated, version-controlled, and auditable deployments
  • Maintain and optimize monitoring and alerting stacks built on open source monitoring tools, with product-specific SLOs
  • Implement infrastructure as code using Terraform for GCP resources and Helm charts for Kubernetes manifests
  • Support real-time video infrastructure including media server scaling, WebRTC, low-latency networking, and high-throughput data paths
  • Support AI agent workloads including LLM inference infrastructure, async task queues, and integration pipelines with external healthcare systems
  • Lead or support incident response, cluster upgrades, and disaster recovery procedures
  • Own the security posture of the infrastructure by enforcing least-privilege access controls, secrets hygiene, and security hardening