Site Reliability Engineer

1 week ago

ThoughtSpot

Bengaluru, IN · Full-time · INR 3,000,000 – INR 5,000,000

About this role

As part of the ThoughtSpot SRE team, you will be on the cutting edge of operational intelligence. You will ensure service reliability and act as a trusted partner for customers by proactively leveraging AI/ML for timely updates, meaningful solutions, and predictive improvements. You will bridge customers and engineering with deep systems expertise and a passion for customer success.

Act as the primary point of contact for customer-facing technical issues on the SaaS platform, including data connectivity, report errors, performance concerns, and integration challenges. Provide timely, accurate updates while meeting SLAs and driving resolutions via tickets and calls. Translate complex technical issues into clear updates for all stakeholders.

Maintain, monitor, and troubleshoot cloud infrastructure using Grafana, Prometheus, Datadog, and Splunk. Implement AI/ML-driven solutions for proactive observability, anomaly detection, and intelligent alerting to enhance reliability and reduce MTTR. Apply NetOps and SecOps principles across cloud and on-premises deployments.

Participate in on-call rotations, lead incident reviews, and conduct root cause analyses for continuous improvement. Work cross-functionally with Engineering to boost debuggability, availability, scalability, and performance. Develop automation and best practices to build resilient, self-optimizing systems.

Requirements

B.S. in Computer Science or equivalent relevant experience
Proven experience troubleshooting complex Linux systems and managing virtualization and cloud platforms (VMware, AWS, Azure, GCP)
Hands-on experience with monitoring tools such as Grafana, Prometheus, Datadog, or Splunk
Demonstrated experience and a keen interest in leveraging AI/ML principles to address SRE challenges, including AIOps, predictive maintenance, and intelligent automation
Prior experience in enterprise customer support, including on-call rotations and incident management, with the ability to lead root cause analyses
Strong problem-solving and algorithmic thinking with a solid understanding of systems

Responsibilities

Act as the primary point of contact for customer-facing technical issues related to the SaaS platform, including data connectivity, report errors, performance concerns, access problems, data inconsistencies, software bugs, and integration challenges
Provide timely, accurate, and clear updates to customers, consistently meeting SLAs and driving issues through to full resolution via tickets and calls
Maintain, monitor, and troubleshoot ThoughtSpot cloud infrastructure using tools like Grafana, Prometheus, Datadog, and Splunk
Implement and leverage AI/ML-driven solutions for proactive observability, predictive anomaly detection, and intelligent alerting to enhance service reliability and reduce MTTR
Participate in on-call rotations, lead incident reviews, and conduct thorough root cause analyses to drive continuous improvement
Work cross-functionally with Engineering to define and implement tools that enhance debuggability, supportability, availability, scalability, and performance
Develop and implement automation and best practices to streamline operations and strengthen system reliability