
Site Reliability Engineer
1 week ago · ThoughtSpot
Bengaluru, IN · Full-time · INR 3,000,000 – INR 5,000,000
About this role
As part of the ThoughtSpot SRE team, you will be on the cutting edge of operational intelligence. You will ensure service reliability and act as a trusted partner for customers, proactively using AI/ML to deliver timely updates, meaningful solutions, and predictive improvements. You will bridge customers and engineering with deep systems expertise and a passion for customer success.
Act as the primary point of contact for customer-facing technical issues on the SaaS platform, including data connectivity, report errors, performance concerns, and integration challenges. Provide timely, accurate updates while meeting SLAs and driving resolutions via tickets and calls. Translate complex technical issues into clear updates for all stakeholders.
Maintain, monitor, and troubleshoot cloud infrastructure using Grafana, Prometheus, Datadog, and Splunk. Implement AI/ML-driven solutions for proactive observability, anomaly detection, and intelligent alerting to enhance reliability and reduce MTTR. Apply NetOps and SecOps principles across cloud and on-premise deployments.
Participate in on-call rotations, lead incident reviews, and conduct root cause analyses for continuous improvement. Work cross-functionally with Engineering to boost debuggability, availability, scalability, and performance. Develop automation and best practices to build resilient, self-optimizing systems.
Requirements
- B.S. in Computer Science or equivalent relevant experience
- Proven experience troubleshooting complex Linux systems and managing virtualization and cloud platforms (VMware, AWS, Azure, GCP)
- Hands-on experience with monitoring tools such as Grafana, Prometheus, Datadog, or Splunk
- Demonstrated experience with, and a keen interest in, applying AI/ML principles to SRE challenges, including AIOps, predictive maintenance, and intelligent automation
- Prior experience in enterprise customer support, including on-call rotations and incident management, with the ability to lead root cause analyses
- Strong problem-solving and algorithmic thinking with a solid understanding of systems
Responsibilities
- Act as the primary point of contact for customer-facing technical issues related to the SaaS platform, including data connectivity, report errors, performance concerns, access problems, data inconsistencies, software bugs, and integration challenges
- Provide timely, accurate, and clear updates to customers, consistently meeting SLAs and driving issues through to full resolution via tickets and calls
- Maintain, monitor, and troubleshoot ThoughtSpot cloud infrastructure using tools like Grafana, Prometheus, Datadog, and Splunk
- Implement and leverage AI/ML-driven solutions for proactive observability, predictive anomaly detection, and intelligent alerting to enhance service reliability and reduce MTTR
- Participate in on-call rotations, lead incident reviews, and conduct thorough root cause analyses to drive continuous improvement
- Work cross-functionally with Engineering to define and implement tools that enhance debuggability, supportability, availability, scalability, and performance
- Develop and implement automation and best practices to streamline operations and strengthen system reliability