Senior Site Reliability Engineer - Remote

1 day ago

General Dynamics Information Technology

US · Full-time · $111,155 – $150,385

About this role

Discover a career that is challenging and impactful. Join our team as a Senior Site Reliability Engineer and work on the reliability and availability of mission-critical cloud systems. While you help us advance the mission, we’ll help you build your skills and advance your career.

Communicate technical risks, impacts, and reliability insights clearly to non-technical stakeholders while leading reliability-focused projects.

Requirements

5+ years of experience in site reliability engineering, systems engineering, or related technical roles.
Strong software engineering background with proficiency in Python, Go, or similar languages.
Deep understanding of distributed systems, cloud platforms (AWS, Azure, GCP), and container orchestration (Kubernetes).
Experience with observability tools such as Prometheus, Grafana, OpenTelemetry, and log aggregation platforms.
Skilled in defining SLIs, SLOs, and error budgets for measuring and maintaining service reliability.
Demonstrated success designing, supporting, and improving reliability for distributed, cloud-based systems.
Excellent problem-solving and analytical skills with a proactive and systematic approach to reliability issues.

Responsibilities

Ensure high availability, resilience, and performance of critical systems by designing monitoring, alerting, and automated remediation solutions.
Analyze logs, system metrics, and performance data to identify bottlenecks and drive improvements to system health and scalability.
Develop and maintain automation scripts, infrastructure-as-code configurations, and tooling to streamline deployments, scaling, and operational workflows.
Lead initiatives to reduce manual toil and promote best-in-class reliability engineering practices across cloud environments.
Participate in on-call rotations, respond to incidents promptly, conduct root-cause analyses, and drive systemic corrective actions.
Collaborate with software engineering, product, and infrastructure teams to integrate reliability into the full development lifecycle.
Provide architectural guidance on distributed systems, capacity planning, disaster recovery, and reliability patterns.