Site Reliability Manager

Location: Herndon | Date: 2/4/25

Overview:

We are seeking a highly skilled and experienced Site Reliability Manager to join our team. The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our systems and services. They will lead a team of engineers in designing, implementing, and maintaining robust infrastructure and automation solutions.

The ideal candidate must reside in the Washington, DC area and be available to work on site in downtown Washington, DC as required.

Responsibilities:

Lead a service delivery team of 8-20 people (Service Support specialists, DevSecOps engineers, and Site Reliability Engineers).
Define and implement best practices for infrastructure as code, deployment automation, and monitoring.
Collaborate with cross-functional teams to design scalable and fault-tolerant architectures.
Develop and maintain service level objectives (SLOs) and key performance indicators (KPIs) to measure system reliability and performance.
Conduct post-mortems and root cause analyses for incidents and implement preventive measures to mitigate future incidents.
Drive continuous improvement initiatives to enhance the reliability, scalability, and efficiency of our systems and services.
Mentor and coach team members to foster a culture of learning and innovation.

Qualifications and Education:

Required:

Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree preferred.
10+ years of experience in a similar role managing a team of site reliability engineers and delivering on the AWS cloud platform.
Proven track record of managing high-performance teams.
5+ years of experience supporting operations and maintenance for cloud-native applications in production that are fault-tolerant, self-healing, scalable, and highly available.
Deep understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
Strong knowledge of infrastructure as code tools (e.g., Terraform, Ansible, ArgoCD) and CI/CD pipelines.
Experience with monitoring, logging, and observability tools such as Datadog, AWS CloudWatch, ELK, Prometheus, and Splunk.
Excellent communication and interpersonal skills, with the ability to collaborate effectively with cross-functional teams.
Strong problem-solving and analytical skills, with keen attention to detail.
Certifications such as AWS Certified DevOps Engineer or Google Professional Cloud DevOps Engineer are a plus.
Ability to obtain and maintain a Public Trust clearance.

Compensation:

In accordance with pay transparency guidelines, the proposed salary range for this position is $140,000.00 to $180,000.00. Final salary will be determined based on various factors such as relevant skills, experience and certifications.
