Site Reliability Engineer

Location: Herndon

Overview:

Summary:

As a Site Reliability Engineer, you will help build out and run production environments, automate operations, and maintain and support infrastructure. You will also drive and establish service level objectives (SLOs) and metrics to meet the reliability expectations of multiple applications.

Responsibilities:

What You'll Be Doing:

  • Deploy and manage applications on Kubernetes container platforms such as AWS EKS or OpenShift.
  • Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues.
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance.
  • Implement and support integrated CI/CD pipelines for on-premises and/or cloud assets using tools such as Jenkins, GitHub/Bitbucket, and Nexus/Artifactory.
  • Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents.
  • Implement, deploy, and maintain infrastructure as code (IaC) for provisioning infrastructure using AWS CloudFormation or Terraform.
  • Maintain, monitor, and improve application configurations using tools such as Ansible, Packer, Puppet, or Chef.
  • Design, deploy, monitor, and manage cloud solutions in public environments such as AWS or Azure, or in private environments.
  • Design, build, and maintain automated monitoring and notification services to support fault-tolerant and highly available systems and metrics using tools such as AWS CloudWatch, EFK, and Prometheus.

Qualifications and Education:

Required Qualifications:

  • Bachelor’s degree in Computer Science, Engineering, or a related field and 8-10 years of relevant experience.
  • 5+ years of experience supporting operations and maintenance for cloud-native applications in production that are fault-tolerant, self-healing, scalable, and highly available.
  • Deep understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Kubernetes).
  • Experience with monitoring, logging, and observability tools such as DataDog, AWS CloudWatch, ELK, Prometheus, and Splunk.
  • Knowledge of infrastructure as code tools (e.g., Terraform, Ansible, ArgoCD) and CI/CD pipelines.
  • Experience deploying enterprise software across multiple regions using AWS services such as EKS, RDS, EC2, Elastic Load Balancers, Lambda, DynamoDB, and API Gateway.
  • Strong problem-solving and analytical skills, with a keen attention to detail.
  • Certifications such as AWS Certified DevOps Engineer or Google Professional Cloud DevOps Engineer are a plus.
  • Ability to obtain and maintain a Public Trust clearance.

Compensation:

Things to Know:

In accordance with pay transparency guidelines, the proposed salary range for this position is $120,000 to $155,000. Final salary will be determined based on various factors such as relevant skills, experience and certifications.
