Site Reliability Engineer / Resiliency Engineer - Hatboro, PA
Location: USA, in person / travel to client site (not remote)
Job Type: Full-time
Industry: Leading Financial Services Organization / Technology Strategy Firm
Please read before applying:
Candidates are strongly encouraged to read the full job description before applying, as this role will involve detailed technical interviews to assess deep system knowledge and hands-on expertise. We value efficiency, so this will help ensure both you and the employer are aligned in expectations and qualifications.
Additionally, candidates who tailor their resume to highlight specific experience related to the job description will stand out. Generic resumes do not effectively represent the skills required for this role and are not helpful for either the candidate or the employer.
Overview:
We are seeking a highly skilled and proactive Site Reliability Engineer (SRE) / Resiliency Engineer to join our team in support of a leading financial services client. As a leading technology strategy firm, we are dedicated to helping our client improve their product and platform resiliency, ensuring their mission-critical systems remain reliable and scalable.
This role will focus on designing, enhancing, and maintaining resilient and highly available infrastructures that support complex financial services platforms.
The ideal candidate will have extensive experience working across a variety of operating systems and cloud platforms, including Linux, Windows, BSD, OpenShift, AWS, Azure, GCP, and even legacy systems such as mainframe. You'll be part of a collaborative team responsible for ensuring that systems can withstand failures, scale seamlessly with demand, and recover quickly from disruptions.
Additionally, this role may require on-call work to provide critical support based on client needs, ensuring operational continuity at all times.
Key Responsibilities:
- Resiliency & Fault Tolerance: Design, implement, and continuously improve the resiliency of production systems, platforms, and products across cloud and on-prem environments, ensuring maximum uptime and operational continuity.
- Cross-Platform Support: Maintain and optimize environments across Linux, Windows, BSD VMs, and mainframe systems. Familiarity with OpenShift and other container orchestration platforms is expected for robust, scalable deployments. The candidate will work in a diverse ecosystem and be required to take ownership across multiple technologies.
- Cloud Environments: Manage and support systems running in AWS, Azure, and GCP, ensuring seamless integration and optimal performance across different cloud platforms.
- Monitoring & Observability: Lead the development of advanced monitoring strategies, leveraging tools like Prometheus, Grafana, Splunk, New Relic, Datadog, AppDynamics, Nagios, Zabbix, and Elastic Stack (ELK) to proactively detect and address potential issues before they impact production.
- Chaos Engineering: Implement chaos engineering practices using tools like Gremlin, Chaos Monkey, Litmus, Simian Army, Pumba and other frameworks (both commercial and open source) to inject failure and test system resilience in real-world conditions, driving continuous improvements in fault tolerance.
- Ecosystem Management: Oversee and manage a complex ecosystem that spans cloud services, on-prem systems, virtualization, containers, and mainframe, ensuring all components work cohesively to support highly available and scalable environments.
- Capacity Planning & Scalability: Plan for and manage the scaling of systems to accommodate increasing traffic, data, and user demands, ensuring consistent performance without sacrificing availability. Tools like Kubernetes, Helm, Docker, Terraform, and CloudFormation are key.
- Incident Response & Root Cause Analysis (RCA): Lead incident response efforts, including root cause analysis, post-mortems, and implementation of corrective actions. Collaborate across teams to ensure that resiliency improvements are fully integrated into future system deployments.
- Automation: Develop and implement automation for system management, including infrastructure-as-code using tools like Terraform, Ansible, Chef, Puppet, SaltStack, CloudFormation, and Kubernetes Operators. Ensure consistent deployment and maintenance of infrastructure across environments.
- Security & Compliance: Work with security teams to ensure that all systems meet security standards and are compliant with industry regulations, proactively mitigating risks associated with system vulnerabilities. Familiarity with security tools like HashiCorp Vault, Open Policy Agent, Aqua Security, and Twistlock is a plus.
- Client Collaboration: Work closely with internal teams and our financial services client to define resiliency requirements, address client needs, and ensure the operational excellence of mission-critical systems.
- On-Call Support: Participate in an on-call rotation to provide support outside regular working hours, ensuring that client-facing services remain available and performant as needed.
- Experience: At least 5 years of hands-on experience in Site Reliability Engineering, Resiliency Engineering, or related roles, particularly in highly regulated, mission-critical environments like financial services.
- Cross-Platform Management: Deep expertise with Linux, Windows, BSD VMs, and mainframe systems. Familiarity with container orchestration platforms like OpenShift, Kubernetes, and Docker; experience managing large-scale cloud environments (AWS, Azure, GCP).
- Monitoring & Observability Tools: Proficiency with monitoring and observability platforms such as Prometheus, Grafana, Splunk, New Relic, Datadog, AppDynamics, Nagios, Zabbix, and Elastic Stack (ELK). Strong understanding of how to leverage logs, metrics, and traces for proactive issue resolution.
- Cloud & Hybrid Infrastructure: Extensive experience with AWS, Azure, and GCP, managing hybrid infrastructures that span cloud and on-prem environments. Experience with cloud-native services and containerized applications is a must.
- Chaos Engineering: Proven experience implementing chaos engineering practices, using tools like Gremlin, Chaos Monkey, Litmus, Simian Army, or Pumba to test and validate system resilience.
- Automation & Infrastructure as Code: Strong background in infrastructure-as-code tools such as Terraform, Ansible, Chef, Puppet, SaltStack, and CloudFormation. Proficiency in automating cloud infrastructure, system provisioning, and configuration management.
- Incident Management: Proven track record in managing high-severity incidents, performing root cause analysis (RCA), and implementing long-term solutions to improve system reliability and prevent recurrence.
- Security & Compliance: Knowledge of industry-specific security and compliance standards, particularly in financial services, and experience implementing and maintaining secure systems. Familiarity with HashiCorp Vault, Aqua Security, and Twistlock.
- Programming & Scripting: Proficiency in Go, Python, Bash, or similar programming/scripting languages for automation, system management, and performance tuning.
- Collaboration Skills: Ability to effectively collaborate with clients, development teams, and operations teams to drive operational improvements and ensure a shared understanding of resiliency requirements.
- Mainframe Systems: Familiarity with mainframe environments and integration with modern technologies.
- Disaster Recovery & Business Continuity: Experience in designing and testing disaster recovery (DR) strategies to ensure that systems can rapidly recover from outages.
- Certifications: Industry certifications such as AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), Linux Professional Institute Certification, or other relevant credentials are desirable.
- Passionate Technologist: You stay up-to-date with the latest developments in both commercial and open-source tools, and often spend your evenings or weekends experimenting with new technologies, sharing knowledge, and contributing to the broader technical community.
- Innovative Mindset: You bring fresh ideas to the table, are always looking for better ways to do things, and have a proven track record of driving improvements in system design and resiliency.
- Tool Expertise: Hands-on experience with cutting-edge tools and technologies that improve automation, monitoring, and system performance, including Helm, Envoy, Consul, Istio, Traefik, Prometheus Operator, Service Mesh, and Kong.
- Demo & Showcase Skills: Candidates who can set up live demos or showcase their expertise with practical demonstrations during the interview process will be strongly preferred. This demonstrates hands-on experience and technical proficiency in real-world scenarios.
As a leading technology strategy firm, we are proud to support our financial services client in enhancing their platform and product resiliency. Join our team to work on cutting-edge technologies, drive innovation in infrastructure management, and make a real impact on the availability and scalability of critical financial services.
You'll be part of a collaborative team that values continuous learning and offers ample opportunities for professional growth.
City Information:
Hatboro, PA - Hatboro, located in Montgomery County, is a historic town known for its small-town charm and vibrant community. The town's downtown area features a mix of local shops, cafes, and restaurants, giving it a unique and welcomin