Senior Site Reliability Engineer

New York

Overview:

We are seeking a Site Reliability Engineer (SRE) to lead and mentor our SRE and TechOps teams with a focus on automation to drive accountability, efficiency, and continuous improvement. In this role, you will build and maintain observability frameworks to ensure system reliability, performance, and scalability, while fostering a culture of innovation through iterative enhancements.

Leading a dynamic team, you will ensure smooth platform operations, drive automation, and streamline workflows to reduce manual interventions. Supporting both gameday and non-gameday activities, you will set strategic goals to maintain high performance across teams, collaborating with key technology partners and gaining insights into soccer operations to support the rapid growth of a major sports league.

Responsibilities:

  • Build and Maintain Observability Frameworks: Develop and implement observability frameworks to monitor the health and performance of our services, ensuring uptime and reliability.
  • Incident Response and On-Call Support: Be the first line of defense in troubleshooting and resolving incidents without relying on runbooks, using strong problem-solving skills.
  • API Testing: Perform thorough API testing for published content using tools like Postman and Cypress to ensure accuracy and performance.
  • Infrastructure as Code: Utilize Terraform for managing infrastructure, including ServiceNow integrations, and automate workflows.
  • Monitoring and Logging: Leverage Datadog, or equivalent tools such as New Relic or Splunk, to set up monitoring, logging, and alerting systems.
  • Collaboration and Communication: Work closely with cross-functional teams, including developers, operations, and product managers, to ensure seamless integration and deployment of services.
  • AWS Resources Management: Manage and optimize AWS resources, including EKS and ECS, to ensure scalability and cost-efficiency.
  • CI/CD Pipeline Management: Use GitLab pipelines for continuous integration and deployment, ensuring smooth and automated delivery of code changes.
  • Integration: Integrate tools like ServiceNow with Slack or Asana to streamline workflows and enhance team communication.
  • Exhibit strong Team Leadership by effectively helping manage a team of highly skilled consultants and full-time professionals, cultivating a culture of innovation, accountability, and continuous improvement across our enterprise operations.
  • Apply data-driven insights for thorough Incident Analysis, enabling the identification of trends and proactive addressing of vulnerabilities in collaboration with other tech leaders.
  • Demonstrate a commitment to staying abreast of industry trends, emerging technologies, and standard methodologies, aiming to find opportunities for ongoing improvement and innovation.
  • A bachelor's degree in a relevant field, such as Computer Science, Information Technology, or a related field. Advanced degrees or certifications (e.g., ITIL, AWS, Azure) are highly desirable and can provide a competitive edge.
  • 7+ years of experience, including 5+ years in Cloud Expertise and Technical Operations: a proven background in architecting and managing cloud solutions (AWS, Azure, Google Cloud) and hands-on experience in complex technology operations environments spanning infrastructure, network, security, and incident management. Also 2+ years in managing or mentoring roles within technology operations (ITSM/ITOM) or a related field.
  • Proficiency in implementing automation tools and a proven ability to drive automation excellence within the organization, improving efficiency and reliability.

Qualifications:

  • Startup Mentality: Comfortable in a fast-paced environment where wearing multiple hats is the norm.
  • Collaborative: Strong team player with excellent communication skills.
  • Curiosity and Passion: A genuine passion for technology, with a strong desire to learn and explore new tools and methodologies.
  • Love of Learning: While a formal computer science degree is required, a solid foundation in coding and problem-solving, whether self-taught or through experience, is essential.
  • Cloud First: Experience with AWS is mandatory; familiarity with GCP and Azure is a plus.
  • Programming Languages: Proficiency in Node.js, Python; familiarity with Go, React/React Native is a plus.
  • Infrastructure as Code: Experience with Terraform, including integrating ServiceNow with other tools; ETL experience between third parties is a bonus.
  • API Data Quality checks and Frontend Testing: Hands-on experience with Cypress, Postman, and monitoring tools like Datadog (or equivalents like New Relic or Splunk).
  • Cloud Infrastructure: Strong understanding of AWS services, particularly EKS and ECS.
  • CI/CD Pipelines: Experience with GitLab for managing pipelines and automating deployments.
  • Observability: Expertise in setting up and maintaining observability frameworks to monitor and improve system reliability.
  • Troubleshooting: Excellent problem-solving and analytical abilities.
  • Ability to work effectively in a fast-paced team environment.
  • Strong interpersonal skills and the ability to communicate effectively, both in writing and verbally.
  • Demonstrated decision-making and problem-solving skills.
  • Meticulous, with the ability to multi-task and meet deadlines with minimal supervision.
  • Ability to work non-traditional hours, including evenings, weekends, and holidays, and to travel occasionally.

Total Rewards

Starting Base Salary: $130,000 - $155,000. MLS/SUM base salaries are contingent upon several factors including individual qualifications, market financials, and operational business needs.

We are committed to providing a Total Rewards package that attracts, supports, engages, and retains talent through the following:

  • Benefits – comprehensive and competitive medical, dental, and vision benefits, as well as a suite of programs to promote well-being including a $500 Wellness Reimbursement. A generous PTO offering and a hybrid Office/Remote Work Schedule are also offered to promote Work-Life balance!
  • Career & Professional Development – on-the-job training, feedback, and ongoing educational opportunities to continue your personal and professional development.
  • Employee Engagement – office perks, discounts and employee events that go “beyond the traditional paycheck” to make you feel a part of our team and inspire you to elevate the Game!

We are an equal opportunity employer, and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity or expression, pregnancy, age, national origin, or disability status.
