Site Reliability Engineer - Research - New York

Amber People, New York

Site Reliability Engineer - Research Computing

The Research Infrastructure Cloud HPC team is a group of experts solving computing problems in the critical path of Research within the business. We work directly with Research and Model Implementation teams and provide them with tools and compute resources to take their ideas from inception to real tradable products.

We are looking for an ambitious and operationally minded software engineer to join our team as we mature and scale our cloud HPC platform from a successful strategy-specific offering to the next iteration of our firm-wide Research platform.

Why join? Our client has a stellar 25-year track record and a reputation for excellence. Our goal is to be the best quantitative investment manager in the world, measured by the quality of our products, not their size. The company's very high employee-retention rate speaks for itself.

Our people are intellectually extraordinary, and our community is close-knit, down-to-earth, and diverse.

Responsibilities

We are a small, flat team sitting at the intersection of research, implementation, and systems infrastructure. Our team's responsibilities span many areas. Below is a sampling of the types of work you will be expected to do:

  • Design and implementation of cloud-based HPC systems. Our projects typically involve equal parts engineering and operations for success in our fast-moving environment. You will be expected to do both for projects small and large.
  • Running our HPC plant day-to-day. Our research environment is up 24/7, and we want to keep it that way. Everybody on the team contributes to the support of our plant, which thankfully is light because of our automation and quality work.
  • Implementing automation. We will always choose working smart over working hard. You will be responsible for conceiving and implementing automation, from CI/CD pipelines to production metrics and monitoring of our cloud HPC platform.
  • Capacity management and benchmark optimization. Our demand for compute is constant, and meeting it involves challenging problems in scaling our compute and optimizing it for research-critical workloads.
  • Obsessive user focus. All members of the team are expected to partner with researchers and engineers to deliver high-quality cloud HPC systems that are efficient and reliable. This includes leading projects to evolve the platform as our needs change.

Qualifications

  • 5+ years of software engineering and/or systems programming experience
  • 2+ years of experience working with a public cloud, AWS preferred
  • Mastery of at least one programming language for building production systems, Python preferred
  • Experience with a production configuration management tool, Salt/SaltStack preferred
  • Experience with a cloud-based infrastructure-as-code tool, Terraform preferred
  • Excellent written and verbal communication skills
  • Past experience working with or supporting researchers and/or other developers is a plus
  • Knowledge of Slurm or similar HPC schedulers and resource managers is a plus

Education

Bachelor’s degree in computer science, engineering, or a related field from a strong academic program.
