The systems we build are the foundation of our research. You'll touch all parts of our code and infrastructure, whether that’s building large-scale distributed systems, improving the robustness and reliability of large language model training, optimizing network architecture, or improving our developer tooling. This role offers an opportunity for a strong systems engineer to work closely with ML engineers and researchers to support cutting-edge ML research and deployment.


Representative projects
  • Own a many-thousand-node Kubernetes cluster to support ML research
  • Pair with ML engineers to design and optimize infrastructure for serving large ML models
  • Design and build fault-tolerant infrastructure to support running large-scale jobs reliably despite failures of individual nodes
  • Migrate a cloud deployment to Terraform
  • Optimize load-balancing strategies to efficiently use multiple zones
  • Add alerts and playbooks for cluster monitoring
You might be a good fit if you
  • Have significant experience working with cloud infrastructure
  • Are comfortable debugging large-scale software systems
  • Enjoy close collaboration with engineers and researchers with a variety of backgrounds and expertise
  • Care about the societal impacts of your work
  • Pick up slack, even if it goes outside your job description
Strong candidates may also have experience with some of the following
  • Operating cloud infrastructure
  • Terraform
  • High-performance networking
  • Python internals
  • Low-level Linux interfaces and administration

Employee Benefits:

  • Flexible Work Schedule
  • Five-day work week
  • Group-based health insurance plan
  • Provident Fund
  • Paid Time Off (10 days of annual leave)

To apply for this opportunity, please submit your application here.

Questions?

If you have any questions, do not hesitate to contact us.
