The systems we build are the foundation of our research. You'll touch all parts of our code and infrastructure, whether that’s building large-scale distributed systems, improving the robustness and reliability of large language model training, optimizing network architecture, or improving our developer tooling. This role offers an opportunity for a strong systems engineer to work closely with ML engineers and researchers to support cutting-edge ML research and deployment.

Representative projects
  • Own a many-thousand-node Kubernetes cluster to support ML research
  • Pair with ML engineers to design and optimize infrastructure for serving large ML models
  • Design and build fault-tolerant infrastructure to support running large-scale jobs reliably despite failures of individual nodes
  • Migrate a cloud deployment to Terraform
  • Optimize load-balancing strategies to efficiently use multiple zones
  • Add alerts and playbooks for cluster monitoring

You might be a good fit if you
  • Have significant experience working with cloud infrastructure
  • Are comfortable debugging large-scale software systems
  • Enjoy close collaboration with engineers and researchers with a variety of backgrounds and expertise
  • Care about the societal impacts of your work
  • Pick up slack, even if it goes outside your job description

Strong candidates may also have experience with some of the following
  • Operating cloud infrastructure
  • Terraform
  • High-performance networking
  • Python internals
  • Low-level Linux interfaces and administration

Employee Benefits:

  • Flexible Work Schedule
  • Five-day work week
  • Group-based health insurance plan
  • Provident Fund
  • Paid Time Off (10 days of annual leave)

To apply for this opportunity, please submit your application here.


If you have any questions, do not hesitate to contact us.
