The systems we build are the foundation of our research. You'll touch all parts of our code and infrastructure, whether that's building large-scale distributed systems, improving the robustness and reliability of large language model training, optimizing network architecture, or improving our developer tooling. This role offers an opportunity for a strong systems engineer to work closely with ML engineers and researchers to support cutting-edge ML research and deployment.
Representative projects
- Own a many-thousand-node Kubernetes cluster to support ML research
- Pair with ML engineers to design and optimize infrastructure for serving large ML models
- Design and build fault-tolerant infrastructure to support running large-scale jobs reliably despite failures of individual nodes
- Migrate a cloud deployment to Terraform
- Optimize load-balancing strategies to efficiently use multiple zones
- Add alerts and playbooks for cluster monitoring
You might be a good fit if you
- Have significant experience working with cloud infrastructure
- Are comfortable debugging large-scale software systems
- Enjoy collaborating closely with engineers and researchers from a variety of backgrounds and areas of expertise
- Care about the societal impacts of your work
- Pick up the slack, even when the work falls outside your job description
Strong candidates may also have experience with some of the following
- Operating cloud infrastructure
- Terraform
- High-performance networking
- Python internals
- Low-level Linux interfaces and administration