Machine Learning Operations Engineer
160k - 240k USD
On-site
Full Time
#Engineering
#MLOps
#Docker
#Kubernetes
#AWS
Together AI is a research-driven organization dedicated to making artificial intelligence more accessible and affordable by co-designing hardware, software, and models. We are looking for a Lead Machine Learning Operations Engineer to join our team on-site in the United States to build the systems and APIs that power large-scale model inference and fine-tuning for our customers.
Responsibilities
- Collaborate with our research, engineering, and sales teams to deploy and operate robust inference systems.
- Develop and maintain the documentation, services, and tools required for effective testing and automation.
- Improve the stability, scalability, and efficiency of our infrastructure and its use of system resources.
- Perform code and design reviews to ensure high-quality standards.
- Engage in an on-call rotation to manage and resolve critical system incidents.
Must-haves
- At least 5 years of professional experience building production-level machine learning inference or training systems.
- A bachelor’s degree in computer science or equivalent practical industry experience.
- Deep familiarity with modern machine learning, with a specific focus on large language models.
- Strong proficiency in DevOps practices, including CI/CD, automation, Docker, and Kubernetes.
- Experience working with major cloud providers like AWS, Google Cloud, or Azure.
- Expertise in programming languages such as Python or Go, alongside ML frameworks like PyTorch, TensorFlow, or scikit-learn.
- Professional fluency in English.
Benefits
- Base salary range of $160,000 - $240,000 USD for this full-time position.
- Equity compensation.
- Medical insurance.




