As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that powers all AI research across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve researchers productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting dynamic day-to-day work. SRE’s culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.
What you’ll be doing:
In this role you will be building and improving our ecosystem around GPU-accelerated computing including developing large scale automation solutions. You will also be maintaining and building deep learning AI-HPC GPU clusters at scale and supporting our researchers to run their flows on our clusters including performance analysis and optimizations of deep learning workflows. You will design, implement and support operational and reliability aspects of large scale distributed systems with focus on performance at scale, real time monitoring, logging, and alerting.
What we need to see:
Ways to stand out from the crowd:
#LI-Hybrid
Huawei’s Munich Research Center is responsible for advanced technology research, architectural development, design and strategic engineering of our products. The...
How to applyResponsibilities TikTok is the leading destination for short-form mobile video. At TikTok, our mission is to inspire creativity and bring...
How to applyJob Requisition ID: 37537 Learn from the best in the business Flexible work arrangements – work in a way that...
How to applyDescription & Requirements Bloomberg’s Engineering AI department has 350+ AI practitioners building highly sought after products and features that often...
How to applyAmazon Search owns the global search engine for the Amazon shopping site. The Japan team improves the search engine and...
How to applyResponsibilities TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy. TikTok...
How to apply