As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. We seek an expert to build and operate these clusters with high reliability, efficiency, and performance, and to drive foundational improvements and automation that raise researchers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.
SRE’s culture of diversity, intellectual curiosity, problem solving, and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.
What you’ll be doing:
In this role you will build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions. You will also build and maintain deep learning AI-HPC GPU clusters at scale and support our researchers in running their workflows on our clusters, including performance analysis and optimization of deep learning workflows. You will design, implement, and support the operational and reliability aspects of large-scale distributed systems, with a focus on performance at scale, real-time monitoring, logging, and alerting.
What we need to see:
Ways to stand out from the crowd: