We use cookies. Find out more about it here. By continuing to browse this site you are agreeing to our use of cookies.
#alert
Back to search results
New

Director of AI Infrastructure

Microsoft
United States, Washington, Redmond
Jan 31, 2025
OverviewWe are Microsoft Research, a leading industrial research laboratory comprised of over 1,000 computer scientists, engineers and technical staff working across the United States, United Kingdom, China, India, Canada, and the Netherlands. We are seeking a Director of AI infrastructure to join our dynamic team in Microsoft Research, where we are at the forefront of transforming scientific research through cutting-edge technology. As a key member of this team, you will play a pivotal role in conceptualizing, architecting, and implementing innovative solutions that drive our ambitious goals. Collaborating with partners across Microsoft Research, Microsoft and the Industry, you will have the opportunity to shape the future of machine learning infrastructure and how its leveraged for AI-driven research. This is a management and engineering role that is responsible for the delivery of next-generation AI/ML training and inference systems in a hybrid environment. This position is responsible for ownership of graphics processing unit (GPU) cluster delivery, hardware strategy, performance optimization of state-of-the-art systems, team coordination, and the performance development of their team. They will collaborate with PM counterparts to provide strategic planning, collaboration with internal and external teams, and manage GPU capacity across a global organization. This role requires a deep understanding of industry trends, strategic initiatives, and the ability to align services and staffing skills accordingly. This position works closely with various teams and stakeholders to ensure the successful implementation, management and utilization of services, hardware, and capacity. Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.
ResponsibilitiesLeverage industry expertise, understanding and experience to envision, develop and deliver secure services that align with organizational needs and accelerate research. Identify, engage and align with strategic initiatives in the company as well as industry trends, assuring that service offerings and strategy both align and complement emerging initiatives and trends. Assure team skills are aligned with service offerings, industry trends, company strategy and organizational needs. Maintain relationships between relevant groups within the organization and company to identify collaboration opportunities, service improvements and changes, hardware and network pilots that align with key research efforts. Define the services offered and the gaps they fill, including demand, costs, and lifecycle planning. Be the final point of escalation for service and system failures, owning communication with customers, engineering, program management, leadership, and other stakeholders. Provide leadership, hiring, spend control, policy setting, and enforcement to keep GPU clusters running, highly utilized and training job efficiency above company averages. Ensure team accountability to meet customer needs through planning, documentation, training, support feedback and incident management. Ensure provisions for monitoring, alerting, and resolving issues are in place and monitor daily SLA reports to drive key insights, improvements, and potential service changes. Identify impactful projects running on capable hardware and offer engineering to optimize and scale-out training to increase efficiency and utilization of GPU workloads. Team management and development such as conduct 1:1s, performance reviews, planning agendas for team, engineering, quarterly business review meetings, partner with program management, own ADO strategy and backlog reviews and drive career growth, development and recruiting efforts. Embody our culture and values.
Applied = 0

(web-6f6965f9bf-tv2z2)