Academic and commercial groups worldwide use our products to revolutionize deep learning and data analytics and to power data centers. Join the team building many of the world’s largest and fastest AI/HPC systems! We are looking for someone who works well on a dynamic, customer-focused team and has excellent interpersonal skills.
As an expert, you will help us with the strategic challenges we encounter, including computing, networking, and storage design for large-scale, high-performance workloads, effective resource utilization in a heterogeneous computing environment, evolving our private/public cloud strategy, capacity modeling, and growth planning across our global computing environment.
• Primary responsibilities will include deploying, managing, and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.
• Serve as the domain expert for customers, from planning calls through implementation.
• Provide handover documentation and knowledge transfer to support customers as they begin rolling out some of the world’s most sophisticated systems!
• Build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions.
• 8+ years of experience providing in-depth support and deployment services and solving problems for hardware and software products.
• Knowledge and experience with Linux System Administration, process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, network routing/advanced networking (tuning and monitoring).
• Minimum five years of experience designing and operating large-scale compute infrastructure.
• Experience with cluster management technologies (Bright, xCAT, etc.).
• Minimum of a four-year degree from an accredited university or college or equivalent experience in computer science, electrical engineering, or computer engineering.
• Experience analyzing and tuning performance for a variety of HPC workloads.
• Working knowledge of cluster configuration management tools such as Ansible, Puppet, and Salt.
• Experience with HPC cluster job schedulers such as SLURM and LSF.
• In-depth understanding of container technologies such as Docker, Singularity, Shifter, and Charliecloud.
• Proficient in CentOS/RHEL and Ubuntu Linux distros, as well as Python programming and Bash scripting.
• Experience with HPC workflows that use MPI
• Scripting proficiency (Bash, Ansible, etc.).
• Good interpersonal skills with the ability to maintain and deliver resolutions for customer-blocking issues as they arise.
• Strong organizational skills and ability to prioritize/multi-task easily with limited supervision.
• Experience with schedulers such as SLURM, LSF, UGE, etc.
• Understanding of MLPerf benchmarking
• Familiarity with InfiniBand with IBOP and RDMA
• Experience with GPU-focused hardware/software.
• Experience with MPI.
• Automation tooling background (Ansible, Salt, Puppet, etc.).
• Ethernet and storage technologies such as Lustre or GPFS.
• Background in Software Defined Networking and HPC cluster networking
• Familiarity with deep learning frameworks like PyTorch and TensorFlow
• Understanding of fast, distributed storage systems such as Lustre and GPFS for HPC workloads.
• Remote work and flexible working hours
• Additional private medical and dental insurance
• Monthly food vouchers
• Monthly transport coverage
• Professional and career benefits
• Online happy hours
• Internal sports competitions
• Top-quality work environment