##SHPCE

Senior HPC Engineer

Location: Bulgaria, Romania | Working time: Full-time | Remote: Remote

About the job

Academic and commercial groups worldwide use our products to revolutionize deep learning and data analytics and to power data centers. Join the team building many of the world’s largest and fastest AI/HPC systems! We are looking for someone with excellent interpersonal skills who can work on a dynamic, customer-focused team.
As an expert, you will help us with the strategic challenges we encounter, including computing, networking, and storage design for large-scale, high-performance workloads, effective resource utilization in a heterogeneous computing environment, evolving our private/public cloud strategy, capacity modeling, and growth planning across our global computing environment.

Responsibilities

• Primary responsibilities will include deploying, managing, and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.
• Act as the domain expert for customers, from planning calls through implementation.
• Provide handover documentation and knowledge transfer to support customers as they begin rolling out some of the world’s most sophisticated systems!
• Build and improve our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions.

Requirements

• 8+ years providing in-depth support and deployment services, solving problems for hardware and software products.
• Knowledge and experience with Linux System Administration, process management, package management, task scheduling, kernel management, boot procedures/troubleshooting, performance reporting/optimization/logging, network routing/advanced networking (tuning and monitoring).
• Minimum five years of experience designing and operating large-scale compute infrastructure.
• Experience with cluster management technologies (Bright, xCAT, etc.).
• Minimum of a four-year degree from an accredited university or college or equivalent experience in computer science, electrical engineering, or computer engineering.
• Experience analyzing and tuning performance for a variety of HPC workloads.
• Working knowledge of cluster configuration management tools such as Ansible, Puppet, and Salt.
• Experience with HPC cluster job schedulers such as SLURM and LSF.
• In-depth understanding of container technologies like Docker, Singularity, Shifter, and Charliecloud.
• Proficiency in CentOS/RHEL and Ubuntu Linux distributions, as well as Python programming and Bash scripting.
• Experience with HPC workflows that use MPI
• Scripting proficiency (Bash, Ansible, etc.).
• Good interpersonal skills with the ability to maintain and deliver resolutions for customer-blocking issues as they arise.
• Strong organizational skills and ability to prioritize/multi-task easily with limited supervision.
• Experience with schedulers such as SLURM, LSF, UGE, etc.

Advantages

• Understanding of MLPerf benchmarking
• Familiarity with InfiniBand, including IPoIB and RDMA.
• Experience with GPU-focused hardware/software.
• Experience with MPI.
• Automation tooling background (Ansible, Salt, Puppet, etc.).
• Experience with Ethernet and storage technologies such as Lustre or GPFS.
• Background in Software Defined Networking and HPC cluster networking
• Familiarity with deep learning frameworks like PyTorch and TensorFlow
• Understanding fast, distributed storage systems like Lustre and GPFS for HPC workloads.

What we can offer

• Remote work and flexible working hours
• Additional private medical and dental insurance
• Monthly food vouchers
• Monthly transport coverage
• Professional and career benefits
• Online happy hours
• Internal sports competitions
• Top-quality work environment

19 hours ago