#LA-NVI

Linux Administrator

Location: Poland, Romania
Working time: Full-time
Remote: Remote

About the job

NVIDIA is known for developing integrated circuits, which are used in everything from electronic game consoles to personal computers (PCs). The company is a leading manufacturer of high-end graphics processing units (GPUs).

The individual in this role will actively monitor the clusters to identify and resolve any issues that may arise, collaborating closely with various teams as necessary. The job involves troubleshooting a wide range of problems, spanning from hardware and network issues to Kubernetes or other Linux service complications.

This role demands a dynamic and proactive approach to maintaining the stability and performance of our clusters, contributing significantly to our cutting-edge HPC and AI initiatives.

Responsibilities

Linux Administration:

  • Identify log file locations and use the logs effectively for diagnostics.
  • Set up, configure, and troubleshoot servers.
  • Work proficiently with Linux networking, with an emphasis on high-performance computing and AI environments.

Kubernetes:

  • Navigate and interpret log files to troubleshoot issues within Kubernetes.
  • Diagnose and rectify problems that occur within the Kubernetes ecosystem.

L2/L3 Networking – Cumulus Switches:

  • Manage and configure networking infrastructure at both L2 and L3 levels, especially with Cumulus switches.

GPU Experience:

  • Utilize and troubleshoot GPU technologies in the cluster environment.

InfiniBand:

  • Manage and maintain InfiniBand infrastructure within the clusters.

Hardware Troubleshooting:

  • Identify and resolve hardware issues that affect the cluster’s performance.

Requirements

  • Proven experience in Linux systems administration and troubleshooting.
  • Proficiency in Kubernetes and its associated ecosystem.
  • Strong understanding of L2 and L3 Ethernet networking, particularly with Cumulus switches.
  • Experience with GPU technologies in a cluster environment.
  • Familiarity with InfiniBand networking.
  • Knowledge of storage systems such as DDN, VAST, GPFS, etc.
  • Excellent problem-solving and diagnostic skills.
  • Ability to collaborate with diverse teams to resolve complex issues efficiently.

Advantages

  • Prior experience in supporting high-performance computing or AI clusters is a plus.

What we can offer

  • Additional 20 days of paid leave
  • Remote work and flexible working hours
  • Professional and career benefits
  • Top-quality work environment
  • Online courses
  • Online sports activities

If you are looking for stability, professional growth, a long-term career, and technology challenges at sought-after companies, come and join us today! One last thing: if you have many of these skills but not all of them, please still apply. We love to teach those who are willing to learn.

19 hours ago