Lead HPC Network Engineer - AI Infrastructure

EPAM Systems · Colombia

New
Senior 🇬🇧 English
InfiniBand NDR/HDR

Job description

About the role

EPAM is seeking a Lead HPC Network Engineer to define and drive the architecture of high‑performance network fabrics that support large‑scale AI and LLM workloads. The role will shape the technical vision for InfiniBand, RDMA, Ethernet, and Kubernetes‑based GPU clusters for a global technology client.

Key responsibilities

  • Own the architectural roadmap for InfiniBand/RDMA and high‑speed Ethernet fabrics across GPU clusters.
  • Design and evaluate network topologies (Fat‑tree, Clos, Rail‑optimized, Dragonfly) and create decision frameworks based on performance, scale and cost.
  • Establish engineering standards for host‑side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement and PCIe topology.
  • Lead performance engineering for RDMA/RoCE, NCCL/MSCCL and collective communication, and conduct root‑cause investigations.
  • Define the reference architecture for Kubernetes networking on GPU clusters, covering CNI plugins, network policies, multi‑NIC pods and device plugins.
  • Drive adoption of SmartNIC/DPU technologies such as NVIDIA BlueField, including SR‑IOV, offload, isolation and security use cases.
  • Shape network observability strategy by defining metrics, dashboards, alerts, congestion detection and SLO frameworks.
  • Mentor senior engineers, influence client roadmaps and ensure end‑to‑end delivery of mission‑critical network platforms.

Required profile

  • Proven leadership experience in designing and delivering large‑scale HPC/AI network platforms.
  • Deep expertise in InfiniBand NDR/HDR, RDMA/RoCE and NVIDIA/Mellanox networking.
  • Strong background in Linux host networking, PCIe/GPU/NIC topology and NUMA awareness.
  • Hands‑on experience with Kubernetes networking for GPU clusters and related CNI solutions.
  • Track record of performance engineering and root‑cause analysis for multi‑node GPU training workloads.

Required skills

  • InfiniBand NDR/HDR
  • RDMA / RoCE
  • NVIDIA / Mellanox networking
  • NCCL / MSCCL communication patterns
  • Linux host networking and driver configuration
  • PCIe, GPU and NIC topology, NUMA placement
  • Kubernetes networking, CNI plugins, network policies
  • SmartNIC / DPU (e.g., NVIDIA BlueField, SR‑IOV)
  • Network observability tools and metrics

Frequently asked questions

The recruiter has not publicly disclosed the salary. You can apply and negotiate directly with EPAM Systems.

Posted 14 hours ago

Expires in 1 month
