Lead HPC Network Engineer - AI Infrastructure
EPAM Systems · Colombia
Job description
About the role
EPAM is seeking a Lead HPC Network Engineer to define and drive the architecture of high‑performance network fabrics that support large‑scale AI and LLM workloads. The role will shape the technical vision for InfiniBand, RDMA, Ethernet, and Kubernetes‑based GPU clusters for a global technology client.
Key responsibilities
- Own the architectural roadmap for InfiniBand/RDMA and high‑speed Ethernet fabrics across GPU clusters.
- Design and evaluate network topologies (fat‑tree, Clos, rail‑optimized, Dragonfly) and create decision frameworks based on performance, scale and cost.
- Establish engineering standards for host‑side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement and PCIe topology.
- Lead performance engineering for RDMA/RoCE, NCCL/MSCCL and collective communication, and conduct root‑cause investigations.
- Define the reference architecture for Kubernetes networking on GPU clusters, covering CNI plugins, network policies, multi‑NIC pods and device plugins.
- Drive adoption of SmartNIC/DPU technologies such as NVIDIA BlueField, including SR‑IOV, offload, isolation and security use cases.
- Shape the network observability strategy by defining metrics, dashboards, alerts, congestion detection and SLO frameworks.
- Mentor senior engineers, influence client roadmaps and ensure end‑to‑end delivery of mission‑critical network platforms.
Required profile
- Proven leadership experience in designing and delivering large‑scale HPC/AI network platforms.
- Deep expertise in InfiniBand NDR/HDR, RDMA/RoCE and NVIDIA/Mellanox networking.
- Strong background in Linux host networking, PCIe/GPU/NIC topology and NUMA awareness.
- Hands‑on experience with Kubernetes networking for GPU clusters and related CNI solutions.
- Track record of performance engineering and root‑cause analysis for multi‑node GPU training workloads.
Required skills
- InfiniBand NDR/HDR
- RDMA / RoCE
- NVIDIA / Mellanox networking
- NCCL / MSCCL communication patterns
- Linux host networking and driver configuration
- PCIe, GPU and NIC topology, NUMA placement
- Kubernetes networking, CNI plugins, network policies
- SmartNIC / DPU (e.g., NVIDIA BlueField) and SR‑IOV
- Network observability tools and metrics
Posted 17 hours ago
Expires in 1 month