1. RDMA – Remote Direct Memory Access
- Definition: Lets one system read or write another system's memory directly over the network, bypassing the remote CPU and OS kernel for very low latency.
- Use in AI: Used in multi-node training for fast parameter and gradient exchange (e.g., all-reduce).
- Products: NVIDIA BlueField DPUs, ConnectX NICs, DGX systems, AMD Pensando DPUs, Intel Mount Evans IPU, HPE Slingshot.
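The key semantic difference from socket I/O is that RDMA operations are one-sided: once a memory region is registered with the NIC, the initiator writes into it with no receive call or copy on the target's CPU. A toy Python model of that flow (purely conceptual; real code uses the ibverbs API, e.g. `ibv_reg_mr` and `ibv_post_send` — the class and function names here are illustrative):

```python
class RemoteNode:
    """Models a host that registers a memory region with its NIC."""
    def __init__(self, size):
        self.memory = bytearray(size)
        self.regions = {}  # rkey -> (offset, length), like registered MRs

    def register_mr(self, rkey, offset, length):
        # ~ ibv_reg_mr: pin a buffer and hand out a remote key
        self.regions[rkey] = (offset, length)

def rdma_write(target, rkey, data, offset=0):
    """Models a one-sided RDMA WRITE: bytes land directly in the
    target's registered region; no recv() runs on the target."""
    base, length = target.regions[rkey]
    if offset + len(data) > length:
        raise ValueError("access outside registered region")
    target.memory[base + offset : base + offset + len(data)] = data

node = RemoteNode(64)
node.register_mr(rkey=0x1234, offset=0, length=32)
rdma_write(node, 0x1234, b"gradients")
# node.memory now begins with b"gradients"
```

The target object never executes any code during the write, which is the property that makes RDMA attractive for parameter exchange.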
2. RoCE – RDMA over Converged Ethernet
- Definition: Implements RDMA over standard Ethernet; RoCEv2 runs over UDP/IP and typically requires a lossless fabric (PFC/ECN).
- Use in AI: Powers high-speed Ethernet-based GPU clusters.
- Products: NVIDIA SuperNIC, BlueField-3, Spectrum-X Switches, Intel Ethernet 800 series, AMD Xilinx Alveo.
3. InfiniBand
- Definition: A switched-fabric networking standard with native RDMA support, providing high bandwidth and low latency for HPC.
- Use in AI: Connects GPUs across nodes in AI supercomputers.
- Products: NVIDIA Quantum-2 switches, ConnectX-7 NICs, DGX SuperPOD; HPE/Cray Slingshot is a competing Ethernet-based HPC fabric.
4. GPUDirect
- Definition: A family of NVIDIA technologies that let peripherals (NICs, storage) read and write GPU memory directly, skipping bounce buffers in host memory.
- Use in AI: Cuts data-copy latency and CPU overhead on the data path.
- Products: NVIDIA BlueField, ConnectX (GPUDirect RDMA), DGX systems, OVX, AMD Smart Access Memory, Intel DDIO (Data Direct I/O).
5. NVLink / NVSwitch
- Definition: Proprietary NVIDIA GPU interconnect for high-speed, high-bandwidth communication.
- Use in AI: Enables model/data parallelism across GPUs.
- Products: NVIDIA A100/H100, DGX H100, DGX GH200, AMD MI300X (Infinity Fabric), Intel EMIB.
6. PCIe (Peripheral Component Interconnect Express)
- Definition: Standard I/O bus for attaching GPUs, SSDs, and DPUs to CPUs.
- Use in AI: Determines host-to-GPU and GPU-to-storage bandwidth.
- Products: BlueField-2 (PCIe Gen4), BlueField-3 (PCIe Gen5), AMD EPYC and Intel Xeon CPUs (PCIe Gen5).
7. SR-IOV (Single Root I/O Virtualization)
- Definition: Lets multiple VMs or containers securely share one physical device by exposing it as independent virtual functions (VFs).
- Use in AI: Enables multitenancy and isolation.
- Products: NVIDIA BlueField, ConnectX, Spectrum-X, Intel Mount Evans IPU, AWS Nitro.
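On Linux, VFs are enabled by writing the desired count to the device's `sriov_numvfs` sysfs attribute, after checking the limit in `sriov_totalvfs`. A hedged sketch of that procedure (the helper name is illustrative; on real hardware the path is under `/sys/bus/pci/devices/` and requires root):

```python
from pathlib import Path

def enable_vfs(device_dir: str, num_vfs: int) -> int:
    """Enable num_vfs SR-IOV virtual functions on a PCI device by
    writing its sysfs sriov_numvfs attribute. device_dir is a path
    like /sys/bus/pci/devices/0000:03:00.0 (illustrative helper)."""
    dev = Path(device_dir)
    total = int((dev / "sriov_totalvfs").read_text())
    if not 0 <= num_vfs <= total:
        raise ValueError(f"device supports at most {total} VFs")
    # The kernel requires resetting to 0 before setting a new nonzero count.
    (dev / "sriov_numvfs").write_text("0")
    (dev / "sriov_numvfs").write_text(str(num_vfs))
    return num_vfs
```

Each enabled VF then appears as its own PCI function that can be passed through to a tenant VM.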
8. NVMe-oF (NVMe over Fabrics)
- Definition: Extends the NVMe protocol across a network fabric (RDMA, TCP, or Fibre Channel) so remote SSDs behave like local ones.
- Use in AI: Shared fast storage for training jobs.
- Products: BlueField SNAP, DGX Storage, Pure Storage AIRI, Intel D7-P5810, AMD EPYC NVMe-oF.
9. SNAP (Software-defined Network Accelerated Processing)
- Definition: NVIDIA’s hardware-accelerated NVMe and VirtIO-blk storage emulation engine on BlueField.
- Use in AI: Disaggregated storage, edge file systems.
- Products: BlueField-2, BlueField-3, Intel VROC, VMware vSAN on DPUs.
10. PTP / IEEE 1588v2
- Definition: Protocol that synchronizes clocks across a network to sub-microsecond precision.
- Use in AI: Needed for time-sensitive AI training and inference.
- Products: BlueField-3, Spectrum-X, HPE Aruba, Cisco Nexus, Broadcom Trident4.
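PTP's delay request–response exchange recovers the clock offset from four timestamps, assuming a symmetric network path. A minimal sketch of that arithmetic (pure Python; timestamp values below are made up for illustration):

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Compute slave-clock offset and mean path delay from one PTP
    delay request-response exchange.

    t1: master sends Sync          (master clock)
    t2: slave receives Sync        (slave clock)
    t3: slave sends Delay_Req      (slave clock)
    t4: master receives Delay_Req  (master clock)

    Assumes the forward and reverse path delays are equal.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay

# Slave clock 500 ns ahead of master, one-way delay 1000 ns:
offset, delay = ptp_offset_and_delay(t1=0, t2=1500, t3=2000, t4=2500)
# offset == 500.0, delay == 1000.0
```

The slave then steers its clock by the computed offset; hardware timestamping in the NIC (as on BlueField-3) is what removes software jitter from t1–t4.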
11. Time-Triggered Scheduling / PCC (Programmable Congestion Control)
- Definition: Tools to manage packet timing and prevent congestion in AI fabrics.
- Use in AI: Ensures fairness in multi-tenant AI clusters.
- Products: NVIDIA SuperNIC, Spectrum-X, AWS EFA, Intel Ethernet congestion control tuning.
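Programmable congestion control lets operators define how a flow's sending rate reacts to signals such as ECN marks or RTT spikes. A toy additive-increase/multiplicative-decrease loop conveys the idea (illustrative only, not any vendor's actual algorithm; all parameter values are assumptions):

```python
def aimd_step(rate_gbps, congested, increase=1.0, decrease=0.5,
              min_rate=1.0, max_rate=100.0):
    """One AIMD update: back off multiplicatively on a congestion
    signal (e.g., an ECN mark), otherwise probe additively."""
    if congested:
        rate_gbps *= decrease
    else:
        rate_gbps += increase
    return max(min_rate, min(max_rate, rate_gbps))

# A flow ramps up, hits congestion once, then recovers:
rate = 10.0
for signal in [False, False, True, False]:
    rate = aimd_step(rate, signal)
# rate: 10 -> 11 -> 12 -> 6 -> 7
```

In a multi-tenant AI fabric, running such a control loop per flow (in the NIC/DPU rather than the host stack) is what keeps one tenant's all-reduce from starving another's.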
12. ASAP² – Accelerated Switching and Packet Processing
- Definition: NVIDIA’s SDN offload engine that moves vSwitch datapaths (e.g., OVS) and packet classification into NIC/DPU hardware.
- Use in AI: Infrastructure offload, telemetry.
- Products: BlueField DPUs, AMD Pensando DSC, Intel Mount Evans IPU.
13. DOCA – Data Center Infrastructure-on-a-Chip Architecture
- Definition: NVIDIA’s SDK and runtime for building infrastructure services (networking, security, storage) on BlueField.
- Use in AI: Custom firewall, observability, NVMe services.
- Products: BlueField-2, BlueField-3, SuperNIC, AMD Pensando SDK, Intel IPDK.