1. RDMA – Remote Direct Memory Access
- Definition: Lets one system read or write another system's memory directly over the network, bypassing the remote CPU and OS kernel for very low latency.
- Use in AI: Used in multi-node training for fast parameter and gradient exchange (e.g., all-reduce).
- Products: NVIDIA BlueField DPUs, ConnectX NICs, DGX systems, AMD Pensando DPUs, Intel Mount Evans IPU, HPE Slingshot.
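The key semantic difference from socket I/O is that RDMA operations are one-sided: once a memory region is registered with the NIC, the initiator writes into it with no receive call or copy on the target's CPU. A toy Python model of that flow (purely conceptual; real code uses the ibverbs API, e.g. `ibv_reg_mr` and `ibv_post_send` — the class and function names here are illustrative):

```python
class RemoteNode:
    """Models a host that registers a memory region with its NIC."""
    def __init__(self, size):
        self.memory = bytearray(size)
        self.regions = {}  # rkey -> (offset, length), like registered MRs

    def register_mr(self, rkey, offset, length):
        # ~ ibv_reg_mr: pin a buffer and hand out a remote key
        self.regions[rkey] = (offset, length)

def rdma_write(target, rkey, data, offset=0):
    """Models a one-sided RDMA WRITE: bytes land directly in the
    target's registered region; no recv() runs on the target."""
    base, length = target.regions[rkey]
    if offset + len(data) > length:
        raise ValueError("access outside registered region")
    target.memory[base + offset : base + offset + len(data)] = data

node = RemoteNode(64)
node.register_mr(rkey=0x1234, offset=0, length=32)
rdma_write(node, 0x1234, b"gradients")
# node.memory now begins with b"gradients"
```

The target object never executes any code during the write, which is the property that makes RDMA attractive for parameter exchange.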
2. RoCE – RDMA over Converged Ethernet
- Definition: Implements RDMA over standard Ethernet; RoCEv2 runs over UDP/IP and typically requires a lossless fabric (PFC/ECN).
- Use in AI: Powers high-speed Ethernet-based GPU clusters.
- Products: NVIDIA SuperNIC, BlueField-3, Spectrum-X Switches, Intel Ethernet 800 series, AMD Xilinx Alveo.
3. InfiniBand
- Definition: A switched-fabric networking standard with native RDMA support, providing high bandwidth and low latency for HPC.
- Use in AI: Connects GPUs across nodes in AI supercomputers.
- Products: NVIDIA Quantum-2 switches, ConnectX-7 NICs, DGX SuperPOD; HPE/Cray Slingshot is a competing Ethernet-based HPC fabric.
4. GPUDirect
- Definition: A family of NVIDIA technologies that let peripherals (NICs, storage) read and write GPU memory directly, skipping bounce buffers in host memory.
- Use in AI: Cuts data-copy latency and CPU overhead on the data path.
- Products: NVIDIA BlueField, ConnectX (GPUDirect RDMA), DGX systems, OVX, AMD Smart Access Memory, Intel DDIO (Data Direct I/O).
5. NVLink / NVSwitch
- Definition: Proprietary NVIDIA GPU interconnect for high-speed, high-bandwidth communication.
- Use in AI: Enables model/data parallelism across GPUs.
- Products: NVIDIA A100/H100, DGX H100, DGX GH200, AMD MI300X (Infinity Fabric), Intel EMIB.
6. PCIe (Peripheral Component Interconnect Express)
- Definition: Standard I/O bus for attaching GPUs, SSDs, and DPUs to CPUs.
- Use in AI: Determines host-to-GPU and GPU-to-storage bandwidth.
- Products: BlueField-2 (PCIe Gen4), BlueField-3 (PCIe Gen5), AMD EPYC and Intel Xeon CPUs (PCIe Gen5).
7. SR-IOV (Single Root I/O Virtualization)
- Definition: Lets multiple VMs or containers securely share one physical device by exposing it as independent virtual functions (VFs).
- Use in AI: Enables multitenancy and isolation.
- Products: NVIDIA BlueField, ConnectX, Spectrum-X, Intel Mount Evans IPU, AWS Nitro.
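On Linux, VFs are enabled by writing the desired count to the device's `sriov_numvfs` sysfs attribute, after checking the limit in `sriov_totalvfs`. A hedged sketch of that procedure (the helper name is illustrative; on real hardware the path is under `/sys/bus/pci/devices/` and requires root):

```python
from pathlib import Path

def enable_vfs(device_dir: str, num_vfs: int) -> int:
    """Enable num_vfs SR-IOV virtual functions on a PCI device by
    writing its sysfs sriov_numvfs attribute. device_dir is a path
    like /sys/bus/pci/devices/0000:03:00.0 (illustrative helper)."""
    dev = Path(device_dir)
    total = int((dev / "sriov_totalvfs").read_text())
    if not 0 <= num_vfs <= total:
        raise ValueError(f"device supports at most {total} VFs")
    # The kernel requires resetting to 0 before setting a new nonzero count.
    (dev / "sriov_numvfs").write_text("0")
    (dev / "sriov_numvfs").write_text(str(num_vfs))
    return num_vfs
```

Each enabled VF then appears as its own PCI function that can be passed through to a tenant VM.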
8. NVMe-oF (NVMe over Fabrics)
- Definition: Extends the NVMe protocol across a network fabric (RDMA, TCP, or Fibre Channel) so remote SSDs behave like local ones.
- Use in AI: Shared fast storage for training jobs.
- Products: BlueField SNAP, DGX Storage, Pure Storage AIRI, Intel D7-P5810, AMD EPYC NVMe-oF.
9. SNAP (Software-defined Network Accelerated Processing)
- Definition: NVIDIA’s hardware-accelerated NVMe and VirtIO-blk storage emulation engine on BlueField.
- Use in AI: Disaggregated storage, edge file systems.
- Products: BlueField-2, BlueField-3, Intel VROC, VMware vSAN on DPUs.
10. PTP / IEEE 1588v2
- Definition: Protocol that synchronizes clocks across a network to sub-microsecond precision.
- Use in AI: Needed for time-sensitive AI training and inference.
- Products: BlueField-3, Spectrum-X, HPE Aruba, Cisco Nexus, Broadcom Trident4.
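PTP's delay request–response exchange recovers the clock offset from four timestamps, assuming a symmetric network path. A minimal sketch of that arithmetic (pure Python; timestamp values below are made up for illustration):

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Compute slave-clock offset and mean path delay from one PTP
    delay request-response exchange.

    t1: master sends Sync          (master clock)
    t2: slave receives Sync        (slave clock)
    t3: slave sends Delay_Req      (slave clock)
    t4: master receives Delay_Req  (master clock)

    Assumes the forward and reverse path delays are equal.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay

# Slave clock 500 ns ahead of master, one-way delay 1000 ns:
offset, delay = ptp_offset_and_delay(t1=0, t2=1500, t3=2000, t4=2500)
# offset == 500.0, delay == 1000.0
```

The slave then steers its clock by the computed offset; hardware timestamping in the NIC (as on BlueField-3) is what removes software jitter from t1–t4.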
11. Time-Triggered Scheduling / PCC (Programmable Congestion Control)
- Definition: Tools to manage packet timing and prevent congestion in AI fabrics.
- Use in AI: Ensures fairness in multi-tenant AI clusters.
- Products: NVIDIA SuperNIC, Spectrum-X, AWS EFA, Intel Ethernet congestion control tuning.
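Programmable congestion control lets operators define how a flow's sending rate reacts to signals such as ECN marks or RTT spikes. A toy additive-increase/multiplicative-decrease loop conveys the idea (illustrative only, not any vendor's actual algorithm; all parameter values are assumptions):

```python
def aimd_step(rate_gbps, congested, increase=1.0, decrease=0.5,
              min_rate=1.0, max_rate=100.0):
    """One AIMD update: back off multiplicatively on a congestion
    signal (e.g., an ECN mark), otherwise probe additively."""
    if congested:
        rate_gbps *= decrease
    else:
        rate_gbps += increase
    return max(min_rate, min(max_rate, rate_gbps))

# A flow ramps up, hits congestion once, then recovers:
rate = 10.0
for signal in [False, False, True, False]:
    rate = aimd_step(rate, signal)
# rate: 10 -> 11 -> 12 -> 6 -> 7
```

In a multi-tenant AI fabric, running such a control loop per flow (in the NIC/DPU rather than the host stack) is what keeps one tenant's all-reduce from starving another's.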
12. ASAP² – Accelerated Switching and Packet Processing
- Definition: NVIDIA’s SDN offload engine that moves vSwitch datapaths (e.g., OVS) and packet classification into NIC/DPU hardware.
- Use in AI: Infrastructure offload, telemetry.
- Products: BlueField DPUs, AMD Pensando DSC, Intel Mount Evans IPU.
13. DOCA – Data Center Infrastructure-on-a-Chip Architecture
- Definition: NVIDIA’s SDK and runtime for building infrastructure services (networking, security, storage) on BlueField.
- Use in AI: Custom firewall, observability, NVMe services.
- Products: BlueField-2, BlueField-3, SuperNIC, AMD Pensando SDK, Intel IPDK.