
Key Terminologies in AI-Based Infrastructure


1. RDMA – Remote Direct Memory Access

  • Definition: Enables direct memory access from one system to another over the network, bypassing the CPU for ultra-low latency.
  • Use in AI: Used in multi-node training for fast parameter exchange.
  • Products: NVIDIA BlueField DPUs, ConnectX NICs, DGX systems, AMD Pensando DPUs, Intel Mount Evans IPU, HPE Slingshot.
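
A quick way to see whether a host has RDMA-capable hardware is to look under /sys/class/infiniband, which the RDMA drivers populate. The sketch below is a minimal example assuming a Linux host with rdma-core drivers loaded; device and port names will vary by NIC.

```python
import os

SYSFS_IB = "/sys/class/infiniband"  # populated by RDMA-capable drivers (mlx5, irdma, etc.)

def list_rdma_devices():
    """Enumerate RDMA devices and their port state/rate/link layer from sysfs."""
    if not os.path.isdir(SYSFS_IB):
        print("No RDMA devices found (is an RDMA-capable driver loaded?)")
        return
    for dev in sorted(os.listdir(SYSFS_IB)):
        ports_dir = os.path.join(SYSFS_IB, dev, "ports")
        for port in sorted(os.listdir(ports_dir)):
            port_dir = os.path.join(ports_dir, port)
            def read(name):
                with open(os.path.join(port_dir, name)) as f:
                    return f.read().strip()
            print(f"{dev} port {port}: state={read('state')}, "
                  f"rate={read('rate')}, link_layer={read('link_layer')}")

if __name__ == "__main__":
    list_rdma_devices()
```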

2. RoCE – RDMA over Converged Ethernet

  • Definition: Implements RDMA over standard Ethernet networks.
  • Use in AI: Powers high-speed Ethernet-based GPU clusters.
  • Products: NVIDIA SuperNIC, BlueField-3, Spectrum-X Switches, Intel Ethernet 800 series, AMD Xilinx Alveo.
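
On Linux, a RoCE port shows up as an RDMA device whose link layer is Ethernet, and each populated GID index advertises whether it speaks RoCE v1 or RoCE v2. A minimal sketch, assuming a Linux host with a RoCE-capable NIC driver loaded:

```python
import glob
import os

# RoCE ports report an Ethernet link layer; each GID index advertises
# "IB/RoCE v1" or "RoCE v2" under gid_attrs/types.
for port_dir in sorted(glob.glob("/sys/class/infiniband/*/ports/*")):
    with open(os.path.join(port_dir, "link_layer")) as f:
        link_layer = f.read().strip()
    if link_layer != "Ethernet":
        continue  # native InfiniBand port, not RoCE
    dev, port = port_dir.split("/")[-3], port_dir.split("/")[-1]
    v2_gids = []
    for type_file in sorted(glob.glob(os.path.join(port_dir, "gid_attrs", "types", "*"))):
        try:
            with open(type_file) as f:
                if f.read().strip() == "RoCE v2":
                    v2_gids.append(os.path.basename(type_file))
        except OSError:
            continue  # unpopulated GID index
    print(f"{dev} port {port}: RoCE-capable, RoCE v2 GID indexes: {v2_gids or 'none'}")
```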

3. InfiniBand

  • Definition: HPC-grade networking technology providing high bandwidth and low latency.
  • Use in AI: Connects GPUs across nodes in AI supercomputers.
  • Products: NVIDIA Quantum-2, ConnectX-7 NICs, DGX SuperPOD, HPE/Cray Slingshot.
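
For rough capacity planning it helps to know the nominal rate of each InfiniBand generation. The snippet below just multiplies the commonly quoted per-lane rates by the usual four lanes; treat the results as nominal signaling rates, not measured application throughput.

```python
# Nominal per-lane data rates (Gb/s) for common InfiniBand generations.
PER_LANE_GBPS = {"SDR": 2.5, "DDR": 5, "QDR": 10, "FDR": 14.0625, "EDR": 25, "HDR": 50, "NDR": 100}

def link_rate_gbps(generation: str, lanes: int = 4) -> float:
    """Nominal aggregate rate for a link (most InfiniBand links are 4x)."""
    return PER_LANE_GBPS[generation] * lanes

for gen in PER_LANE_GBPS:
    print(f"{gen:>3} 4x link: {link_rate_gbps(gen):g} Gb/s")
```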

4. GPUDirect

  • Definition: Allows peripherals (NICs, storage) to communicate directly with GPU memory.
  • Use in AI: Reduces data copy latency.
  • Products: NVIDIA BlueField, ConnectX (GPUDirect RDMA), DGX systems, OVX, AMD Smart Access Memory, Intel Direct I/O.
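
On NVIDIA platforms, GPUDirect RDMA needs a peer-memory kernel module so the NIC can target GPU memory directly. A minimal check, assuming a Linux host with the NVIDIA driver installed (the module names checked here are the current nvidia_peermem and the older out-of-tree nv_peer_mem):

```python
def loaded_modules():
    """Return the set of kernel module names currently loaded."""
    with open("/proc/modules") as f:
        return {line.split()[0] for line in f}

mods = loaded_modules()
# GPUDirect RDMA on recent drivers is exposed through nvidia_peermem;
# older setups used the out-of-tree nv_peer_mem module.
if {"nvidia_peermem", "nv_peer_mem"} & mods:
    print("GPUDirect RDMA peer-memory module is loaded")
else:
    print("No GPUDirect RDMA peer-memory module found")
```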

5. NVLink / NVSwitch

  • Definition: Proprietary NVIDIA GPU interconnect for high-speed, high-bandwidth communication.
  • Use in AI: Enables model/data parallelism across GPUs.
  • Products: NVIDIA A100/H100, DGX H100, DGX GH200, AMD MI300X (Infinity Fabric), Intel EMIB.
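
On NVIDIA GPUs you can inspect NVLink state from the host with nvidia-smi. The sketch below simply shells out to `nvidia-smi nvlink --status`; it assumes the NVIDIA driver and the nvidia-smi utility are installed.

```python
import shutil
import subprocess

# `nvidia-smi nvlink --status` prints per-GPU NVLink link state and per-link speed.
if shutil.which("nvidia-smi") is None:
    raise SystemExit("nvidia-smi not found; is the NVIDIA driver installed?")

result = subprocess.run(
    ["nvidia-smi", "nvlink", "--status"],
    capture_output=True, text=True, check=True,
)
print(result.stdout or "No NVLink-capable GPUs reported.")
```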

6. PCIe (Peripheral Component Interconnect Express)

  • Definition: Standard I/O bus for attaching GPUs, SSDs, and DPUs to CPUs.
  • Use in AI: Provides host-to-GPU and GPU-to-storage bandwidth.
  • Products: BlueField-2 (PCIe Gen4), BlueField-3 (PCIe Gen5), AMD EPYC (PCIe Gen5), Intel Xeon Max.
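
Knowing the theoretical per-direction bandwidth of a PCIe link helps spot when host-to-GPU transfers are the bottleneck. The snippet below computes approximate figures from the raw transfer rate and 128b/130b line encoding, ignoring protocol overhead:

```python
# Per-lane raw transfer rates (GT/s) and line encoding for common PCIe generations.
PCIE_GEN = {
    3: (8.0, 128 / 130),   # Gen3: 8 GT/s, 128b/130b encoding
    4: (16.0, 128 / 130),  # Gen4: 16 GT/s
    5: (32.0, 128 / 130),  # Gen5: 32 GT/s
}

def pcie_bandwidth_gbs(gen: int, lanes: int) -> float:
    """Approximate one-direction bandwidth in GB/s, ignoring protocol overhead."""
    gts, encoding = PCIE_GEN[gen]
    return gts * encoding * lanes / 8  # bits -> bytes

for gen in PCIE_GEN:
    print(f"PCIe Gen{gen} x16 ~ {pcie_bandwidth_gbs(gen, 16):.1f} GB/s per direction")
```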

7. SR-IOV (Single Root I/O Virtualization)

  • Definition: Allows multiple VMs to share a single physical device securely.
  • Use in AI: Enables multitenancy and isolation.
  • Products: NVIDIA BlueField, ConnectX, Spectrum-X, Intel Mount Evans IPU, AWS Nitro.
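
On Linux, SR-IOV is controlled through sysfs: sriov_totalvfs reports how many virtual functions a device supports, and writing a count to sriov_numvfs instantiates them. A minimal sketch, using a hypothetical interface name (eth0); the actual write requires root privileges.

```python
import os
import sys

# Hypothetical interface name; replace with a NIC that supports SR-IOV on your host.
IFACE = "eth0"
DEV = f"/sys/class/net/{IFACE}/device"

total = os.path.join(DEV, "sriov_totalvfs")
if not os.path.exists(total):
    sys.exit(f"{IFACE}: device does not expose SR-IOV")

with open(total) as f:
    print(f"{IFACE}: supports up to {f.read().strip()} virtual functions")

# Writing a count to sriov_numvfs (as root) instantiates that many VFs, e.g.:
#   echo 4 > /sys/class/net/eth0/device/sriov_numvfs
```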

8. NVMe-oF (NVMe over Fabrics)

  • Definition: Protocol for accessing NVMe SSDs across a network fabric.
  • Use in AI: Shared fast storage for training jobs.
  • Products: BlueField SNAP, DGX Storage, Pure Storage AIRI, Intel D7-P5810, AMD EPYC NVMe-oF.
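
With nvme-cli, attaching to an NVMe-oF target is a discover-then-connect sequence. The sketch below wraps those commands; the target address and subsystem NQN are placeholders you would replace with your storage target's values.

```python
import subprocess

# Hypothetical target address and subsystem NQN; substitute your storage target's values.
TARGET_IP = "192.0.2.10"
SUBSYS_NQN = "nqn.2024-01.io.example:training-data"

# Discover subsystems exported by the target over RDMA (4420 is the default NVMe-oF port).
subprocess.run(["nvme", "discover", "-t", "rdma", "-a", TARGET_IP, "-s", "4420"], check=True)

# Connect to one subsystem; its namespaces then appear as local /dev/nvmeXnY devices.
subprocess.run(["nvme", "connect", "-t", "rdma", "-a", TARGET_IP, "-s", "4420",
                "-n", SUBSYS_NQN], check=True)
```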

9. SNAP (Software-defined Network Accelerated Processing)

  • Definition: NVIDIA’s virtual NVMe and VirtIO-blk engine on BlueField.
  • Use in AI: Disaggregated storage, edge file systems.
  • Products: BlueField-2, BlueField-3, Intel VROC, VMware vSAN on DPUs.

10. PTP / IEEE 1588v2

  • Definition: Protocol to synchronize clocks across a network with sub-microsecond precision.
  • Use in AI: Needed for time-sensitive AI training and inference.
  • Products: BlueField-3, Spectrum-X, HPE Aruba, Cisco Nexus, Broadcom Trident4.
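
Before running a PTP daemon such as ptp4l, it is worth confirming that the NIC exposes hardware timestamping and a PTP hardware clock (PHC). The sketch below just shells out to `ethtool -T`; the interface name is a placeholder.

```python
import subprocess

# Hypothetical interface name; substitute the NIC carrying your time-sensitive traffic.
IFACE = "eth0"

# `ethtool -T <iface>` reports hardware timestamping capabilities and the PTP
# hardware clock index that ptp4l/phc2sys would use for IEEE 1588 sync.
out = subprocess.run(["ethtool", "-T", IFACE], capture_output=True, text=True, check=True)
print(out.stdout)
```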

11. Time-Triggered Scheduling / PCC (Programmable Congestion Control)

  • Definition: Tools to manage packet timing and prevent congestion in AI fabrics.
  • Use in AI: Ensures fairness in multi-tenant AI clusters.
  • Products: NVIDIA SuperNIC, Spectrum-X, AWS EFA, Intel Ethernet congestion control tuning.

12. ASAP² – Accelerated Switching and Packet Processing

  • Definition: NVIDIA’s SDN offload engine for vSwitch and packet classification.
  • Use in AI: Infrastructure offload, telemetry.
  • Products: BlueField DPUs, AMD Pensando DSC, Intel Mount Evans IPU.
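
ASAP²-style acceleration pushes Open vSwitch flow processing down to the NIC/DPU, which on the host side depends on OVS hardware offload being enabled. A minimal check, assuming Open vSwitch is installed and ovs-vsctl is on the path:

```python
import subprocess

# Flows are only pushed to the NIC/DPU when other_config:hw-offload is true
# in the Open_vSwitch table; otherwise they stay in the software datapath.
result = subprocess.run(
    ["ovs-vsctl", "get", "Open_vSwitch", ".", "other_config:hw-offload"],
    capture_output=True, text=True,
)
if result.returncode == 0:
    print("hw-offload =", result.stdout.strip().strip('"'))
else:
    print("hw-offload is not set (flows stay in the software datapath)")
```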

13. DOCA – Data Center Infrastructure-on-a-Chip Architecture

  • Definition: NVIDIA's SDK for building infrastructure applications on BlueField DPUs.
  • Use in AI: Custom firewall, observability, NVMe services.
  • Products: BlueField-2, BlueField-3, SuperNIC, AMD Pensando SDK, Intel IPDK.
