DGX AI Cluster


DGX AI Cluster - CCDS IIT Kharagpur

The DGX AI Cluster at CCDS, IIT Kharagpur, is a high-performance AI computing facility for large-scale AI training, deep learning research, advanced simulations, and data-intensive analytics. It is built to meet the growing computational demands of modern scientific and engineering research.

At its core, the system is built on NVIDIA DGX H100 systems, each with eight NVIDIA H100 Tensor Core GPUs per node. Within each node, the GPUs are connected through four NVSwitches, which gives full GPU-to-GPU connectivity and high-bandwidth, low-latency communication for AI model training and inference.

To improve data throughput and scalability, the cluster was recently upgraded with a dedicated 1.1 PiB Parallel File System (PFS). The storage is connected through NVIDIA Quantum InfiniBand switches at 400 Gbps, providing high-speed, low-latency communication between the compute and storage layers. This improves performance for large datasets, distributed AI workloads, and multi-node training.

The DGX AI Cluster represents a major step forward in strengthening IIT Kharagpur’s capabilities in AI research, high-performance computing, and data-driven innovation under the Centre of Computational and Data Sciences (CCDS).


Network Architecture

[Figure: DGX AI Cluster network architecture, CCDS IIT Kharagpur]

System Architecture

Component | Specification
Master Node | Intel Xeon SKL G-6148 (40 Cores @ 2.4 GHz), 384 GB RAM, 5 TB Storage
Compute Nodes | 5 × NVIDIA DGX H100 Systems
GPU Configuration | 8 × NVIDIA H100 GPUs per node (NVSwitch Interconnect)

Recent Infrastructure Enhancements (2026)

  • 1.1 PiB Parallel File System (PFS) storage installed: high-throughput distributed storage enabling large-scale AI model training and high-speed data access across compute nodes.
  • Mellanox high-speed switch fabric deployed: ultra-low-latency, high-bandwidth networking backbone connecting compute nodes and storage infrastructure.

Cluster Access

Login Command:

ssh <username>@dgxmaster.iitkgp.ac.in

Partition Mapping:

  • dgx001 → dgx1
  • dgx002 → dgx2
  • dgx003 → dgx3
  • dgx004 → dgx4
  • dgx005 → dgx5

Storage & Data Management

  • 50 GB quota in /home/<username>
  • 1 TB quota in /dgx00x/<username>, where x is the node number (1 to 5)
  • RAID directories are mounted as /dgx00x

To change to your RAID directory:

cd /dgx00x/<username>/
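
To see how close a directory is to its quota, something like the following sketch may help. The function name usage_report is hypothetical, and the figure comes from du, which may not match exactly how the quota is accounted on the cluster (binary units are assumed here).

```shell
#!/bin/bash
# usage_report DIR QUOTA_GB
# Prints the approximate disk usage of DIR next to a quota given in GB.
# Hypothetical helper: the quota values themselves come from the list above.
usage_report() {
    local dir="$1" quota_gb="$2" used_kb
    # du -sk prints the total size in 1 KiB blocks, followed by the path.
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    if [ -z "$used_kb" ]; then
        echo "cannot read $dir" >&2
        return 1
    fi
    awk -v kb="$used_kb" -v q="$quota_gb" -v d="$dir" 'BEGIN {
        gb = kb / 1024 / 1024
        printf "%s: %.1f GB of %d GB (%.0f%%)\n", d, gb, q, 100 * gb / q
    }'
}

# Example calls (replace x with your node number):
#   usage_report "$HOME" 50
#   usage_report "/dgx00x/$USER" 1024
```

Running du over a large directory tree can take a while, so it is best used occasionally rather than inside job scripts.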

QoS Restrictions

QoS Name | Max Jobs/User | Max CPUs | Min GPUs | Max GPUs | Max Wall Time
gpu2 | 2 | 56 | 1 | 2 | 72 hours (3 days)
gpu4 | 2 | 56 | 3 | 4 | 48 hours (2 days)
gpu6 | 1 | 112 | 5 | 6 | 24 hours (1 day)
gpu8 | 1 | 112 | 7 | 8 | 12 hours
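
The table can be read as a lookup from GPU count to QoS. The sketch below encodes the GPU ranges and wall-time limits exactly as listed; the function names qos_for_gpus and max_hours_for_qos are hypothetical helpers, not cluster-provided commands.

```shell
#!/bin/bash
# Return the QoS whose Min/Max GPU range contains the requested GPU count.
qos_for_gpus() {
    case "$1" in
        1|2) echo gpu2 ;;
        3|4) echo gpu4 ;;
        5|6) echo gpu6 ;;
        7|8) echo gpu8 ;;
        *)   echo "unsupported GPU count: $1" >&2; return 1 ;;
    esac
}

# Return the maximum wall time, in hours, for a QoS.
max_hours_for_qos() {
    case "$1" in
        gpu2) echo 72 ;;
        gpu4) echo 48 ;;
        gpu6) echo 24 ;;
        gpu8) echo 12 ;;
        *)    echo "unknown QoS: $1" >&2; return 1 ;;
    esac
}

qos=$(qos_for_gpus 3)
echo "3 GPUs -> $qos, up to $(max_hours_for_qos "$qos") hours"
# prints: 3 GPUs -> gpu4, up to 48 hours
```

Note the trade-off the table implies: the more GPUs a job requests, the shorter the wall time it may run.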

User Guide


Important Job Submission Guidelines

  • Jobs must be submitted to a single node (#SBATCH --nodes=1).
  • The number of tasks per node (--ntasks-per-node) and the number of GPUs (--gres=gpu:<count>) must stay within the limits of the selected QoS.
  • Users must specify the correct QoS and partition in their job script.

Sample Submission Commands

Using the full path to a script in the home directory:

sbatch /home/<username>/train_script.sh

From the RAID directory:

cd /dgx00x/<username>
sbatch script.sh
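
The guidelines above can be combined into a job script along these lines. This is a minimal sketch, not an official template: the partition name dgx1 is an assumption taken from the partition mapping (confirm the actual names with sinfo), the CPU count and wall time are illustrative values within the gpu2 limits, and the training command is a hypothetical placeholder.

```shell
#!/bin/bash
#SBATCH --job-name=train_example
#SBATCH --nodes=1                 # jobs must run on a single node
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8         # illustrative; gpu2 allows up to 56 CPUs
#SBATCH --gres=gpu:2              # within the gpu2 range of 1 to 2 GPUs
#SBATCH --qos=gpu2
#SBATCH --partition=dgx1          # assumed partition name; check with sinfo
#SBATCH --time=24:00:00           # within the 72-hour gpu2 limit
#SBATCH --output=slurm-%j.out     # %j expands to the job ID

# Slurm normally sets CUDA_VISIBLE_DEVICES for jobs that request GPUs.
gpus="${CUDA_VISIBLE_DEVICES:-unset}"
echo "Job ${SLURM_JOB_ID:-unknown} on $(uname -n), GPUs: $gpus"

# List the allocated GPUs when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
fi

# Hypothetical workload; replace with your own command.
# python train.py
```

Save it as, for example, train_script.sh and submit it from the master node with sbatch train_script.sh.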

Common Errors

Issue: GPU count requested exceeds QoS limits.

#SBATCH --qos=gpu2
#SBATCH --gres=gpu:4   # gpu2 allows max 2 GPUs

Fix: Match GPU count with selected QoS.

Issue: Requested time exceeds allowed QoS wall time.

#SBATCH --qos=gpu4
#SBATCH --time=72:00:00   # gpu4 max is 48 hours

Fix: Reduce wall time or select appropriate QoS.
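
Both of the errors above can be caught before submission with a small check. The following is a sketch under stated assumptions: check_job_script is a hypothetical helper, it reads only #SBATCH lines of the form --qos=NAME, --gres=gpu:N and --time=VALUE, and it understands only the HH:MM:SS and D-HH:MM:SS time formats. Anything else is silently skipped rather than validated.

```shell
#!/bin/bash
# check_job_script FILE
# Compares --qos, --gres=gpu:N and --time in FILE against the QoS limits.
# Returns 0 when no problem is found, 1 otherwise.
check_job_script() {
    local file="$1" qos gpus time min max hours total status=0

    qos=$(sed -n 's/^#SBATCH[[:space:]]*--qos=\([^[:space:]]*\).*/\1/p' "$file" | head -n 1)
    gpus=$(sed -n 's/^#SBATCH[[:space:]]*--gres=gpu:\([0-9]*\).*/\1/p' "$file" | head -n 1)
    time=$(sed -n 's/^#SBATCH[[:space:]]*--time=\([^[:space:]]*\).*/\1/p' "$file" | head -n 1)

    # Limits copied from the QoS table.
    case "$qos" in
        gpu2) min=1; max=2; hours=72 ;;
        gpu4) min=3; max=4; hours=48 ;;
        gpu6) min=5; max=6; hours=24 ;;
        gpu8) min=7; max=8; hours=12 ;;
        *)    echo "unknown or missing QoS: '$qos'"; return 1 ;;
    esac

    if [ -n "$gpus" ] && { [ "$gpus" -lt "$min" ] || [ "$gpus" -gt "$max" ]; }; then
        echo "GPU count $gpus is outside the $qos range ($min-$max)"
        status=1
    fi

    # Convert the wall time to seconds; other formats yield an empty result.
    total=$(printf '%s\n' "$time" | awk '
        /^[0-9]+:[0-9]+:[0-9]+$/ {
            split($0, a, ":"); print a[1] * 3600 + a[2] * 60 + a[3] }
        /^[0-9]+-[0-9]+:[0-9]+:[0-9]+$/ {
            split($0, a, /[-:]/); print (a[1] * 24 + a[2]) * 3600 + a[3] * 60 + a[4] }')
    if [ -n "$total" ] && [ "$total" -gt $(( hours * 3600 )) ]; then
        echo "wall time $time exceeds the $qos limit of $hours hours"
        status=1
    fi

    return "$status"
}

# Example: check_job_script train_script.sh && sbatch train_script.sh
```

Because the check runs locally, it gives feedback without waiting for the scheduler to reject or hold the job.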

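Issue: Job submitted from a compute node instead of the master node.

sbatch /raid/<username>/train_script.sh   # Incorrect: run on a compute node

Fix: Log in to the master node and submit from there.

cd ~
sbatch train_script.sh   # Submit from the home directory on the master node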

Monitoring Tools

  • sinfo – View partitions
  • squeue – Monitor jobs
  • df -h – Check storage mounts
  • module avail – View available modules
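
The tools above can be wrapped into a quick status check. This is only a sketch: cluster_status is a hypothetical function name, and each Slurm command is skipped when it is not installed, so the function is harmless off the cluster. module avail is left out because module is usually a shell function provided by the login environment.

```shell
#!/bin/bash
# Print partitions, the current user's jobs, and home file system usage.
cluster_status() {
    local user="${USER:-$(id -un)}"
    if command -v sinfo >/dev/null 2>&1; then
        sinfo                   # partitions and node states
    fi
    if command -v squeue >/dev/null 2>&1; then
        squeue -u "$user"       # only this user's jobs
    fi
    df -h "${HOME:-/}"          # file system that holds the home directory
}

# Example: cluster_status
```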