Durham Intelligent NIC Environment (DINE)
The Durham Intelligent NIC Environment (DINE) supercomputer is a small 24-node development cluster equipped with NVIDIA BlueField-2 Data Processing Units (DPUs) on a non-blocking HDR200 fabric. These DPUs enable direct access to remote memory, improving the performance of massively parallel codes in preparation for future exascale systems, and provide researchers with a test-bed facility for the development of novel computing paradigms.
The cost of data movement – in both runtime and energy – is predicted to be a major showstopper on the road to exascale. As the computers driving data centres, supercomputers and machine-learning farms become faster, their interconnects, i.e. their communication devices, become a limiting factor; worse, they must also cope with growing unreliability at scale. One way to improve them is to make them smart: to let them learn how to route data flows, how to meet security constraints, or even to deploy computations into the network itself. Smart network devices can take ownership of data movement, bring data into the right format before it is delivered, take care of security and resiliency, and so forth.
A key feature of DINE is its NVIDIA BlueField SmartNIC cards, which provide a programmable network offload capability, allowing network functions to be accelerated and freeing up compute cores for other tasks.
DINE comprises 24 nodes, each containing:
- Dual 16-core AMD EPYC 7302 (Rome) processors (3 GHz)
- 512 GB RAM
- BlueField-2 SmartNIC (200 Gbit/s HDR200)
  - Each card contains 16 GB RAM and 8 high-clock Arm cores, and runs Ubuntu 20.04
- NVIDIA HDR200 InfiniBand switch
The system runs in “Host Separated” mode, meaning that the BlueFields (BFs) can be treated as servers in their own right (running Ubuntu), and MPI jobs can run on both host and device. The host nodes are called b[101-124], while the corresponding BlueField cards are called bluefield[101-124].
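Since the BlueField cards appear as ordinary servers, jobs can be submitted through SLURM in the usual way. The following batch script is a minimal sketch only: the partition name `bluefield1`, the node count, and the binary `my_mpi_program` are assumptions for illustration and should be checked against the local SLURM configuration (e.g. with `sinfo`).

```shell
#!/bin/bash
# Hypothetical DINE batch script. The partition name below is an
# assumption; query `sinfo` for the partitions actually configured.
#SBATCH --job-name=dine-test
#SBATCH --partition=bluefield1   # assumed partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00

# Launch the MPI ranks assigned by SLURM. Because the system runs in
# Host Separated mode, ranks can also be placed on the BlueField cards
# themselves (bluefield[101-124]) via a hostfile, e.g.:
#   mpirun -np 2 --hostfile hosts.txt ./my_mpi_program
mpirun ./my_mpi_program
```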
Executables created on COSMA 8's login nodes should work as compiled on the BlueField. However, it might be advantageous to compile on the cluster itself. For this, please ssh to bluefield118, which we typically use as a login node.
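A minimal compile session might look like the following; the compiler invocation is illustrative (`hello.c` is a placeholder source file), and the compilers actually available on the cards may differ.

```shell
# Hop onto the BlueField card typically used as a login node
ssh bluefield118

# Compile natively on the card's Arm cores. gcc is assumed to be
# available here -- verify with `which gcc` or the local module system.
gcc -O2 -o hello hello.c
./hello
```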
DINE is operated in a power-efficient way: whenever nodes have been idle for a while, the cluster shuts them down. They are automatically rebooted whenever SLURM requests them. This reboot can take a few minutes, so nodes may be shown as down even though they are, in principle, available.
Yet to be written: which compilers are used.
Funding and acknowledgement
This work has used Durham University's DINE cluster. DINE was purchased through Durham University's Research Capital Equipment Fund 19_20 allocation, led by the Department of Computer Science. It is installed in collaboration with, and as an addendum to, the DiRAC@Durham facility, managed by the Institute for Computational Cosmology on behalf of the STFC DiRAC HPC Facility (www.dirac.ac.uk). DiRAC equipment was funded by BEIS capital funding via STFC capital grants ST/P002293/1, ST/R002371/1 and ST/S002502/1, Durham University and STFC operations grant ST/R000832/1. DiRAC is part of the National e-Infrastructure. The project received funding through the ExCALIBUR Hardware and Enabling Software Programme (H&ES).