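A CUDA initialization error can usually be fixed just by updating the GPU driver. The DGX series is different: to speed up GPU-to-GPU transfers, these machines ship from the factory with NVSwitch installed.
I kept hitting a cuInit error without knowing why, and only after a long search found that NVSwitch was disabled because the nvidia-fabricmanager service was not running. Fixing it only takes the commands below.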
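Before running anything, first check whether the GPU driver version and the NVSwitch Fabric Manager version match. The error message in the service status tells you directly: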
sudo systemctl status nvidia-fabricmanager
You may see an error similar to the following:
× nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/etc/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2025-06-15 22:57:30 PDT; 19min ago
CPU: 93ms
Jun 15 22:57:30 DGX-H200 systemd[1]: Starting NVIDIA fabric manager service...
Jun 15 22:57:30 DGX-H200 nvidia-fabricmanager-start.sh[16241]: Detected Pre-NVL5 system
Jun 15 22:57:30 DGX-H200 nv-fabricmanager[16258]: fabric manager NVIDIA GPU driver interface version 570.133.20 don't match with driver version 575.57.08. Please update with ma>
Jun 15 22:57:30 DGX-H200 nvidia-fabricmanager-start.sh[16258]: fabric manager NVIDIA GPU driver interface version 570.133.20 don't match with driver version 575.57.08. Please u>
Jun 15 22:57:30 DGX-H200 nvidia-fabricmanager-start.sh[16241]: "/usr/bin/nv-fabricmanager" failed! Exit code: 1
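The log shows the cause directly: the GPU driver is 575.57.08, but Fabric Manager (the NVSwitch side) is still at 570.133.20. Reinstall Fabric Manager for the matching driver branch (remember to change the version number to your own):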
sudo apt-get install nvidia-fabricmanager-575
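If you would rather not read the versions off the log by eye, the package name can be derived instead. This is only a minimal sketch and not part of the original steps: parse_fm_mismatch and fm_package_for_driver are hypothetical helper names, and the sed pattern assumes the exact message wording shown in the log above, which may differ in other Fabric Manager releases.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any NVIDIA tooling.

# Read log text on stdin and print "<fabric-manager-version> <driver-version>"
# for the first mismatch line found. Assumes the message wording from the log above.
parse_fm_mismatch() {
    sed -nE "s/.*interface version ([0-9]+(\.[0-9]+)*) don't match with driver version ([0-9]+(\.[0-9]+)*).*/\1 \3/p" | head -n 1
}

# Map a full driver version (e.g. 575.57.08) to the branch package name
# used in the install command above (nvidia-fabricmanager-575).
fm_package_for_driver() {
    printf 'nvidia-fabricmanager-%s\n' "${1%%.*}"
}

# Example usage on the affected machine (left commented out on purpose):
#   journalctl -u nvidia-fabricmanager -b --no-pager | parse_fm_mismatch
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   sudo apt-get install "$(fm_package_for_driver "$drv")"
```

The mismatch message suggests Fabric Manager has to match the running driver version, not just the branch. If the newest package in the branch turns out to be a different version from the driver, apt-cache madison nvidia-fabricmanager-575 lists the versions that are available to install.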
Next, start the fabricmanager service and make it start by default:
sudo systemctl start nvidia-fabricmanager    # start Fabric Manager, which turns NVSwitch on
sudo systemctl enable nvidia-fabricmanager   # start it automatically at boot
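To confirm the service really came up, a small check like the one below may help. It is a sketch rather than an official procedure: fm_status_summary is a hypothetical helper name, and it relies only on systemctl is-active, which exits with status 0 when the unit is active.

```shell
#!/bin/sh
# Sketch only: fm_status_summary is a hypothetical helper name.

# Print a one-line summary of the Fabric Manager service state.
# Returns 0 when the unit is active, 1 otherwise.
fm_status_summary() {
    if systemctl is-active --quiet nvidia-fabricmanager; then
        echo "nvidia-fabricmanager: active"
    else
        echo "nvidia-fabricmanager: NOT active, check 'systemctl status nvidia-fabricmanager'"
        return 1
    fi
}

# Example usage (left commented out on purpose):
#   fm_status_summary && echo "now rerun the CUDA program that failed"
```

If the summary reports active but cuInit still fails, look at the status output again for a remaining version mismatch before trying anything else.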
Ref:
https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf