Monday, June 16, 2025

DGX H200 driver upgrade from 570 to 575: system hang

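After updating the GPU driver, the NVSwitch driver also has to be updated to the matching version. For some reason the configuration got changed along the way too, so the NVSwitch config file location ended up wrong. A few fixes are needed:

1. Change where the startup script picks up its config. The service was originally set to Type=forking, but 575 appears to have changed the startup script (it now includes a sleep), so the service type has to be changed to simple as well.

2. Reload the config.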
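Before editing anything, it can help to confirm what the unit file currently says. The snippet below is a minimal sketch, not part of the NVIDIA tooling: unit_value is a hypothetical helper that prints the last uncommented value of a key in a unit file.

```shell
#!/bin/sh
# unit_value: hypothetical helper, not part of systemd or the NVIDIA packages.
# Prints the last active (uncommented) value of a key in a systemd unit file.
#   $1 = path to the unit file
#   $2 = key name, e.g. Type or ExecStart
unit_value() {
    # Lines starting with "#" do not match "^KEY=", so comments are skipped.
    sed -n "s/^$2=//p" "$1" | tail -n 1
}
```

For example, `unit_value /etc/systemd/system/nvidia-fabricmanager.service Type` should print `simple` once the change described here is in place. That path is where `systemctl edit --full` saves its copy; `systemctl cat nvidia-fabricmanager.service` shows which file is actually in use.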


Edit the unit file:

sudo systemctl edit --full nvidia-fabricmanager.service

Below is the unit file. The original lines are kept as comments and otherwise left untouched:


====

[Unit]
Description=NVIDIA fabric manager service
After=network-online.target
Requires=network-online.target

[Service]
User=root
PrivateTmp=false
#Type=forking
Type=simple
TimeoutStartSec=720

Environment="FM_CONFIG_FILE=/usr/share/nvidia/nvswitch/fabricmanager.cfg"
Environment="FM_PID_FILE=/var/run/nvidia-fabricmanager/nv-fabricmanager.pid"
Environment="NVLSM_CONFIG_FILE=/usr/share/nvidia/nvlsm/nvlsm.conf"
Environment="NVLSM_PID_FILE=/var/run/nvidia-fabricmanager/nvlsm.pid"

PIDFile=/var/run/nvidia-fabricmanager/nv-fabricmanager.pid

#ExecStart=/usr/bin/nvidia-fabricmanager-start.sh $FM_CONFIG_FILE $FM_PID_FILE $NVLSM_CONFIG_FILE $NVLSM_PID_FILE
ExecStart=/usr/bin/nvidia-fabricmanager-start.sh --fm-config-file /usr/share/nvidia/nvswitch/fabricmanager.cfg --fm-pid-file $FM_PID_FILE --nvlsm-config-file $NVLSM_CONFIG_FILE --nvlsm-pid-file $NVLSM_PID_FILE
ExecStop=/bin/sh -c '\
  sed -i "/^FM_SM_MGMT_PORT_GUID=0x[a-fA-F0-9]\\+$/d" "$FM_CONFIG_FILE"; \
  if [ -f "$NVLSM_CONFIG_FILE" ]; then \
    sed -i "/^guid 0x[a-fA-F0-9]\\+$/d" "$NVLSM_CONFIG_FILE"; \
  fi; \
  if [ -f "$FM_PID_FILE" ] && [ -s "$FM_PID_FILE" ]; then \
    kill "$(cat "$FM_PID_FILE")"; \
  fi; \
  if [ -f "$NVLSM_PID_FILE" ] && [ -s "$NVLSM_PID_FILE" ]; then \
    kill "$(cat "$NVLSM_PID_FILE")"; \
  fi'
LimitCORE=infinity

[Install]
WantedBy=multi-user.target


====
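Since the original problem was a config file path that no longer matched, it may be worth checking that every config path named in the unit file really exists. This is a rough sketch under my own assumptions: check_config_paths is a hypothetical helper, and it only looks at Environment= lines whose variable name ends in _CONFIG_FILE, as in the unit file above.

```shell
#!/bin/sh
# check_config_paths: hypothetical helper, not part of the NVIDIA packages.
# Reads a unit file, extracts the paths from lines of the form
#   Environment="SOMETHING_CONFIG_FILE=/path/to/file"
# and reports whether each path exists as a regular file.
#   $1 = path to the unit file
check_config_paths() {
    sed -n 's/^Environment="[A-Z_]*_CONFIG_FILE=\(.*\)"[[:space:]]*$/\1/p' "$1" |
    while IFS= read -r path; do
        if [ -f "$path" ]; then
            echo "ok      $path"
        else
            echo "missing $path"
        fi
    done
}
```

Running `check_config_paths /etc/systemd/system/nvidia-fabricmanager.service` prints one line per config file. Any line starting with `missing` points at a path that needs fixing before the service is started.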


After editing, reload the unit files and start the service:

sudo systemctl daemon-reload

sudo systemctl start nvidia-fabricmanager.service


Finally, check whether anything went wrong:

systemctl status nvidia-fabricmanager.service

journalctl -u nvidia-fabricmanager.service -b -n 50 --no-pager
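When the journal is long, a small filter makes problems easier to spot. The helper below is only a sketch: fm_log_errors is a hypothetical name, and matching on "error" and "fail" is my own guess at what a problem looks like, not a documented Fabric Manager log format.

```shell
#!/bin/sh
# fm_log_errors: hypothetical helper, not part of the NVIDIA packages.
# Reads log text on stdin and prints only the lines that mention
# "error" or "fail" (case-insensitive). Prints nothing when the log is clean.
fm_log_errors() {
    # "|| true" keeps the exit status at 0 when grep finds no match.
    grep -iE 'error|fail' || true
}
```

Usage: `journalctl -u nvidia-fabricmanager.service -b --no-pager | fm_log_errors`. Empty output together with `systemctl is-active nvidia-fabricmanager.service` printing `active` is a good sign, though the full log is still worth a look.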


