Logs for troubleshooting

This page explains which logs and context to provide when opening a support ticket for NVIDIA GPU infrastructure hosted by Nscale, without using the Nscale Log Gatherer package. The standard manual collection is:

The NVDebug Redfish/BMC support bundle, including generated files and reports, plus host logs when available and the NVDebug console log.
NVIDIA nvidia-bug-report.sh output.

Run commands with sudo or as root where possible so kernel, driver, hardware, and service logs are complete.

Before you collect logs

Collect logs as close to the failure time as possible. If the host is still running, collect logs before rebooting unless there is an operational reason to recover the system first. Include this context with every support request:

Field	What to include
Site or region	The affected Nscale site, region, or deployment identifier.
Asset	Hostname, node name, switch name, rack ID, or serial number.
Component	GPU, NVLink, NIC/HCA, NIC/DPU, BMC, CPU, FAN, PSU, HDD/SSD, System Board, Power Cycle, Chassis, CDU, or Other/Unknown.
Failure time	Approximate start time, end time if known, and timezone.
Symptoms	Error messages, alerts, Xid/SXid values, missing devices, link state, throttling, power event, or workload behavior.
Impact	Whether the issue affects one GPU, one node, one rack, multiple racks, a cluster, or customer workloads.
Current state	Whether the device is online, drained, rebooted, powered off, replaced, or ready to be worked on.
Actions taken	Reseats, reboots, power cycles, counter clears, workload moves, cable cleanings, replacements, or other remediation already attempted.

Severity guide

Use the severity that reflects actual business or service impact:

Severity	Use when
Critical	The customer environment is inaccessible or reserved capacity cannot reasonably be consumed.
High	The environment is accessible, but important functionality is unavailable and there is no acceptable workaround.
Medium	Functionality is impacted, but a workaround exists or the customer can continue using the environment.
Low	There is a minor issue, question, or no material production impact.

Dependencies

For the manual path:

Linux on the target NVIDIA GPU host or on a remote collection host that can reach the BMC and, when needed, the host OS.
sudo or root access is strongly recommended.
NVIDIA driver tooling, including nvidia-smi and ideally nvidia-bug-report.sh.
Python 3.12 and Python venv support for NVDebug.
curl or another approved method to download from GitHub.
unzip, tar, and gzip.
ipmitool for NVDebug IPMI-over-LAN collection.
BMC network access and credentials for the standard Nscale NVDebug bundle.

On Ubuntu, the usual setup is:

bash <<'EOF'
set -euo pipefail
sudo apt-get update
sudo apt-get install -y python3.12 python3.12-venv python3-pip curl unzip tar gzip ipmitool
EOF

Do not install extra diagnostic tools unless Nscale Support specifically asks for them.

Create a working directory

Copy and paste this as one block. It creates the working directory and stores the path so later copy/paste blocks can find it.

bash <<'EOF'
set -euo pipefail
LOGDIR=/tmp/nscale_manual_logs_$(hostname -s)_$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "$LOGDIR"
printf '%s\n' "$LOGDIR" > /tmp/nscale_manual_logdir.path
echo "Log directory: $LOGDIR"
EOF

Collect the NVDebug support bundle

NVDebug is NVIDIA’s diagnostic collection tool for NVIDIA server platforms. The downloadable GitHub project is named open-nvdebug; this guide refers to the tool as NVDebug. For the full vendor reference, see the NVIDIA NVDebug User Guide. For Nscale support cases, the important artifact is the generated NVDebug support bundle, not only the terminal output. The expected bundle contains both host-side logs and BMC/Redfish telemetry. Do not submit host-only or BMC-only collection unless absolutely necessary or Nscale Support explicitly asks for it. Why this guide asks for BMC details: NVDebug can run locally without BMC credentials, but that is a reduced host-local collection. NVIDIA documents that BMC-related collectors are skipped in that mode. For normal Nscale support, provide the BMC IP or hostname and credentials so the bundle includes Redfish/BMC and IPMI evidence. In most environments, the same BMC account is used for Redfish and IPMI access. Install ipmitool on the machine where NVDebug runs. In most cases, this is the affected host. Do not include this package or binary in the final support archive; it is only a runtime prerequisite for NVDebug’s IPMI-over-LAN collectors. sshpass is not required for the normal affected-host workflow in this guide. Nscale Support may ask for it only for a specific SSH-based collection path, such as some jumphost, BMC SSH, HMC SSH, proxy, or tunnel workflows. In most cases, run the next two commands on the affected host. The first command downloads NVDebug from the open-nvdebug GitHub repository and prepares it in a local Python virtual environment. The requirements.txt file is included in the downloaded open-nvdebug package and installs NVDebug’s Python dependencies. This guide uses nvdebug-v2.1.0-release; use a different release only if Nscale Support asks you to.

Download NVDebug and set up the environment

bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
NVDEBUG_TAG=nvdebug-v2.1.0-release
curl -L -o open-nvdebug.zip "https://github.com/NVIDIA/open-nvdebug/archive/refs/tags/${NVDEBUG_TAG}.zip"
unzip open-nvdebug.zip
cd "open-nvdebug-${NVDEBUG_TAG}"

python3.12 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements.txt
python -m src.tool.main --version > "$LOGDIR/nvdebug-version.txt" 2>&1
printf '%s\n' "$PWD" > /tmp/nscale_manual_nvdebug_dir.path
echo "NVDebug prepared under: $PWD"
EOF

Run NVDebug and generate the support bundle

Run this block on the affected host. The affected host must be able to reach its BMC. This is the preferred collection path for most customers because NVDebug collects host logs locally and BMC/Redfish telemetry through the BMC details you enter. Do not paste passwords directly into the command line. The block below prompts for the required BMC/Redfish/IPMI details, creates the temporary NVDebug configuration automatically, runs NVDebug, and removes the temporary config file afterward.

bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
NVDEBUG_DIR=$(cat /tmp/nscale_manual_nvdebug_dir.path)
DUT_CONFIG="$LOGDIR/nvdebug-local-host-dut_config.json"
: > "$DUT_CONFIG"
chmod 600 "$DUT_CONFIG"

"$NVDEBUG_DIR/.venv/bin/python" - "$DUT_CONFIG" <<'PYCONFIG'
import getpass
import json
import sys
from pathlib import Path

bmc_ip = input("BMC IP or hostname: ").strip()
bmc_user = input("BMC username: ").strip()
bmc_pass = getpass.getpass("BMC password: ")
baseboard = input("Baseboard override (optional, press Enter for auto-detect): ").strip()

config = {
    "dut-1": {
        "local": True,
        "baseboard": baseboard,
        "BMC_IP": bmc_ip,
        "BMC_USERNAME": bmc_user,
        "BMC_PASSWORD": bmc_pass,
    }
}

Path(sys.argv[1]).write_text(json.dumps(config, indent=2))
PYCONFIG

cd "$NVDEBUG_DIR"
trap 'rm -f "$DUT_CONFIG"' EXIT
sudo "$PWD/.venv/bin/python" -m src.tool.main collect --dut-config "$DUT_CONFIG" -o "$LOGDIR/nvdebug-host-bmc" > "$LOGDIR/nvdebug-host-bmc.stdout.txt" 2>&1
echo "NVDebug generated files: $LOGDIR/nvdebug-host-bmc"
echo "NVDebug console log: $LOGDIR/nvdebug-host-bmc.stdout.txt"
EOF

Include the full generated NVDebug output directory and its matching console log: nvdebug-host-bmc/ plus nvdebug-host-bmc.stdout.txt, or nvdebug-jumphost/ plus nvdebug-jumphost.stdout.txt if you used the jumphost block. If NVDebug cannot collect both host logs and BMC/Redfish telemetry, include any nvdebug*.stdout.txt or Python error output and tell Nscale Support which access path failed.

Collect NVIDIA bug report

bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
(cd "$LOGDIR" && sudo nvidia-bug-report.sh) > "$LOGDIR/nvidia-bug-report.stdout.txt" 2>&1
echo "NVIDIA bug report output saved under: $LOGDIR"
EOF

If nvidia-bug-report.sh creates nvidia-bug-report.log.gz, leave it in $LOGDIR so it is included in the final archive.

Add failure-specific logs

Failure type	Expected logs
GPU missing from `nvidia-smi`, GPU fallen off bus, driver error, Xid/SXid/NVRM, PCIe or enumeration issue	Standard manual collection. Include affected GPU index/UUID/serial, Xid/SXid if known, and the failure timestamp.
GPU ECC, SRAM/DRAM, row remapper, page retirement, memory error, or suspected GPU memory fault	Standard manual collection. Include affected GPU index/UUID/serial, observed ECC or row-remapper alert, and workload impact.
NVLink, NVSwitch, Fabric Manager, IMEX, NVLSM, OpenSM, Fabric state, training, Xid 145, or Xid 149 issue	Standard manual collection from each affected compute host. For NVL systems, include switch-side NVOS tech-support or NVSwitch logs when available or requested.
Thermal, fan, power, throttling, PSU, or voltage issue	Standard manual collection. Include the NVDebug Redfish/BMC bundle when BMC access is available, plus facility alarms, sensor exports, or photos/screenshots when useful.
Unexpected reboot, node power loss, power cycle, or suspected BMC event	Standard manual collection after the host is back online. Include the NVDebug Redfish/BMC bundle when BMC access is available. Include exact reboot or outage time if known.
NVMe, disk I/O, filesystem, Linux MD RAID, or hardware RAID issue	Standard manual collection. Include the affected device path, mount point, or volume if known. Nscale Support may request storage-specific command output after initial triage.
System memory, CPU, MCE, EDAC, RAS, OOM, or host-level thermal issue	Standard manual collection. Include any visible machine-check, EDAC, OOM, DIMM slot, CPU socket, or sensor alert details.
Ethernet switch device issue	Device logs, CLI command output, interface state and error counters, environmental or hardware alerts, and screenshots when available. Include switch asset, site, component, current state, and impact.
Ethernet link issue	Endpoint A and Port A, Endpoint B and Port B if known, plus command output for both sides where available: `net show interface <swp interface number> detail`, `net show interface pluggables`, and `sudo l1-show <swp interface number>`. Include link state, counters, cleaning or optics replacement attempts, and impact.
InfiniBand switch device issue	Device logs, UFM or fabric health evidence, port state and error counters, environmental or hardware alerts, and screenshots when available. Include switch asset, site, component, current state, and impact.
InfiniBand link issue	Endpoint A and Port A, Endpoint B and Port B if known, UFM evidence, and `mlxlink -d lid <lid> -p <physical_port_no> -c` output. Note whether the port was isolated in UFM and whether `perfquery -x <lid> <logical_port>` was run as part of your normal process.
Rack issue	Rack ID, affected rack unit/side/bay/PDU, photos or screenshots, related device logs, power or environmental evidence, and any impacted host logs.
Facility, cooling, rPDU, or widespread power issue	Site, affected racks/rows/clusters/devices, start time, current state, alarm screenshots, rPDU or facility exports, photos when useful, and sample host logs from impacted nodes.

Targeted add-on commands

Run these only when the issue type matches or Nscale Support asks for them. Skip commands that are not available on the host and note that they were unavailable in the support request.

GPU missing, Xid/SXid, PCIe, NVLink, or fabric

bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
nvidia-smi -q > nvidia-smi-q.txt 2>&1
nvidia-smi topo -m > nvidia-smi-topo-m.txt 2>&1
nvidia-smi nvlink -s > nvidia-smi-nvlink-s.txt 2>&1
nvidia-smi nvlink --errorcounters > nvidia-smi-nvlink-errorcounters.txt 2>&1
nvidia-smi --query-gpu=index,name,uuid,serial,pci.bus_id,temperature.gpu,memory.total,memory.used,memory.free,power.draw --format=csv > nvidia-smi-query-gpu-health.csv 2>&1
nvidia-smi --query-gpu=index,serial,uuid,pci.bus_id,fabric.state,fabric.status --format=csv > nvidia-smi-query-fabric-status.csv 2>&1
sudo dmesg -T | grep -Ei 'Xid|SXid|NVRM|fallen off|PCIe|AER|nvidia|nvswitch|nvlink|fabric' > dmesg-gpu-fabric-filtered.txt 2>&1
EOF

Active diagnostics: Run dcgmi diag -r 4 only when Nscale Support asks for it or when the host is available for a long-running active diagnostic.

bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
sudo dcgmi diag -r 4 > dcgmi-diag-r4.txt 2>&1
EOF

GPU ECC, memory, row remapper, or page retirement

bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
nvidia-smi -q -d ECC > nvidia-smi-q-d-ecc.txt 2>&1
nvidia-smi -q -d MEMORY > nvidia-smi-q-d-memory.txt 2>&1
nvidia-smi -q -d ROW_REMAPPER > nvidia-smi-q-d-row-remapper.txt 2>&1
nvidia-smi -q -d PAGE_RETIREMENT > nvidia-smi-q-d-page-retirement.txt 2>&1
sudo dmesg -T | grep -Ei 'ECC|row|remap|retire|SRAM|DRAM|memory error|Xid|NVRM|MCE|EDAC|RAS' > dmesg-memory-ecc-filtered.txt 2>&1
EOF

Unexpected reboot, power cycle, or host reset

bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
sudo journalctl --list-boots --no-pager > journalctl-list-boots.txt 2>&1
last -x reboot shutdown -F > last-reboot-shutdown.txt 2>&1
who -b > who-b.txt 2>&1
uptime -s > uptime-s.txt 2>&1
EOF

NVMe, disk, filesystem, or RAID

bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
df -hT > df-hT.txt 2>&1
lsblk -a -o NAME,KNAME,PATH,SIZE,TYPE,FSTYPE,MOUNTPOINTS,MODEL,SERIAL,WWN,VENDOR,STATE,TRAN > lsblk.txt 2>&1
cat /proc/mdstat > proc-mdstat.txt 2>&1
nvme list > nvme-list.txt 2>&1
for dev in /dev/nvme*n1; do [ -e "$dev" ] || continue; sudo nvme smart-log "$dev"; done > nvme-smart-log.txt 2>&1
for dev in /dev/nvme*n1; do [ -e "$dev" ] || continue; sudo nvme error-log "$dev"; done > nvme-error-log.txt 2>&1
sudo dmesg -T | grep -Ei 'I/O error|blk_update_request|buffer I/O|medium error|critical warning|nvme|md/|raid|scsi|sas|sata|smart|xfs|ext4|btrfs|reset|timeout' > dmesg-storage-filtered.txt 2>&1
EOF

System memory, CPU, MCE, EDAC, RAS, OOM, or host thermal

bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
free -h > free-h.txt 2>&1
cat /proc/meminfo > proc-meminfo.txt 2>&1
lscpu > lscpu.txt 2>&1
sudo dmidecode -t memory > dmidecode-memory.txt 2>&1
sensors > sensors.txt 2>&1
sudo dmesg -T | grep -Ei 'MCE|EDAC|RAS|ECC|memory error|hardware error|machine check|out of memory|oom|thermal|overheat|temperature|throttl' > dmesg-memory-ras-thermal-filtered.txt 2>&1
EOF

Create the archive

bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
ARCHIVE=/tmp/$(basename "$LOGDIR").tar.gz
sudo tar -czf "$ARCHIVE" -C /tmp "$(basename "$LOGDIR")"
sudo chown "$(id -u):$(id -g)" "$ARCHIVE" 2>/dev/null || true
echo "$ARCHIVE"
EOF

Attach the .tar.gz archive to tickets created in Jira with the context listed at the top of this page.

Alternative

Customers who approve the use of the packaged automated collector can use the Nscale Log Gatherer to collect the same diagnostic information in a single support bundle. Because the Nscale Log Gatherer is provided as a Python script, you can easily review its contents to verify exactly what data is collected before executing it.

Data notice

Support bundles may include hostnames, IP addresses, serial numbers, kernel logs, package versions, NVIDIA diagnostic output, BMC details, storage health data, and other operational metadata. Review the archive before sharing it if your environment requires data screening.

​Before you collect logs

​Severity guide

​Dependencies

​Create a working directory

​Collect the NVDebug support bundle

​Download NVDebug and set up the environment

​Run NVDebug and generate the support bundle

​Collect NVIDIA bug report

​Add failure-specific logs

​Targeted add-on commands

​GPU missing, Xid/SXid, PCIe, NVLink, or fabric

​GPU ECC, memory, row remapper, or page retirement

​Unexpected reboot, power cycle, or host reset

​NVMe, disk, filesystem, or RAID

​System memory, CPU, MCE, EDAC, RAS, OOM, or host thermal

​Create the archive

​Alternative

​Data notice