> ## Documentation Index
> Fetch the complete documentation index at: https://docs.nscale.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Logs for troubleshooting

> Manual log collection expectations for NVIDIA GPU infrastructure support cases.

This page explains which logs and context to provide when opening a support ticket for NVIDIA GPU infrastructure hosted by Nscale, without using the Nscale Log Gatherer package.

The standard manual collection is:

* The NVDebug Redfish/BMC support bundle, including generated files and reports, plus host logs when available and the NVDebug console log.
* NVIDIA `nvidia-bug-report.sh` output.

Run commands with `sudo` or as root where possible so kernel, driver, hardware, and service logs are complete.

## Before you collect logs

Collect logs as close to the failure time as possible. If the host is still running, collect logs before rebooting unless there is an operational reason to recover the system first.

Include this context with every support request:

| Field          | What to include                                                                                                                        |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| Site or region | The affected Nscale site, region, or deployment identifier.                                                                            |
| Asset          | Hostname, node name, switch name, rack ID, or serial number.                                                                           |
| Component      | GPU, NVLink, NIC/HCA, NIC/DPU, BMC, CPU, FAN, PSU, HDD/SSD, System Board, Power Cycle, Chassis, CDU, or Other/Unknown.                 |
| Failure time   | Approximate start time, end time if known, and timezone.                                                                               |
| Symptoms       | Error messages, alerts, Xid/SXid values, missing devices, link state, throttling, power event, or workload behavior.                   |
| Impact         | Whether the issue affects one GPU, one node, one rack, multiple racks, a cluster, or customer workloads.                               |
| Current state  | Whether the device is online, drained, rebooted, powered off, replaced, or ready to be worked on.                                      |
| Actions taken  | Reseats, reboots, power cycles, counter clears, workload moves, cable cleanings, replacements, or other remediation already attempted. |

## Severity guide

Use the severity that reflects actual business or service impact:

| Severity | Use when                                                                                                         |
| -------- | ---------------------------------------------------------------------------------------------------------------- |
| Critical | The customer environment is inaccessible or reserved capacity cannot reasonably be consumed.                     |
| High     | The environment is accessible, but important functionality is unavailable and there is no acceptable workaround. |
| Medium   | Functionality is impacted, but a workaround exists or the customer can continue using the environment.           |
| Low      | There is a minor issue, question, or no material production impact.                                              |

## Dependencies

For the manual path:

* Linux on the target NVIDIA GPU host or on a remote collection host that can reach the BMC and, when needed, the host OS.
* `sudo` or root access is strongly recommended.
* NVIDIA driver tooling, including `nvidia-smi` and ideally `nvidia-bug-report.sh`.
* Python 3.12 and Python venv support for NVDebug.
* `curl` or another approved method to download from GitHub.
* `unzip`, `tar`, and `gzip`.
* `ipmitool` for NVDebug IPMI-over-LAN collection.
* BMC network access and credentials for the standard Nscale NVDebug bundle.

On Ubuntu, the usual setup is:

```bash theme={null}
bash <<'EOF'
set -euo pipefail
sudo apt-get update
sudo apt-get install -y python3.12 python3.12-venv python3-pip curl unzip tar gzip ipmitool
EOF
```

Do not install extra diagnostic tools unless Nscale Support specifically asks for them.

## Create a working directory

Copy and paste this as one block. It creates the working directory and stores the path so later copy/paste blocks can find it.

```bash theme={null}
bash <<'EOF'
set -euo pipefail
LOGDIR=/tmp/nscale_manual_logs_$(hostname -s)_$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "$LOGDIR"
printf '%s\n' "$LOGDIR" > /tmp/nscale_manual_logdir.path
echo "Log directory: $LOGDIR"
EOF
```

## Collect the NVDebug support bundle

NVDebug is NVIDIA's diagnostic collection tool for NVIDIA server platforms. The downloadable GitHub project is named [`open-nvdebug`](https://github.com/NVIDIA/open-nvdebug); this guide refers to the tool as NVDebug. For the full vendor reference, see the [NVIDIA NVDebug User Guide](https://docs.nvidia.com/multi-node-nvlink-systems/nvdebug-guide/index.html).

For Nscale support cases, the important artifact is the generated NVDebug support bundle, not only the terminal output. The expected bundle contains both host-side logs and BMC/Redfish telemetry. Do not submit host-only or BMC-only collection unless absolutely necessary or Nscale Support explicitly asks for it.

Why this guide asks for BMC details: NVDebug can run locally without BMC credentials, but that is a reduced host-local collection. NVIDIA documents that BMC-related collectors are skipped in that mode. For normal Nscale support, provide the BMC IP or hostname and credentials so the bundle includes Redfish/BMC and IPMI evidence. In most environments, the same BMC account is used for Redfish and IPMI access.

Install `ipmitool` on the machine where NVDebug runs. In most cases, this is the affected host. Do not include this package or binary in the final support archive; it is only a runtime prerequisite for NVDebug's IPMI-over-LAN collectors.

`sshpass` is not required for the normal affected-host workflow in this guide. Nscale Support may ask for it only for a specific SSH-based collection path, such as some jumphost, BMC SSH, HMC SSH, proxy, or tunnel workflows.

In most cases, run the next two commands on the affected host. The first command downloads NVDebug from the `open-nvdebug` GitHub repository and prepares it in a local Python virtual environment. The `requirements.txt` file is included in the downloaded `open-nvdebug` package and installs NVDebug's Python dependencies. This guide uses `nvdebug-v2.1.0-release`; use a different release only if Nscale Support asks you to.

### Download NVDebug and set up the environment

```bash theme={null}
bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
NVDEBUG_TAG=nvdebug-v2.1.0-release
curl -L -o open-nvdebug.zip "https://github.com/NVIDIA/open-nvdebug/archive/refs/tags/${NVDEBUG_TAG}.zip"
unzip open-nvdebug.zip
cd "open-nvdebug-${NVDEBUG_TAG}"

python3.12 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements.txt
python -m src.tool.main --version > "$LOGDIR/nvdebug-version.txt" 2>&1
printf '%s\n' "$PWD" > /tmp/nscale_manual_nvdebug_dir.path
echo "NVDebug prepared under: $PWD"
EOF
```

### Run NVDebug and generate the support bundle

Run this block on the affected host. The affected host must be able to reach its BMC. This is the preferred collection path for most customers because NVDebug collects host logs locally and BMC/Redfish telemetry through the BMC details you enter.

Do not paste passwords directly into the command line. The block below prompts for the required BMC/Redfish/IPMI details, creates the temporary NVDebug configuration automatically, runs NVDebug, and removes the temporary config file afterward.

```bash theme={null}
bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
NVDEBUG_DIR=$(cat /tmp/nscale_manual_nvdebug_dir.path)
DUT_CONFIG="$LOGDIR/nvdebug-local-host-dut_config.json"
: > "$DUT_CONFIG"
chmod 600 "$DUT_CONFIG"

"$NVDEBUG_DIR/.venv/bin/python" - "$DUT_CONFIG" <<'PYCONFIG'
import getpass
import json
import sys
from pathlib import Path

bmc_ip = input("BMC IP or hostname: ").strip()
bmc_user = input("BMC username: ").strip()
bmc_pass = getpass.getpass("BMC password: ")
baseboard = input("Baseboard override (optional, press Enter for auto-detect): ").strip()

config = {
    "dut-1": {
        "local": True,
        "baseboard": baseboard,
        "BMC_IP": bmc_ip,
        "BMC_USERNAME": bmc_user,
        "BMC_PASSWORD": bmc_pass,
    }
}

Path(sys.argv[1]).write_text(json.dumps(config, indent=2))
PYCONFIG

cd "$NVDEBUG_DIR"
trap 'rm -f "$DUT_CONFIG"' EXIT
sudo "$PWD/.venv/bin/python" -m src.tool.main collect --dut-config "$DUT_CONFIG" -o "$LOGDIR/nvdebug-host-bmc" > "$LOGDIR/nvdebug-host-bmc.stdout.txt" 2>&1
echo "NVDebug generated files: $LOGDIR/nvdebug-host-bmc"
echo "NVDebug console log: $LOGDIR/nvdebug-host-bmc.stdout.txt"
EOF
```

<details>
  <summary>Alternative: run NVDebug from a jumphost</summary>

  Use this when you cannot run NVDebug directly on the affected host. The jumphost must be able to reach both the BMC and the affected host OS over SSH so the bundle still contains both BMC/Redfish telemetry and host-side logs.

  ```bash theme={null}
  bash <<'EOF'
  set -euo pipefail
  LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
  NVDEBUG_DIR=$(cat /tmp/nscale_manual_nvdebug_dir.path)
  DUT_CONFIG="$LOGDIR/nvdebug-jumphost-dut_config.json"
  : > "$DUT_CONFIG"
  chmod 600 "$DUT_CONFIG"

  "$NVDEBUG_DIR/.venv/bin/python" - "$DUT_CONFIG" <<'PYCONFIG'
  import getpass
  import json
  import sys
  from pathlib import Path

  bmc_ip = input("BMC IP or hostname: ").strip()
  bmc_user = input("BMC username: ").strip()
  bmc_pass = getpass.getpass("BMC password: ")
  host_ip = input("Affected host IP or hostname: ").strip()
  host_user = input("Host username with sudo access: ").strip()
  host_pass = getpass.getpass("Host password: ")
  baseboard = input("Baseboard override (optional, press Enter for auto-detect): ").strip()

  config = {
      "dut-1": {
          "baseboard": baseboard,
          "BMC_IP": bmc_ip,
          "BMC_USERNAME": bmc_user,
          "BMC_PASSWORD": bmc_pass,
          "HOST_IP": host_ip,
          "HOST_USERNAME": host_user,
          "HOST_PASSWORD": host_pass,
      }
  }

  Path(sys.argv[1]).write_text(json.dumps(config, indent=2))
  PYCONFIG

  cd "$NVDEBUG_DIR"
  trap 'rm -f "$DUT_CONFIG"' EXIT
  sudo "$PWD/.venv/bin/python" -m src.tool.main collect --dut-config "$DUT_CONFIG" -o "$LOGDIR/nvdebug-jumphost" > "$LOGDIR/nvdebug-jumphost.stdout.txt" 2>&1
  echo "NVDebug generated files: $LOGDIR/nvdebug-jumphost"
  echo "NVDebug console log: $LOGDIR/nvdebug-jumphost.stdout.txt"
  EOF
  ```
</details>

Include the full generated NVDebug output directory and its matching console log: `nvdebug-host-bmc/` plus `nvdebug-host-bmc.stdout.txt`, or `nvdebug-jumphost/` plus `nvdebug-jumphost.stdout.txt` if you used the jumphost block. If NVDebug cannot collect both host logs and BMC/Redfish telemetry, include any `nvdebug*.stdout.txt` or Python error output and tell Nscale Support which access path failed.

## Collect NVIDIA bug report

```bash theme={null}
bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
(cd "$LOGDIR" && sudo nvidia-bug-report.sh) > "$LOGDIR/nvidia-bug-report.stdout.txt" 2>&1
echo "NVIDIA bug report output saved under: $LOGDIR"
EOF
```

If `nvidia-bug-report.sh` creates `nvidia-bug-report.log.gz`, leave it in `$LOGDIR` so it is included in the final archive.

## Add failure-specific logs

| Failure type                                                                                              | Expected logs                                                                                                                                                                                                                                                                                                                 |
| --------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| GPU missing from `nvidia-smi`, GPU fallen off bus, driver error, Xid/SXid/NVRM, PCIe or enumeration issue | Standard manual collection. Include affected GPU index/UUID/serial, Xid/SXid if known, and the failure timestamp.                                                                                                                                                                                                             |
| GPU ECC, SRAM/DRAM, row remapper, page retirement, memory error, or suspected GPU memory fault            | Standard manual collection. Include affected GPU index/UUID/serial, observed ECC or row-remapper alert, and workload impact.                                                                                                                                                                                                  |
| NVLink, NVSwitch, Fabric Manager, IMEX, NVLSM, OpenSM, Fabric state, training, Xid 145, or Xid 149 issue  | Standard manual collection from each affected compute host. For NVL systems, include switch-side NVOS tech-support or NVSwitch logs when available or requested.                                                                                                                                                              |
| Thermal, fan, power, throttling, PSU, or voltage issue                                                    | Standard manual collection. Include the NVDebug Redfish/BMC bundle when BMC access is available, plus facility alarms, sensor exports, or photos/screenshots when useful.                                                                                                                                                     |
| Unexpected reboot, node power loss, power cycle, or suspected BMC event                                   | Standard manual collection after the host is back online. Include the NVDebug Redfish/BMC bundle when BMC access is available. Include exact reboot or outage time if known.                                                                                                                                                  |
| NVMe, disk I/O, filesystem, Linux MD RAID, or hardware RAID issue                                         | Standard manual collection. Include the affected device path, mount point, or volume if known. Nscale Support may request storage-specific command output after initial triage.                                                                                                                                               |
| System memory, CPU, MCE, EDAC, RAS, OOM, or host-level thermal issue                                      | Standard manual collection. Include any visible machine-check, EDAC, OOM, DIMM slot, CPU socket, or sensor alert details.                                                                                                                                                                                                     |
| Ethernet switch device issue                                                                              | Device logs, CLI command output, interface state and error counters, environmental or hardware alerts, and screenshots when available. Include switch asset, site, component, current state, and impact.                                                                                                                      |
| Ethernet link issue                                                                                       | Endpoint A and Port A, Endpoint B and Port B if known, plus command output for both sides where available: `net show interface <swp interface number> detail`, `net show interface pluggables`, and `sudo l1-show <swp interface number>`. Include link state, counters, cleaning or optics replacement attempts, and impact. |
| InfiniBand switch device issue                                                                            | Device logs, UFM or fabric health evidence, port state and error counters, environmental or hardware alerts, and screenshots when available. Include switch asset, site, component, current state, and impact.                                                                                                                |
| InfiniBand link issue                                                                                     | Endpoint A and Port A, Endpoint B and Port B if known, UFM evidence, and `mlxlink -d lid <lid> -p <physical_port_no> -c` output. Note whether the port was isolated in UFM and whether `perfquery -x <lid> <logical_port>` was run as part of your normal process.                                                            |
| Rack issue                                                                                                | Rack ID, affected rack unit/side/bay/PDU, photos or screenshots, related device logs, power or environmental evidence, and any impacted host logs.                                                                                                                                                                            |
| Facility, cooling, rPDU, or widespread power issue                                                        | Site, affected racks/rows/clusters/devices, start time, current state, alarm screenshots, rPDU or facility exports, photos when useful, and sample host logs from impacted nodes.                                                                                                                                             |

## Targeted add-on commands

Run these only when the issue type matches or Nscale Support asks for them. Skip commands that are not available on the host and note that they were unavailable in the support request.

### GPU missing, Xid/SXid, PCIe, NVLink, or fabric

```bash theme={null}
bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
nvidia-smi -q > nvidia-smi-q.txt 2>&1
nvidia-smi topo -m > nvidia-smi-topo-m.txt 2>&1
nvidia-smi nvlink -s > nvidia-smi-nvlink-s.txt 2>&1
nvidia-smi nvlink --errorcounters > nvidia-smi-nvlink-errorcounters.txt 2>&1
nvidia-smi --query-gpu=index,name,uuid,serial,pci.bus_id,temperature.gpu,memory.total,memory.used,memory.free,power.draw --format=csv > nvidia-smi-query-gpu-health.csv 2>&1
nvidia-smi --query-gpu=index,serial,uuid,pci.bus_id,fabric.state,fabric.status --format=csv > nvidia-smi-query-fabric-status.csv 2>&1
sudo dmesg -T | grep -Ei 'Xid|SXid|NVRM|fallen off|PCIe|AER|nvidia|nvswitch|nvlink|fabric' > dmesg-gpu-fabric-filtered.txt 2>&1
EOF
```

**Active diagnostics:** Run `dcgmi diag -r 4` only when Nscale Support asks for it or when the host is available for a long-running active diagnostic.

```bash theme={null}
bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
sudo dcgmi diag -r 4 > dcgmi-diag-r4.txt 2>&1
EOF
```

### GPU ECC, memory, row remapper, or page retirement

```bash theme={null}
bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
nvidia-smi -q -d ECC > nvidia-smi-q-d-ecc.txt 2>&1
nvidia-smi -q -d MEMORY > nvidia-smi-q-d-memory.txt 2>&1
nvidia-smi -q -d ROW_REMAPPER > nvidia-smi-q-d-row-remapper.txt 2>&1
nvidia-smi -q -d PAGE_RETIREMENT > nvidia-smi-q-d-page-retirement.txt 2>&1
sudo dmesg -T | grep -Ei 'ECC|row|remap|retire|SRAM|DRAM|memory error|Xid|NVRM|MCE|EDAC|RAS' > dmesg-memory-ecc-filtered.txt 2>&1
EOF
```

### Unexpected reboot, power cycle, or host reset

```bash theme={null}
bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
sudo journalctl --list-boots --no-pager > journalctl-list-boots.txt 2>&1
last -x reboot shutdown -F > last-reboot-shutdown.txt 2>&1
who -b > who-b.txt 2>&1
uptime -s > uptime-s.txt 2>&1
EOF
```

### NVMe, disk, filesystem, or RAID

```bash theme={null}
bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
df -hT > df-hT.txt 2>&1
lsblk -a -o NAME,KNAME,PATH,SIZE,TYPE,FSTYPE,MOUNTPOINTS,MODEL,SERIAL,WWN,VENDOR,STATE,TRAN > lsblk.txt 2>&1
cat /proc/mdstat > proc-mdstat.txt 2>&1
nvme list > nvme-list.txt 2>&1
for dev in /dev/nvme*n1; do [ -e "$dev" ] || continue; sudo nvme smart-log "$dev"; done > nvme-smart-log.txt 2>&1
for dev in /dev/nvme*n1; do [ -e "$dev" ] || continue; sudo nvme error-log "$dev"; done > nvme-error-log.txt 2>&1
sudo dmesg -T | grep -Ei 'I/O error|blk_update_request|buffer I/O|medium error|critical warning|nvme|md/|raid|scsi|sas|sata|smart|xfs|ext4|btrfs|reset|timeout' > dmesg-storage-filtered.txt 2>&1
EOF
```

### System memory, CPU, MCE, EDAC, RAS, OOM, or host thermal

```bash theme={null}
bash <<'EOF'
set -u
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
cd "$LOGDIR"
free -h > free-h.txt 2>&1
cat /proc/meminfo > proc-meminfo.txt 2>&1
lscpu > lscpu.txt 2>&1
sudo dmidecode -t memory > dmidecode-memory.txt 2>&1
sensors > sensors.txt 2>&1
sudo dmesg -T | grep -Ei 'MCE|EDAC|RAS|ECC|memory error|hardware error|machine check|out of memory|oom|thermal|overheat|temperature|throttl' > dmesg-memory-ras-thermal-filtered.txt 2>&1
EOF
```

## Create the archive

```bash theme={null}
bash <<'EOF'
set -euo pipefail
LOGDIR=$(cat /tmp/nscale_manual_logdir.path)
ARCHIVE=/tmp/$(basename "$LOGDIR").tar.gz
sudo tar -czf "$ARCHIVE" -C /tmp "$(basename "$LOGDIR")"
sudo chown "$(id -u):$(id -g)" "$ARCHIVE" 2>/dev/null || true
echo "$ARCHIVE"
EOF
```

Attach the `.tar.gz` archive to tickets created in Jira with the context listed at the top of this page.

## Alternative

Customers who approve the use of the packaged automated collector can use the Nscale Log Gatherer to collect the same diagnostic information in a single support bundle. Because the Nscale Log Gatherer is provided as a Python script, you can easily review its contents to verify exactly what data is collected before executing it.

## Data notice

Support bundles may include hostnames, IP addresses, serial numbers, kernel logs, package versions, NVIDIA diagnostic output, BMC details, storage health data, and other operational metadata. Review the archive before sharing it if your environment requires data screening.
