After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?
An engineer needs to verify NVLink isolation on a single node with 8 GPUs. Which NCCL test configuration stresses switch bisection bandwidth?
After running a 24-hour stress test on a DGX node, which two key metrics should the administrator verify to ensure system stability?
An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?
A 24-hour HPL burn-in fails with "illegal value" errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?
You are following the official steps to install the NVIDIA Container Toolkit using a package manager on Ubuntu. After importing the NVIDIA package repository and GPG key, what is the next action?
As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?
After configuring HA, the administrator runs cmha status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?
A system administrator is installing a GPU into a server and needs to avoid damaging the device. What item should be used?
Two ports must be connected, but one is SFP and the other is QSFP: for example, a 25 GbE host channel adapter must connect to a QSFP port capable of both 100 GbE and 25 GbE. Which of the following solutions would best meet this requirement?