NVIDIA’s RTX 5090 and RTX PRO 6000 Encounter Virtualization Reset Vulnerability
Summary:
- NVIDIA’s RTX 5090 and RTX PRO 6000 graphics cards have been found to have a critical virtualization reset vulnerability.
- CloudRift, a GPU cloud service provider, has confirmed the issue in production environments and is offering a $1,000 reward for solutions.
- User reports indicate that the problem is potentially limited to NVIDIA’s Blackwell series, with no similar issues observed in earlier models.
On September 7, it was reported that NVIDIA’s latest graphics cards, the RTX 5090 and RTX PRO 6000, are suffering from a virtualization reset vulnerability. This critical issue renders the graphics card completely unresponsive until the host system is physically restarted.
Identifying the Vulnerability
CloudRift, a prominent GPU cloud service provider, discovered this problem on multiple systems equipped with NVIDIA’s Blackwell chips in operational settings. Following their comprehensive analysis, they have released detailed reports and are actively seeking community assistance to identify the root cause of this issue. To incentivize potential solutions, CloudRift has announced a reward of $1,000.
The vulnerability arises when the GPU is passed through to a virtual machine via Kernel-based Virtual Machine (KVM) and Virtual Function I/O (VFIO). When the virtual machine is shut down or the GPU is reassigned, the host system attempts a PCI Express Function Level Reset (FLR). The reset fails: the GPU does not return to its normal operational state, and the kernel logs an error along the lines of "not ready 65535ms after FLR; giving up."
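For hosts that want to detect this condition automatically, the FLR timeout can be spotted by scanning kernel log text. The following is a minimal Python sketch, not a tool from CloudRift or NVIDIA: the function name find_flr_timeouts is hypothetical, the sample line is made up for illustration, and the pattern is kept loose because the exact message wording is assumed and may vary between kernel versions.

```python
import re

# Assumption: the kernel line contains a PCI address (domain:bus:device.function)
# followed later by "<N>ms after FLR" or "<N> milliseconds after FLR".
FLR_TIMEOUT = re.compile(
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]).*?"
    r"(?P<ms>\d+)\s*(?:ms|milliseconds) after FLR"
)


def find_flr_timeouts(log_text):
    """Return (pci_address, milliseconds) for each FLR timeout line found."""
    hits = []
    for line in log_text.splitlines():
        match = FLR_TIMEOUT.search(line)
        if match:
            hits.append((match.group("bdf"), int(match.group("ms"))))
    return hits


if __name__ == "__main__":
    # Illustrative line only; real output would come from dmesg or journalctl -k.
    sample = "vfio-pci 0000:01:00.0: not ready 65535ms after FLR; giving up"
    print(find_flr_timeouts(sample))
```

In practice the input would be the captured output of dmesg or journalctl -k, and a non-empty result would indicate that the affected GPU should be taken out of rotation until the host is power-cycled.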
Impact on Functionality
Once the GPU hangs in this manner, it becomes unreadable to tools like lspci, which subsequently reports an "Unknown header type 7f" error. The only recourse available to users facing this issue is to power down the entire machine and restart it, a solution that is far from ideal for any critical work environment.
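The "header type 7f" reading is what one would expect when every configuration-space read returns all ones: the header type byte sits at offset 0x0E, and masking 0xFF with 0x7F (bit 7 is the multi-function flag) gives 0x7F. The Python sketch below illustrates that interpretation under stated assumptions. The function names describe_config and check_device are hypothetical, and the sysfs path assumes a Linux host; it is not the diagnostic used by CloudRift or Tiny Corp.

```python
from pathlib import Path

HEADER_TYPE_OFFSET = 0x0E  # standard PCI config space offset of the header type byte


def describe_config(config):
    """Classify the first bytes of a PCI configuration space dump."""
    if len(config) <= HEADER_TYPE_OFFSET:
        return "config space too short to inspect"
    vendor_id = int.from_bytes(config[0:2], "little")
    header_type = config[HEADER_TYPE_OFFSET] & 0x7F  # drop the multi-function bit
    if vendor_id == 0xFFFF:
        # All-ones reads usually mean the device is not answering on the bus.
        return f"no response (all-ones read, header type {header_type:02x})"
    if header_type > 0x02:
        # Defined layouts are 00 (endpoint), 01 (PCI bridge), 02 (CardBus bridge).
        return f"unknown header type {header_type:02x}"
    return f"responding (vendor {vendor_id:04x}, header type {header_type:02x})"


def check_device(bdf):
    """Read config space from sysfs for an address such as '0000:01:00.0'."""
    path = Path("/sys/bus/pci/devices") / bdf / "config"
    return describe_config(path.read_bytes())
```

Calling check_device with the GPU's PCI address before and after a VM shutdown would show whether the card is still answering configuration reads; the address itself has to be supplied by the user.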
AI startup Tiny Corp has echoed CloudRift’s findings, further questioning whether the RTX 5090 and RTX PRO 6000 possess inherent hardware flaws. Their investigations have so far yielded no answers, leaving both companies and the user community in a state of uncertainty.
Community Feedback
As the problem continues to gain visibility, user reports from various forums have highlighted similar experiences. Home users and early adopters of the RTX 5090 have noted that shutting down a Windows virtual machine can lead to the entire host system freezing, with even operating system restarts failing to reinitialize the GPU. This widespread issue has triggered an ongoing discussion among users seeking solutions.
Attempts to mitigate the problem by modifying PCIe Active State Power Management (ASPM) or Access Control Services (ACS) settings have proven ineffective. Notably, no similar issues have been reported for older graphics cards like the RTX 4090, suggesting that the vulnerability may be confined to NVIDIA’s Blackwell series.
Conclusion
While NVIDIA continues to advertise its cutting-edge technology, the emergence of the virtualization reset vulnerability highlights potential shortcomings in the RTX 5090 and RTX PRO 6000. As both CloudRift and Tiny Corp work to unravel the complexities surrounding this issue, the impact on user experience and operational continuity remains a pressing concern.
For users operating in environments reliant on GPU virtualization, continual monitoring of this situation is advised. The promise of a $1,000 reward from CloudRift may inspire further investigation into this significant issue, potentially leading to solutions that restore functionality and reliability to affected systems.
The coming weeks will be crucial as the community works to identify the root cause, with consequences for both individual users and the GPU cloud services that rely on NVIDIA’s latest cards.