Early AMD GCN GPUs Seeing Improved GPU Recovery - Another Valve-Led Linux Improvement

Timur Kristóf of Valve's Linux open-source GPU driver team has been responsible for many of the recent improvements to early GCN GPUs, especially getting GCN 1.0 and GCN 1.1 GPUs onto the AMDGPU driver by default rather than the prior Radeon driver. The latest work he is leading enables soft reset support for the GFX IP block.
This initial GFX IP block soft reset support targets AMD "GFX8" graphics IP, which covers the likes of Polaris, Fiji, Tonga, and Carrizo hardware. But Timur mentioned to Phoronix that he is also looking at implementing this functionality for older GCN hardware, so GPUs potentially going back to GCN 1.0 could enjoy this better reset experience.
Timur noted in the patch series reworking the AMDGPU reset handling:
"IP block soft reset is a way to reset just one IP block in a GPU without resetting the whole GPU or losing the contents of VRAM. Currently this is implemented for various IP blocks, but actually only used on Carrizo and Stoney as part of the ASIC reset code, and it fails.
Let's rework that.
Delete the defunct code from the ASIC reset code path. Also delete check_soft_reset() and pre/post_soft_reset() which were quite useless and redundant (see the commit messages for details).
Add IP block soft reset as a GPU recovery method instead. This works similarly to ring reset, but will affect all rings that belong to the IP block. For example, a GFXIP block soft reset will affect all graphics and compute rings. It is called when a job is timed out. Attempt to minimize the effect on non-guilty jobs, then back up the contents of all affected rings, perform the HW specific soft reset, then restore the rings. For this, I am also including some patches from Alex which were written for pipe reset and solve some problems also for IP block soft reset.
Finally, let's fix up the soft reset implementation on GFX8 to make sure it works on every GFX8 chip. Specifically, fix an issue with compute rings hanging after the reset, and fix an issue with increased power consumption after the reset, among others. With those issues gone, enable the new GPU recovery method on GFX8."
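The recovery sequence described in the quote (back up the affected rings, perform the soft reset, restore the rings) can be illustrated with a small toy model in C. This is only a sketch under stated assumptions and not the AMDGPU code: the `toy_ring` and `toy_ip_block` structures and the `toy_hw_soft_reset()` and `toy_ip_block_recover()` functions are hypothetical names invented for illustration, and the choice to simply drop the guilty ring's pending commands is an assumption about how the effect on non-guilty jobs might be minimized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RING_DWORDS 8
#define MAX_RINGS   4

/* Hypothetical model of one ring; NOT the kernel's real ring structure. */
struct toy_ring {
    unsigned int buf[RING_DWORDS]; /* pending command words */
    bool guilty;                   /* ring whose job timed out */
    bool hung;                     /* true while the block is wedged */
};

/* Hypothetical model of an IP block that owns several rings. */
struct toy_ip_block {
    struct toy_ring rings[MAX_RINGS];
    size_t num_rings;
    unsigned int vram_cookie;      /* stands in for VRAM contents */
};

/* Stand-in for the hardware-specific soft reset: it wipes the state of
 * every ring in the block but deliberately leaves vram_cookie alone. */
static void toy_hw_soft_reset(struct toy_ip_block *blk)
{
    for (size_t i = 0; i < blk->num_rings; i++) {
        memset(blk->rings[i].buf, 0, sizeof(blk->rings[i].buf));
        blk->rings[i].hung = false;
    }
}

/* Recovery after a job timeout: back up all affected rings, reset the
 * block, then restore the rings. Assumption: the guilty ring's commands
 * are dropped rather than replayed. */
static void toy_ip_block_recover(struct toy_ip_block *blk)
{
    unsigned int backup[MAX_RINGS][RING_DWORDS];

    assert(blk->num_rings <= MAX_RINGS);

    /* 1. Back up the contents of every ring in the block. */
    for (size_t i = 0; i < blk->num_rings; i++)
        memcpy(backup[i], blk->rings[i].buf, sizeof(backup[i]));

    /* 2. Reset only this IP block, not the whole GPU. */
    toy_hw_soft_reset(blk);

    /* 3. Restore the non-guilty rings and clear the guilty flag. */
    for (size_t i = 0; i < blk->num_rings; i++) {
        if (!blk->rings[i].guilty)
            memcpy(blk->rings[i].buf, backup[i], sizeof(backup[i]));
        blk->rings[i].guilty = false;
    }
}
```

In this toy model, calling `toy_ip_block_recover()` on a block with one guilty ring and one innocent ring leaves the innocent ring's commands and the `vram_cookie` value intact, which mirrors the stated benefit over a full GPU reset.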
Being able to reset just the graphics IP block without losing the VRAM contents would be a big win over the status quo. Others on Valve's Linux team helped out by coming up with a Vulkan test case that intentionally hangs the command processor, which would typically trigger a hard reset.
Hopefully this GFX8 reset work will soon be mature enough to reach the mainline AMDGPU driver, and it won't be too long before this GFX IP soft reset support is expanded to even older Radeon GPUs.
