https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #9 from Alex Deucher ---
(In reply to James.Dutton from comment #8)
> Thank you for the feedback.
> Is there a data sheet somewhere that might help me work out a fix for this.
> What I would like is:
> 1) A way to scan all the engines and detect which ones have hung.

If the gpu scheduler for a queue on a particular engine times out, you can be
pretty sure the engine has hung. At that point you can check the current busy
status for the block (IP is_idle() callback).

> 2) A way to intentionally halt an engine and tidy up. So that the modprobe,
> rmmod, modprobe scenario works.

That is the hw_fini() IP callback.

> 3) data sheet details regarding how to un-hang each engine.
> Specifically, in this case the IP block .

Each IP has a soft reset (implemented via the IP soft_reset() callback), but
depending on the hang, you may have to do a full GPU reset to recover. In this
case, though, it is not a hw hang, it's a sw deadlock.

>
> Maybe that is not possible, and (I think you are hinting at it), one cannot
> reset an individual IP block. So the approach is to suspend the card, and
> then do a full reset of the entire card, then resume.

All asics support full GPU reset, which is implemented via the SOC level
amdgpu_asic_funcs reset() callback.

>
> I think a different suspend process would be better.
> We have a for_each within the suspend code. The output of that code should
> not be a single error code, but instead an array indicating the current
> state of each engine (running/hung), the intended state and status of
> whether the intention worked or failed. If the loop through the for_each, it
> could compare the current state and intended state, and attempt to reach the
> intended state, and report an error code for each engine. Then the code to
> achieve the transition can been different depending on the current ->
> intended transition.
> i.e.
> code for running -> suspended, can be different than code for hung ->
> suspended. The code already needs to know which engines are enabled/disabled
> (Vega 56 vs Vega 64)

We don't really care if the suspend fails or not. See
amdgpu_device_gpu_recover() for the full sequence.

>
> I can hang this IP block at will. I have 2 games that hang it within
> seconds of starting.

There was a deadlock in the dm code which has been fixed. Please try a new
code base, e.g.,
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip
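
For illustration only, the escalation described in the replies above (job
timeout, then check is_idle(), then soft_reset(), then fall back to a full
GPU reset) can be sketched as a small standalone C mock. This is not the
driver code: struct mock_ip_block, mock_handle_timeout(), mock_asic_reset()
and the RECOVER_* values are hypothetical names made up for this sketch, and
the "hardware" is just a pair of booleans. The real sequence is in
amdgpu_device_gpu_recover().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an IP block; NOT the real amdgpu structure. */
struct mock_ip_block {
	const char *name;
	bool busy;             /* simulated "block is still busy" state */
	bool soft_reset_works; /* does a soft reset clear this particular hang? */
	bool (*is_idle)(struct mock_ip_block *ip);
	int (*soft_reset)(struct mock_ip_block *ip);
};

enum recover_result {
	RECOVER_HW_IDLE,    /* block is idle: not a hw hang, look at sw instead */
	RECOVER_SOFT_RESET, /* per-IP soft reset was enough */
	RECOVER_FULL_RESET, /* had to reset the whole (mock) GPU */
};

/* Mock is_idle(): the block is idle when it is not busy. */
bool mock_is_idle(struct mock_ip_block *ip)
{
	return !ip->busy;
}

/* Mock soft_reset(): only clears the hang if this hang is recoverable. */
int mock_soft_reset(struct mock_ip_block *ip)
{
	if (ip->soft_reset_works)
		ip->busy = false;
	return 0;
}

/* Mock of a SOC level reset: every block comes back idle. */
int mock_asic_reset(struct mock_ip_block *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		blocks[i].busy = false;
	return 0;
}

/*
 * Called when the scheduler reports a timeout on blocks[hung].
 * Tries the cheapest recovery first and escalates only if needed.
 */
enum recover_result mock_handle_timeout(struct mock_ip_block *blocks,
					size_t n, size_t hung)
{
	assert(blocks != NULL && hung < n);

	struct mock_ip_block *ip = &blocks[hung];

	if (ip->is_idle(ip))
		return RECOVER_HW_IDLE;

	ip->soft_reset(ip);
	if (ip->is_idle(ip))
		return RECOVER_SOFT_RESET;

	mock_asic_reset(blocks, n);
	return RECOVER_FULL_RESET;
}
```

In this mock, a block whose soft_reset_works flag is set recovers with
RECOVER_SOFT_RESET and leaves the other blocks untouched, while one with the
flag cleared forces RECOVER_FULL_RESET, which resets every block. A timeout on
a block that already reports idle returns RECOVER_HW_IDLE, which corresponds
to the "sw deadlock, not a hw hang" case above.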