On 2022-05-16 11:08, Christian König wrote: > Am 16.05.22 um 16:12 schrieb Andrey Grodzovsky: >> >> Ping >> > > Ah, yes sorry. > >> Andrey >> >> On 2022-05-13 11:41, Andrey Grodzovsky wrote: >>>> Yes, exactly that's the idea. >>>> >>>> Basically the reset domain knowns which amdgpu devices it needs to >>>> reset together. >>>> >>>> If you then represent that so that you always have a hive even when >>>> you only have one device in it, or if you put an array of devices >>>> which needs to be reset together into the reset domain doesn't matter. >>>> >>>> Maybe go for the later approach, that is probably a bit cleaner and >>>> less code to change. >>>> >>>> Christian. >>> >>> >>> Unfortunately this approach raises also a few  difficulties - >>> First - if holding array of devices in reset_domain then when you >>> come to GPU reset function you don't really know which adev is the >>> one triggered the reset and this is actually essential to some >>> procedures like emergency restart. > > What is "emergency restart"? That's not some requirement I know about. Emergency restart is something u can see at the beginning of amdgpu_gpu_recover function - it's a historical work around for some type of ASICs who weren't able to do full reset I think.  We should eventually remove it bu for now I thin it's still in use. > >>> >>> Second - in XGMI case we must take into account that one of the hive >>> members might go away in runtime (i could do echo 1 > >>> /sysfs/pci_id/remove on it for example at any moment) - so now we >>> need to maintain this array and mark such entry with NULL probably >>> on XGMI node removal , and then there might be hot insertion and all >>> this adds more complications. >>> >>> I now tend to prefer your initial solution for it's simplicity and >>> the result will be what we need - >>> >>> "E.g. in the reset code (either before or after the reset, that's >>> debatable) you do something like this: >>> >>> for (i = 0; i < num_ring; ++i) >>> cancel_delayed_work(ring[i]->scheduler....) >>> cancel_work(adev->ras_work); >>> cancel_work(adev->iofault_work); >>> cancel_work(adev->debugfs_work); >>> " > > Works for me. I already expected that switching over the reset to be > based on the reset context wouldn't be that easy. > > Regards, > Christian. Ok - i will resend a patch. Andrey > >>> >>> Let me know what you think. >>> >>> Andrey >