On 16.05.22 at 16:12, Andrey Grodzovsky wrote:
>
> Ping

Ah, yes, sorry.

> Andrey
>
> On 2022-05-13 11:41, Andrey Grodzovsky wrote:
>>> Yes, exactly, that's the idea.
>>>
>>> Basically the reset domain knows which amdgpu devices it needs to
>>> reset together.
>>>
>>> Whether you then represent that so that you always have a hive even when
>>> you only have one device in it, or you put an array of devices
>>> which need to be reset together into the reset domain, doesn't matter.
>>>
>>> Maybe go for the latter approach, that is probably a bit cleaner and
>>> less code to change.
>>>
>>> Christian.
>>
>>
>> Unfortunately this approach also raises a few difficulties -
>> First - if we hold an array of devices in the reset_domain, then when you
>> come to the GPU reset function you don't really know which adev is the
>> one that triggered the reset, and this is actually essential to some
>> procedures like emergency restart.

What is "emergency restart"? That's not a requirement I know about.

>>
>> Second - in the XGMI case we must take into account that one of the hive
>> members might go away at runtime (I could do echo 1 >
>> /sysfs/pci_id/remove on it, for example, at any moment) - so now we
>> need to maintain this array and probably mark such an entry with NULL on
>> XGMI node removal, and then there might be hot insertion, and all of
>> this adds more complications.
>>
>> I now tend to prefer your initial solution for its simplicity, and
>> the result will be what we need -
>>
>> "E.g. in the reset code (either before or after the reset, that's
>> debatable) you do something like this:
>>
>> for (i = 0; i < num_ring; ++i)
>> cancel_delayed_work(ring[i]->scheduler....)
>> cancel_work(adev->ras_work);
>> cancel_work(adev->iofault_work);
>> cancel_work(adev->debugfs_work);
>> "

Works for me. I already expected that switching the reset over to be
based on the reset context wouldn't be that easy.

Regards,
Christian.

>>
>> Let me know what you think.
>>
>> Andrey
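
For reference, below is a minimal sketch of what the per-device cancellation
approach quoted above could look like in kernel C. The struct layout and field
names (my_adev, my_ring, tdr_work, ras_work, iofault_work, debugfs_work,
MY_MAX_RINGS) are hypothetical stand-ins that only mirror the shape of the
pseudocode in the quote; the real amdgpu structures and work item names differ.
It also uses the _sync cancellation variants as an assumption, so that a work
item already running is waited for instead of racing with the reset.

/*
 * Hypothetical sketch only: the struct layout and field names below are
 * illustrative and do not match the real amdgpu structures.
 */
#include <linux/workqueue.h>

#define MY_MAX_RINGS 28

struct my_ring {
	struct delayed_work tdr_work;		/* per-ring timeout/recovery work */
};

struct my_adev {
	struct my_ring *rings[MY_MAX_RINGS];
	int num_rings;
	struct work_struct ras_work;		/* RAS error handling work */
	struct work_struct iofault_work;	/* IO fault handling work */
	struct work_struct debugfs_work;	/* debugfs-triggered work */
};

/*
 * Cancel everything on this device that could race with the reset.
 * Called either right before or right after the actual reset, as
 * discussed in the thread.
 */
static void my_device_cancel_pending_work(struct my_adev *adev)
{
	int i;

	/* Per-ring delayed work (e.g. scheduler timeout handling). */
	for (i = 0; i < adev->num_rings; ++i)
		cancel_delayed_work_sync(&adev->rings[i]->tdr_work);

	/* Device-wide work items that must not run during the reset. */
	cancel_work_sync(&adev->ras_work);
	cancel_work_sync(&adev->iofault_work);
	cancel_work_sync(&adev->debugfs_work);
}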