Ping

Andrey

On 2022-05-13 11:41, Andrey Grodzovsky wrote:
>> Yes, exactly, that's the idea.
>>
>> Basically the reset domain knows which amdgpu devices it needs to
>> reset together.
>>
>> Whether you then represent that so that you always have a hive, even
>> when you only have one device in it, or you put an array of devices
>> which need to be reset together into the reset domain doesn't matter.
>>
>> Maybe go for the latter approach, that is probably a bit cleaner and
>> less code to change.
>>
>> Christian.
>
>
> Unfortunately this approach also raises a few difficulties -
> First - if holding an array of devices in the reset_domain, then when
> you come to the GPU reset function you don't really know which adev is
> the one that triggered the reset, and this is actually essential to
> some procedures like emergency restart.
>
> Second - in the XGMI case we must take into account that one of the
> hive members might go away at runtime (I could do
> echo 1 > /sysfs/pci_id/remove on it, for example, at any moment) - so
> now we need to maintain this array and probably mark such an entry
> with NULL on XGMI node removal, and then there might be hot insertion,
> and all this adds more complications.
>
> I now tend to prefer your initial solution for its simplicity, and the
> result will be what we need -
>
> "E.g. in the reset code (either before or after the reset, that's
> debatable) you do something like this:
>
> for (i = 0; i < num_ring; ++i)
> 	cancel_delayed_work(ring[i]->scheduler....)
> cancel_work(adev->ras_work);
> cancel_work(adev->iofault_work);
> cancel_work(adev->debugfs_work);
> "
>
> Let me know what you think.
>
> Andrey
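
P.S. For reference, a rough, untested sketch of what that cancellation pass
could look like in the reset path. The per-ring part relies on the
scheduler's work_tdr delayed work; the function name and the
ras_work/iofault_work/debugfs_work members are only the placeholder names
from the sketch quoted above, not necessarily the actual fields in struct
amdgpu_device:

	/* Untested sketch: cancel pending work before (or after) the reset. */
	static void amdgpu_device_cancel_pending_work(struct amdgpu_device *adev)
	{
		int i;

		/* Stop each ring's scheduler timeout handler (TDR) so it
		 * cannot race with the reset.
		 */
		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
			struct amdgpu_ring *ring = adev->rings[i];

			if (!ring || !ring->sched.thread)
				continue;

			cancel_delayed_work_sync(&ring->sched.work_tdr);
		}

		/* Placeholder work items taken from the quoted sketch; the
		 * real field names in struct amdgpu_device may differ.
		 */
		cancel_work(&adev->ras_work);
		cancel_work(&adev->iofault_work);
		cancel_work(&adev->debugfs_work);
	}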