On 16.05.22 at 16:12, Andrey Grodzovsky wrote:
>
> Ping

Ah, yes, sorry.

> Andrey
>
> On 2022-05-13 11:41, Andrey Grodzovsky wrote:
>>> Yes, exactly, that's the idea.
>>>
>>> Basically the reset domain knows which amdgpu devices it needs to
>>> reset together.
>>>
>>> Whether you then represent that so that you always have a hive even when
>>> you only have one device in it, or you put an array of devices
>>> which need to be reset together into the reset domain, doesn't matter.
>>>
>>> Maybe go for the latter approach, that is probably a bit cleaner and
>>> less code to change.
>>>
>>> Christian.
>>
>>
>> Unfortunately this approach also raises a few difficulties -
>> First - if we hold an array of devices in the reset_domain, then when you
>> come to the GPU reset function you don't really know which adev is the
>> one that triggered the reset, and this is actually essential to some
>> procedures like emergency restart.

What is "emergency restart"? That's not a requirement I know about.

>>
>> Second - in the XGMI case we must take into account that one of the hive
>> members might go away at runtime (I could do echo 1 >
>> /sysfs/pci_id/remove on it, for example, at any moment) - so now we
>> need to maintain this array and probably mark such an entry with NULL on
>> XGMI node removal, and then there might be hot insertion, and all of
>> this adds more complications.
>>
>> I now tend to prefer your initial solution for its simplicity, and
>> the result will be what we need -
>>
>> "E.g. in the reset code (either before or after the reset, that's
>> debatable) you do something like this:
>>
>> for (i = 0; i < num_ring; ++i)
>> cancel_delayed_work(ring[i]->scheduler....)
>> cancel_work(adev->ras_work);
>> cancel_work(adev->iofault_work);
>> cancel_work(adev->debugfs_work);
>> "

Works for me. I already expected that switching the reset over to be
based on the reset context wouldn't be that easy.

Regards,
Christian.

>>
>> Let me know what you think.
>>
>> Andrey
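
For reference, below is a minimal sketch of what the per-device cancellation
approach quoted above could look like in kernel C. The struct layout and field
names (my_adev, my_ring, tdr_work, ras_work, iofault_work, debugfs_work,
MY_MAX_RINGS) are hypothetical stand-ins that only mirror the shape of the
pseudocode in the quote; the real amdgpu structures and work item names differ.
It also uses the _sync cancellation variants as an assumption, so that a work
item already running is waited for instead of racing with the reset.

/*
 * Hypothetical sketch only: the struct layout and field names below are
 * illustrative and do not match the real amdgpu structures.
 */
#include <linux/workqueue.h>

#define MY_MAX_RINGS 28

struct my_ring {
	struct delayed_work tdr_work;		/* per-ring timeout/recovery work */
};

struct my_adev {
	struct my_ring *rings[MY_MAX_RINGS];
	int num_rings;
	struct work_struct ras_work;		/* RAS error handling work */
	struct work_struct iofault_work;	/* IO fault handling work */
	struct work_struct debugfs_work;	/* debugfs-triggered work */
};

/*
 * Cancel everything on this device that could race with the reset.
 * Called either right before or right after the actual reset, as
 * discussed in the thread.
 */
static void my_device_cancel_pending_work(struct my_adev *adev)
{
	int i;

	/* Per-ring delayed work (e.g. scheduler timeout handling). */
	for (i = 0; i < adev->num_rings; ++i)
		cancel_delayed_work_sync(&adev->rings[i]->tdr_work);

	/* Device-wide work items that must not run during the reset. */
	cancel_work_sync(&adev->ras_work);
	cancel_work_sync(&adev->iofault_work);
	cancel_work_sync(&adev->debugfs_work);
}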