amd-gfx.lists.freedesktop.org archive mirror
From: "Christian König" <christian.koenig@amd.com>
To: "Andrey Grodzovsky" <andrey.grodzovsky@amd.com>,
	"Lazar, Lijo" <lijo.lazar@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Felix Kuehling" <felix.kuehling@amd.com>,
	amd-gfx@lists.freedesktop.org
Cc: Bai Zoy <Zoy.Bai@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.
Date: Mon, 16 May 2022 17:08:27 +0200
Message-ID: <6cf84065-6341-3f96-4e09-ab71796241ec@amd.com>
In-Reply-To: <cdce4608-9ade-ac23-b957-6d38a3e2b55a@amd.com>


On 16.05.22 at 16:12, Andrey Grodzovsky wrote:
>
> Ping
>

Ah, yes sorry.

> Andrey
>
> On 2022-05-13 11:41, Andrey Grodzovsky wrote:
>>> Yes, exactly that's the idea.
>>>
>>> Basically the reset domain knows which amdgpu devices it needs to
>>> reset together.
>>>
>>> Whether you represent that by always having a hive, even when it
>>> contains only one device, or by putting an array of the devices which
>>> need to be reset together into the reset domain, doesn't matter.
>>>
>>> Maybe go for the latter approach; that is probably a bit cleaner and
>>> less code to change.
>>>
>>> Christian.
>>
>>
>> Unfortunately this approach also raises a few difficulties.
>> First, if the reset_domain holds an array of devices, then when you
>> reach the GPU reset function you don't really know which adev is the
>> one that triggered the reset, and this is actually essential to some
>> procedures like emergency restart.

What is "emergency restart"? That's not some requirement I know about.

>>
>> Second, in the XGMI case we must take into account that one of the hive
>> members might go away at runtime (I could do echo 1 >
>> /sysfs/pci_id/remove on it at any moment, for example), so now we
>> need to maintain this array and probably mark such an entry with NULL on
>> XGMI node removal, and then there might be hot insertion, and all
>> this adds more complications.
>>
>> I now tend to prefer your initial solution for its simplicity, and
>> the result will be what we need:
>>
>> "E.g. in the reset code (either before or after the reset, that's 
>> debatable) you do something like this:
>>
>> for (i = 0; i < num_ring; ++i)
>>         cancel_delayed_work(ring[i]->scheduler....)
>> cancel_work(adev->ras_work);
>> cancel_work(adev->iofault_work);
>> cancel_work(adev->debugfs_work);
>> "

Works for me. I already expected that switching the reset over to be
based on the reset context wouldn't be that easy.

Regards,
Christian.

>>
>> Let me know what you think.
>>
>> Andrey 
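
The cancellation idea quoted above can be sketched in plain userspace C.
This is only a model under stated assumptions, not the amdgpu code: every
name here (mock_work, mock_device, mock_queue_work, mock_cancel_work,
mock_drop_pending_resets, MOCK_NUM_RINGS) is hypothetical, and a work item
is reduced to a single "pending" flag. The boolean return loosely follows
the convention of the kernel's cancel helpers, which report whether the
work was still pending. The sketch shows that once one reset has run, every
other queued reset request for the same device can be dropped in one pass.

```c
#include <stdbool.h>
#include <stddef.h>

#define MOCK_NUM_RINGS 4 /* hypothetical ring count, for illustration only */

/* Hypothetical stand-in for a kernel work item: it only tracks
 * whether a reset request is currently queued. */
struct mock_work {
	bool pending;
};

/* Hypothetical device carrying the reset sources named in the thread. */
struct mock_device {
	struct mock_work ring_timeout[MOCK_NUM_RINGS]; /* per-ring scheduler timeout */
	struct mock_work ras_work;
	struct mock_work iofault_work;
	struct mock_work debugfs_work;
};

/* Mark a reset request as queued. */
static void mock_queue_work(struct mock_work *w)
{
	w->pending = true;
}

/* Clear a queued request; returns true if it was still pending. */
static bool mock_cancel_work(struct mock_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/* Drop every reset request still queued on this device and
 * return how many were dropped. */
static size_t mock_drop_pending_resets(struct mock_device *dev)
{
	size_t dropped = 0;
	size_t i;

	for (i = 0; i < MOCK_NUM_RINGS; ++i)
		dropped += mock_cancel_work(&dev->ring_timeout[i]);
	dropped += mock_cancel_work(&dev->ras_work);
	dropped += mock_cancel_work(&dev->iofault_work);
	dropped += mock_cancel_work(&dev->debugfs_work);
	return dropped;
}
```

In this model, queuing a ring timeout and a RAS request and then calling
mock_drop_pending_resets() once drops both; a second call finds nothing
left to cancel. The model ignores work that is already executing, which is
the part that needs care in a real driver.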


