From: "Christian König" <christian.koenig@amd.com>
To: "Andrey Grodzovsky" <andrey.grodzovsky@amd.com>,
"Lazar, Lijo" <lijo.lazar@amd.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Felix Kuehling" <felix.kuehling@amd.com>,
amd-gfx@lists.freedesktop.org
Cc: Bai Zoy <Zoy.Bai@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.
Date: Mon, 16 May 2022 17:08:27 +0200
Message-ID: <6cf84065-6341-3f96-4e09-ab71796241ec@amd.com>
In-Reply-To: <cdce4608-9ade-ac23-b957-6d38a3e2b55a@amd.com>
On 16.05.22 at 16:12, Andrey Grodzovsky wrote:
>
> Ping
>
Ah, yes sorry.
> Andrey
>
> On 2022-05-13 11:41, Andrey Grodzovsky wrote:
>>> Yes, exactly that's the idea.
>>>
>>> Basically the reset domain knows which amdgpu devices it needs to
>>> reset together.
>>>
>>> Whether you then represent that by always having a hive, even when
>>> you only have one device in it, or by putting an array of the devices
>>> which need to be reset together into the reset domain doesn't matter.
>>>
>>> Maybe go for the latter approach, that is probably a bit cleaner and
>>> less code to change.
>>>
>>> Christian.
>>
>>
>> Unfortunately this approach also raises a few difficulties -
>> First - if the reset_domain holds an array of devices, then when you
>> come to the GPU reset function you don't really know which adev is the
>> one that triggered the reset, and this is actually essential to some
>> procedures like emergency restart.
What is "emergency restart"? That's not some requirement I know about.
>>
>> Second - in the XGMI case we must take into account that one of the hive
>> members might go away at runtime (I could do echo 1 >
>> /sysfs/pci_id/remove on it for example at any moment) - so now we
>> need to maintain this array and probably mark such an entry with NULL on
>> XGMI node removal, and then there might be hot insertion, and all
>> this adds more complications.
>>
>> I now tend to prefer your initial solution for its simplicity and
>> the result will be what we need -
>>
>> "E.g. in the reset code (either before or after the reset, that's
>> debatable) you do something like this:
>>
>> for (i = 0; i < num_ring; ++i)
>>         cancel_delayed_work(ring[i]->scheduler....)
>> cancel_work(adev->ras_work);
>> cancel_work(adev->iofault_work);
>> cancel_work(adev->debugfs_work);
>> "
Works for me. I already expected that switching the reset over to be
based on the reset context wouldn't be that easy.
Regards,
Christian.
>>
>> Let me know what you think.
>>
>> Andrey
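
For illustration only, the "cancel whatever reset work is still pending" idea
quoted above could be sketched in plain user-space C roughly as follows. This
is not amdgpu code: struct mock_work, struct mock_adev, MOCK_NUM_RINGS,
mock_cancel_work() and mock_cancel_pending_resets() are hypothetical stand-ins
that only mirror the names in the pseudocode (per-ring scheduler timeout work,
ras_work, iofault_work, debugfs_work). The real driver would use the kernel
workqueue helpers instead of a boolean flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_NUM_RINGS 4	/* hypothetical ring count, for the sketch only */

/* Hypothetical stand-in for a work item: only tracks "queued, not yet run". */
struct mock_work {
	bool pending;
};

/* Hypothetical stand-in for a device carrying the reset sources named above. */
struct mock_adev {
	struct mock_work ring_timeout[MOCK_NUM_RINGS];	/* per-ring scheduler timeout */
	struct mock_work ras_work;
	struct mock_work iofault_work;
	struct mock_work debugfs_work;
};

/* Clear the pending flag; report whether the work had been queued. */
static bool mock_cancel_work(struct mock_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/*
 * Drop every reset request still queued for this device so that one
 * completed reset is not followed by redundant ones.
 * Returns the number of requests that were cancelled.
 */
static int mock_cancel_pending_resets(struct mock_adev *adev)
{
	int cancelled = 0;
	size_t i;

	assert(adev != NULL);

	for (i = 0; i < MOCK_NUM_RINGS; ++i)
		cancelled += mock_cancel_work(&adev->ring_timeout[i]);

	cancelled += mock_cancel_work(&adev->ras_work);
	cancelled += mock_cancel_work(&adev->iofault_work);
	cancelled += mock_cancel_work(&adev->debugfs_work);

	return cancelled;
}
```

In an XGMI hive the same call would presumably be repeated for every device
that was reset together. Whether it runs before or after the actual reset is
left open here, as it is in the discussion above.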