From: "Chen, Guchun" <Guchun.Chen@amd.com>
To: "Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Koenig, Christian" <Christian.Koenig@amd.com>,
"Pan, Xinhui" <Xinhui.Pan@amd.com>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>
Subject: RE: [PATCH] drm/amdgpu: add missed write lock for pci detected state pci_channel_io_normal
Date: Sat, 2 Oct 2021 15:20:17 +0000
Message-ID: <DM5PR12MB246998ECCF8122818696AF20F1AC9@DM5PR12MB2469.namprd12.prod.outlook.com>
In-Reply-To: <DM5PR12MB2469490BBE6ED3A83D542004F1AB9@DM5PR12MB2469.namprd12.prod.outlook.com>
[Public]
Hi Andrey,
A new patch with the subject "drm/amdgpu: handle the case of pci_channel_io_frozen only in amdgpu_pci_resume" has been sent; please review it. Thanks.
Regards,
Guchun
-----Original Message-----
From: Chen, Guchun
Sent: Friday, October 1, 2021 11:21 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; amd-gfx@lists.freedesktop.org; Koenig, Christian <Christian.Koenig@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: RE: [PATCH] drm/amdgpu: add missed write lock for pci detected state pci_channel_io_normal
[Public]
Got your point. Will send a new patch to address this.
Regards,
Guchun
-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Sent: Friday, October 1, 2021 10:29 PM
To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Koenig, Christian <Christian.Koenig@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
Subject: Re: [PATCH] drm/amdgpu: add missed write lock for pci detected state pci_channel_io_normal
No, scheduler restart and device unlock must take place in amdgpu_pci_resume (see struct pci_error_handlers for the various states of PCI recovery). So just add a flag (probably in amdgpu_device) so we can remember which pci_channel_state_t we came from (unfortunately it's not passed to us in amdgpu_pci_resume), and unless it's set, don't do anything in amdgpu_pci_resume.
Andrey
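
The flag approach suggested above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the patch that was later sent: every mock_* name, the pci_channel_state field, and the simplified enums are hypothetical stand-ins for the kernel's pci_channel_state_t, pci_ers_result_t, and struct amdgpu_device, and the lock and scheduler are modeled as booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel enums; NOT the real definitions. */
enum mock_channel_state { MOCK_IO_NORMAL, MOCK_IO_FROZEN, MOCK_IO_PERM_FAILURE };
enum mock_ers_result { MOCK_CAN_RECOVER, MOCK_NEED_RESET, MOCK_DISCONNECT };

/* Hypothetical model of the device: booleans replace the real lock/scheduler. */
struct mock_adev {
	bool lock_held;      /* models the write lock taken for recovery */
	bool sched_stopped;  /* models the schedulers being stopped */
	enum mock_channel_state pci_channel_state; /* assumed "flag" field */
};

static void mock_lock(struct mock_adev *adev)
{
	assert(!adev->lock_held);
	adev->lock_held = true;
}

static void mock_unlock(struct mock_adev *adev)
{
	/* Releasing a lock that was never taken is the bug under discussion. */
	assert(adev->lock_held);
	adev->lock_held = false;
}

/* Remember the state we were called with so the resume step can check it. */
enum mock_ers_result mock_pci_error_detected(struct mock_adev *adev,
					     enum mock_channel_state state)
{
	adev->pci_channel_state = state;

	switch (state) {
	case MOCK_IO_NORMAL:
		/* Nothing is locked or stopped on this path. */
		return MOCK_CAN_RECOVER;
	case MOCK_IO_FROZEN:
		mock_lock(adev);
		adev->sched_stopped = true;
		return MOCK_NEED_RESET;
	case MOCK_IO_PERM_FAILURE:
		return MOCK_DISCONNECT;
	}
	return MOCK_NEED_RESET;
}

/* Only undo what the frozen path did; otherwise do nothing. */
void mock_pci_resume(struct mock_adev *adev)
{
	if (adev->pci_channel_state != MOCK_IO_FROZEN)
		return;

	adev->sched_stopped = false; /* stands in for resubmit + scheduler start */
	mock_unlock(adev);
}
```

With this guard, a resume that follows the MOCK_IO_NORMAL path returns early and never reaches mock_unlock, so the lock is not released without having been acquired and the schedulers are not restarted without having been stopped.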
On 2021-10-01 4:21 a.m., Chen, Guchun wrote:
> [Public]
>
> Hi Andrey,
>
> Do you mean to move the code of drm_sched_resubmit_jobs and drm_sched_start in amdgpu_pci_resume to amdgpu_pci_error_detected, under the case pci_channel_io_frozen?
> Then leave amdgpu_pci_resume as a null function, and in this way, we can drop the acquire/lock write lock for case of pci_channel_io_normal as well?
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: Friday, October 1, 2021 10:22 AM
> To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org;
> Koenig, Christian <Christian.Koenig@amd.com>; Pan, Xinhui
> <Xinhui.Pan@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: add missed write lock for pci
> detected state pci_channel_io_normal
>
> On 2021-09-30 10:00 p.m., Guchun Chen wrote:
>
>> When a PCI error state pci_channel_io_normal is detected, it will
>> report PCI_ERS_RESULT_CAN_RECOVER status to PCI driver, and PCI
>> driver will continue the execution of PCI resume callback
>> report_resume by pci_walk_bridge, and the callback will go into
>> amdgpu_pci_resume finally, where write lock is released
>> unconditionally without acquiring such lock.
>
> Good catch, but the issue is even wider in scope: what about drm_sched_resubmit_jobs and drm_sched_start being called without the schedulers having been stopped before? Better to put the entire scope of code in this function under a flag that is set only in pci_channel_io_frozen. As far as I remember, we don't need to do anything in the case of pci_channel_io_normal.
>
> Andrey
>
>
>> Fixes: c9a6b82f45e2("drm/amdgpu: Implement DPC recovery")
>> Signed-off-by: Guchun Chen <guchun.chen@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index bb5ad2b6ca13..12f822d51de2 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5370,6 +5370,7 @@ pci_ers_result_t
>> amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>>
>> switch (state) {
>> case pci_channel_io_normal:
>> + amdgpu_device_lock_adev(adev, NULL);
>> return PCI_ERS_RESULT_CAN_RECOVER;
>> /* Fatal error, prepare for slot reset */
>> case pci_channel_io_frozen:
Thread overview: 6+ messages
2021-10-01 2:00 [PATCH] drm/amdgpu: add missed write lock for pci detected state pci_channel_io_normal Guchun Chen
2021-10-01 2:21 ` Andrey Grodzovsky
2021-10-01 8:21 ` Chen, Guchun
2021-10-01 14:28 ` Andrey Grodzovsky
2021-10-01 15:21 ` Chen, Guchun
2021-10-02 15:20 ` Chen, Guchun [this message]