amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Luben Tuikov <luben.tuikov@amd.com>
To: Daniel Vetter <daniel@ffwll.ch>,
	Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: daniel.vetter@ffwll.ch, michel@daenzer.net,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	ckoenig.leichtzumerken@gmail.com
Subject: Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.
Date: Tue, 17 Nov 2020 19:46:01 -0500	[thread overview]
Message-ID: <3a2004ed-1b82-b36d-4233-4ee107203567@amd.com> (raw)
In-Reply-To: <20201117194922.GW401619@phenom.ffwll.local>

On 2020-11-17 2:49 p.m., Daniel Vetter wrote:
> On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
>>
>> On 11/17/20 1:52 PM, Daniel Vetter wrote:
>>> On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
>>>> On 6/22/20 5:53 AM, Daniel Vetter wrote:
>>>>> On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
>>>>>> No point to try recovery if device is gone, just messes up things.
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 ++++++++++++++++
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 ++++++++
>>>>>>    2 files changed, 24 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> index 6932d75..5d6d3d9 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>>>>>>    	return ret;
>>>>>>    }
>>>>>> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
>>>>>> +{
>>>>>> +	int i;
>>>>>> +
>>>>>> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>> +		struct amdgpu_ring *ring = adev->rings[i];
>>>>>> +
>>>>>> +		if (!ring || !ring->sched.thread)
>>>>>> +			continue;
>>>>>> +
>>>>>> +		cancel_delayed_work_sync(&ring->sched.work_tdr);
>>>>>> +	}
>>>>>> +}
>>>>> I think this is a function that's supposed to be in drm/scheduler, not
>>>>> here. Might also just be your cleanup code being ordered wrongly, or your
>>>>> split in one of the earlier patches not done quite right.
>>>>> -Daniel
>>>>
>>>> This function iterates across all the schedulers  per amdgpu device and accesses
>>>> amdgpu specific structures , drm/scheduler deals with single scheduler at most
>>>> so looks to me like this is the right place for this function
>>> I guess we could keep track of all schedulers somewhere in a list in
>>> struct drm_device and wrap this up. That was kinda the idea.
>>>
>>> Minimally I think a tiny wrapper with docs for the
>>> cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
>>> observe to make sure there's no race.
>>
>>
>> Will do
>>
>>
>>> I'm not exactly sure there's no
>>> guarantee here we won't get a new tdr work launched right afterwards at
>>> least, so this looks a bit like a hack.
>>
>>
>> Note that for any TDR work happening post amdgpu_cancel_all_tdr
>> amdgpu_job_timedout->drm_dev_is_unplugged
>> will return true and so it will return early. To make it water proof tight
>> against race
>> i can switch from drm_dev_is_unplugged to drm_dev_enter/exit
> 
> Hm that's confusing. You do a work_cancel_sync, so that at least looks
> like "tdr work must not run after this point"
> 
> If you only rely on drm_dev_enter/exit check with the tdr work, then
> there's no need to cancel anything.
> 
> For race free cancel_work_sync you need:
> 1. make sure whatever is calling schedule_work is guaranteed to no longer
> call schedule_work.
> 2. call cancel_work_sync
> 
> Anything else is cargo-culted work cleanup:
> 
> - 1. without 2. means if a work got scheduled right before it'll still be
>   a problem.
> - 2. without 1. means a schedule_work right after makes you calling
>   cancel_work_sync pointless.

This is sound advice and I did something similar for SAS over a decade
ago where an expander could be disconnected from the domain via which
many IOs are flying to end devices.

You need a small tiny DRM function which low-level drivers (such as amdgpu)
call in order to tell DRM that this device is not accepting commands
any more (sets a flag) and starts a thread to clean up commands
which are "done" or "incoming". At the same time, the low-level driver
returns commands which are pending in the hardware back out to
DRM (thus those commands become "done" from "pending"), and
DRM cleans them up.(*)

The point is that you're not bubbling up the error, but
directly notifying the highest level of upper layer to hold off,
while you're cleaning up all incoming and pending commands.

Depending on the situation, case 1 above has two sub-cases:

a) the device will not come back--then cancel any new work
   back out to the application client, or
b) the device may come back again, i.e. it is being reset,
   then you can queue up work, assuming the device will
   come back on successfully and you'd be able to send
   the incoming requests down to it. Or cancel everything
   and let the application client do the queueing and
   resubmission, like in a). The latter will not work when this
   resubmission (and error recovery) is done without
   the knowledge of the application client, for instance
   communication or parity errors, protocol retries, etc.

(*) I've some work coming in, in the scheduler, which could make
this handling easier, or at least set a mechanism by which
this could be made easier.

Regards,
Luben

> 
> So either both or nothing.
> -Daniel
> 
>>
>> Andrey
>>
>>
>>> -Daniel
>>>
>>>> Andrey
>>>>
>>>>
>>>>>> +
>>>>>>    static void
>>>>>>    amdgpu_pci_remove(struct pci_dev *pdev)
>>>>>>    {
>>>>>>    	struct drm_device *dev = pci_get_drvdata(pdev);
>>>>>> +	struct amdgpu_device *adev = dev->dev_private;
>>>>>>    	drm_dev_unplug(dev);
>>>>>> +	amdgpu_cancel_all_tdr(adev);
>>>>>>    	ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
>>>>>>    	amdgpu_driver_unload_kms(dev);
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> index 4720718..87ff0c0 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> @@ -28,6 +28,8 @@
>>>>>>    #include "amdgpu.h"
>>>>>>    #include "amdgpu_trace.h"
>>>>>> +#include <drm/drm_drv.h>
>>>>>> +
>>>>>>    static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>>    {
>>>>>>    	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>>>>> @@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>>    	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>>>>>> +	if (drm_dev_is_unplugged(adev->ddev)) {
>>>>>> +		DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
>>>>>> +					  s_job->sched->name);
>>>>>> +		return;
>>>>>> +	}
>>>>>> +
>>>>>>    	if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>>>>>>    		DRM_ERROR("ring %s timeout, but soft recovered\n",
>>>>>>    			  s_job->sched->name);
>>>>>> -- 
>>>>>> 2.7.4
>>>>>>
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  parent reply	other threads:[~2020-11-18  0:46 UTC|newest]

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-21  6:03 [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
2020-06-21  6:03 ` [PATCH v2 1/8] drm: Add dummy page per device or GEM object Andrey Grodzovsky
2020-06-22  9:35   ` Daniel Vetter
2020-06-22 14:21     ` Pekka Paalanen
2020-06-22 14:24       ` Daniel Vetter
2020-06-22 14:28         ` Pekka Paalanen
2020-11-09 20:34     ` Andrey Grodzovsky
2020-11-15  6:39     ` Andrey Grodzovsky
2020-06-22 13:18   ` Christian König
2020-06-22 14:23     ` Daniel Vetter
2020-06-22 14:32     ` Andrey Grodzovsky
2020-06-22 17:45       ` Christian König
2020-06-22 17:50         ` Daniel Vetter
2020-11-09 20:53           ` Andrey Grodzovsky
2020-11-13 20:52           ` Andrey Grodzovsky
2020-11-14  8:41             ` Christian König
2020-11-14  9:51               ` Daniel Vetter
2020-11-14  9:57                 ` Daniel Vetter
2020-11-16  9:42                   ` Michel Dänzer
2020-11-15  6:34                 ` Andrey Grodzovsky
2020-11-16  9:48                   ` Christian König
2020-11-16 19:00                     ` Andrey Grodzovsky
2020-11-16 20:36                       ` Christian König
2020-11-16 20:42                         ` Andrey Grodzovsky
2020-11-19 10:01                           ` Christian König
2020-06-21  6:03 ` [PATCH v2 2/8] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
2020-06-22  9:41   ` Daniel Vetter
2020-06-24  3:31     ` Andrey Grodzovsky
2020-06-24  7:19       ` Daniel Vetter
2020-11-10 17:41     ` Andrey Grodzovsky
2020-06-22 19:30   ` Christian König
2020-06-21  6:03 ` [PATCH v2 3/8] drm/ttm: Add unampping of the entire device address space Andrey Grodzovsky
2020-06-22  9:45   ` Daniel Vetter
2020-06-23  5:00     ` Andrey Grodzovsky
2020-06-23 10:25       ` Daniel Vetter
2020-06-23 12:55         ` Christian König
2020-06-22 19:37   ` Christian König
2020-06-22 19:47   ` Alex Deucher
2020-06-21  6:03 ` [PATCH v2 4/8] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
2020-06-22  9:48   ` Daniel Vetter
2020-11-12  4:19     ` Andrey Grodzovsky
2020-11-12  9:29       ` Daniel Vetter
2020-06-21  6:03 ` [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal Andrey Grodzovsky
2020-06-22  9:51   ` Daniel Vetter
2020-06-22 11:21     ` Greg KH
2020-06-22 16:07       ` Andrey Grodzovsky
2020-06-22 16:45         ` Greg KH
2020-06-23  4:51           ` Andrey Grodzovsky
2020-06-23  6:05             ` Greg KH
2020-06-24  3:04               ` Andrey Grodzovsky
2020-06-24  6:11                 ` Greg KH
2020-06-25  1:52                   ` Andrey Grodzovsky
2020-11-10 17:54                   ` Andrey Grodzovsky
2020-11-10 17:59                     ` Greg KH
2020-11-11 15:13                       ` Andrey Grodzovsky
2020-11-11 15:34                         ` Greg KH
2020-11-11 15:45                           ` Andrey Grodzovsky
2020-11-11 16:06                             ` Greg KH
2020-11-11 16:34                               ` Andrey Grodzovsky
2020-12-02 15:48                           ` Andrey Grodzovsky
2020-12-02 17:34                             ` Greg KH
2020-12-02 18:02                               ` Andrey Grodzovsky
2020-12-02 18:20                                 ` Greg KH
2020-12-02 18:40                                   ` Andrey Grodzovsky
2020-06-22 13:19   ` Christian König
2020-06-21  6:03 ` [PATCH v2 6/8] drm/amdgpu: Unmap entire device address space on device remove Andrey Grodzovsky
2020-06-22  9:56   ` Daniel Vetter
2020-06-22 19:38   ` Christian König
2020-06-22 19:48     ` Alex Deucher
2020-06-23 10:22       ` Daniel Vetter
2020-06-23 13:16         ` Christian König
2020-06-24  3:12           ` Andrey Grodzovsky
2020-06-21  6:03 ` [PATCH v2 7/8] drm/amdgpu: Fix sdma code crash post device unplug Andrey Grodzovsky
2020-06-22  9:55   ` Daniel Vetter
2020-06-22 19:40   ` Christian König
2020-06-23  5:11     ` Andrey Grodzovsky
2020-06-23  7:14       ` Christian König
2020-06-21  6:03 ` [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
2020-06-22  9:53   ` Daniel Vetter
2020-11-17 18:38     ` Andrey Grodzovsky
2020-11-17 18:52       ` Daniel Vetter
2020-11-17 19:18         ` Andrey Grodzovsky
2020-11-17 19:49           ` Daniel Vetter
2020-11-17 20:07             ` Andrey Grodzovsky
2020-11-18  7:39               ` Daniel Vetter
2020-11-18 12:01                 ` Christian König
2020-11-18 15:43                   ` Luben Tuikov
2020-11-18 16:20                   ` Andrey Grodzovsky
2020-11-19  7:55                     ` Christian König
2020-11-19 15:02                       ` Andrey Grodzovsky
2020-11-19 15:29                         ` Daniel Vetter
2020-11-19 21:24                           ` Andrey Grodzovsky
2020-11-18  0:46             ` Luben Tuikov [this message]
2020-06-22  9:46 ` [PATCH v2 0/8] RFC Support hot device unplug in amdgpu Daniel Vetter
2020-06-23  5:14   ` Andrey Grodzovsky
2020-06-23  9:04     ` Michel Dänzer
2020-06-24  3:21       ` Andrey Grodzovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3a2004ed-1b82-b36d-4233-4ee107203567@amd.com \
    --to=luben.tuikov@amd.com \
    --cc=Andrey.Grodzovsky@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=daniel.vetter@ffwll.ch \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=michel@daenzer.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).