Date: Thu, 19 Nov 2020 16:29:51 +0100
From: Daniel Vetter
To: Andrey Grodzovsky
Subject: Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.
Message-ID: <20201119152951.GD401619@phenom.ffwll.local>
References: <24dd3691-5599-459c-2e5d-a8f2e504ec66@amd.com> <20201117185255.GP401619@phenom.ffwll.local> <20201117194922.GW401619@phenom.ffwll.local> <064ef461-8f59-2eb8-7777-6ff5b8d28cdd@amd.com>
Cc: Michel Dänzer, dri-devel, Pekka Paalanen, amd-gfx list, Daniel Vetter, Alex Deucher, christian.koenig@amd.com

On Thu, Nov 19, 2020 at 10:02:28AM -0500, Andrey Grodzovsky wrote:
>
> On 11/19/20 2:55 AM, Christian König wrote:
> > On 18.11.20 at 17:20, Andrey Grodzovsky wrote:
> > >
> > > On 11/18/20 7:01 AM, Christian König wrote:
> > > > On 18.11.20 at 08:39, Daniel Vetter wrote:
> > > > > On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky wrote:
> > > > > >
> > > > > > On 11/17/20 2:49 PM, Daniel Vetter wrote:
> > > > > > > On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
> > > > > > > > On 11/17/20 1:52 PM, Daniel Vetter wrote:
> > > > > > > > > On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
> > > > > > > > > > On 6/22/20 5:53 AM, Daniel Vetter wrote:
> > > > > > > > > > > On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > > > > > No point to try recovery if device is gone, it just messes things up.
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Andrey Grodzovsky
> > > > > > > > > > > > ---
> > > > > > > > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 ++++++++++++++++
> > > > > > > > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 ++++++++
> > > > > > > > > > > >  2 files changed, 24 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > index 6932d75..5d6d3d9 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
> > > > > > > > > > > >  		return ret;
> > > > > > > > > > > >  }
> > > > > > > > > > > > +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +	int i;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > > > > > > > > > > > +		struct amdgpu_ring *ring = adev->rings[i];
> > > > > > > > > > > > +
> > > > > > > > > > > > +		if (!ring || !ring->sched.thread)
> > > > > > > > > > > > +			continue;
> > > > > > > > > > > > +
> > > > > > > > > > > > +		cancel_delayed_work_sync(&ring->sched.work_tdr);
> > > > > > > > > > > > +	}
> > > > > > > > > > > > +}
> > > > > > > > > > > I think this is a function that's supposed to be in drm/scheduler, not
> > > > > > > > > > > here.
> > > > > > > > > > > Might also just be your cleanup code being ordered wrongly, or your
> > > > > > > > > > > split in one of the earlier patches not done quite right.
> > > > > > > > > > > -Daniel
> > > > > > > > > > This function iterates across all the schedulers per amdgpu device and
> > > > > > > > > > accesses amdgpu-specific structures; drm/scheduler deals with a single
> > > > > > > > > > scheduler at most, so it looks to me like this is the right place for this function.
> > > > > > > > > I guess we could keep track of all schedulers somewhere in a list in
> > > > > > > > > struct drm_device and wrap this up. That was kinda the idea.
> > > > > > > > >
> > > > > > > > > Minimally I think a tiny wrapper with docs for the
> > > > > > > > > cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
> > > > > > > > > observe to make sure there's no race.
> > > > > > > > Will do
> > > > > > > >
> > > > > > > > > I'm not exactly sure there's any
> > > > > > > > > guarantee here we won't get a new tdr work launched right afterwards at
> > > > > > > > > least, so this looks a bit like a hack.
> > > > > > > > Note that for any TDR work happening post amdgpu_cancel_all_tdr,
> > > > > > > > amdgpu_job_timedout->drm_dev_is_unplugged
> > > > > > > > will return true and so it will return early. To make it watertight
> > > > > > > > against races
> > > > > > > > I can switch from drm_dev_is_unplugged to drm_dev_enter/exit.
> > > > > > > Hm that's confusing. You do a cancel_delayed_work_sync, so that at least looks
> > > > > > > like "tdr work must not run after this point".
> > > > > > >
> > > > > > > If you only rely on the drm_dev_enter/exit check within the tdr work, then
> > > > > > > there's no need to cancel anything.
> > > > > >
> > > > > > Agree, synchronize_srcu from drm_dev_unplug should play the role
> > > > > > of 'flushing' any earlier (in progress) tdr work which is
> > > > > > using the drm_dev_enter/exit pair. Any later arising tdr
> > > > > > will terminate early when drm_dev_enter
> > > > > > returns false.
> > > > > Nope, anything you put into the work itself cannot close this race.
> > > > > It's the schedule_work that matters here. Or I'm missing something ...
> > > > > I thought that the tdr work you're cancelling here is launched by
> > > > > drm/scheduler code, not by the amd callback?
> > >
> > > My bad, you are right, I am supposed to put the drm_dev_enter/exit pair
> > > into drm_sched_job_timedout.
> > >
> > > >
> > > > Yes that is correct. Canceling the work item is not the right
> > > > approach at all, nor is adding a dev_enter/exit pair in the
> > > > recovery handler.
> > >
> > > Without adding the dev_enter/exit guarding pair in the recovery
> > > handler you end up with a GPU reset starting while
> > > the device is already unplugged; this leads to multiple errors and general mess.
> > >
> > > >
> > > > What we need to do here is to stop the scheduler thread and then
> > > > wait for any timeout handling to have finished.
> > > >
> > > > Otherwise it can schedule a new timeout just after we have canceled this one.
> > > >
> > > > Regards,
> > > > Christian.
> > >
> > > Schedulers are stopped from amdgpu_driver_unload_kms, which indeed
> > > happens after drm_dev_unplug,
> > > so yes, there is still a chance for new work being scheduled and a
> > > timeout armed after. But once I fix the code
> > > to place the drm_dev_enter/exit pair into drm_sched_job_timedout, I don't
> > > see why that's not a good solution?
> > Yeah that should work as well, but then you also don't need to cancel
> > the work item from the driver.
>
> Indeed, as Daniel pointed out there is no need, and I dropped it. One correction - I
> previously said that w/o the
> dev_enter/exit guarding pair in the scheduler's TO handler you will get a GPU
> reset starting while the device is already gone -
> of course this does not fully prevent that, as the device can be extracted
> at any moment just after we have
> already entered GPU recovery. But it does save us processing a futile GPU
> recovery, which always
> starts once you unplug the device if there are active jobs in progress at
> the moment, and so I think it's
> still justifiable to keep the dev_enter/exit guarding pair there.

Yeah sprinkling drm_dev_enter/exit over the usual suspect code paths like
tdr to make the entire unloading much faster makes sense. Waiting for
enormous amounts of mmio ops to time out isn't fun. A comment might be
good for that though, to explain why we're doing that.
-Daniel

>
> Andrey
>
> > > Any tdr work started after drm_dev_unplug finished will simply abort
> > > on entry to drm_sched_job_timedout,
> > > because drm_dev_enter will be false and the function will return
> > > without rearming the timeout timer, and
> > > so will have no impact.
> > >
> > > The only issue I see here now is possible use-after-free if some
> > > late tdr work tries to execute after the
> > > drm device is already gone; for this we probably should add
> > > cancel_delayed_work_sync(&sched->work_tdr)
> > > to drm_sched_fini after sched->thread is stopped there.
> >
> > Good point, that is indeed missing as far as I can see.
> >
> > Christian.
> >
> > >
> > > Andrey

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch