From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-15.3 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,
	USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 53743C433E0
	for ; Fri, 26 Mar 2021 09:06:54 +0000 (UTC)
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id 0AD1D61A18
	for ; Fri, 26 Mar 2021 09:06:54 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0AD1D61A18
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=arm.com
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=dri-devel-bounces@lists.freedesktop.org
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id 178EF6F38A;
	Fri, 26 Mar 2021 09:06:53 +0000 (UTC)
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by gabe.freedesktop.org (Postfix) with ESMTP id 313BC6F381;
	Fri, 26 Mar 2021 09:06:51 +0000 (UTC)
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 74ACD1474;
	Fri, 26 Mar 2021 02:06:50 -0700 (PDT)
Received: from [192.168.1.179] (unknown [172.31.20.19])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 27FB63F718;
	Fri, 26 Mar 2021 02:06:49 -0700 (PDT)
Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
To: "Zhang, Jack (Jian)" , "dri-devel@lists.freedesktop.org" ,
	"amd-gfx@lists.freedesktop.org" , "Koenig, Christian" ,
	"Grodzovsky, Andrey" , "Liu, Monk" , "Deng, Emily" ,
	Rob Herring , Tomeu Vizoso
References: <20210315052036.1113638-1-Jack.Zhang1@amd.com>
From: Steven Price
Message-ID:
Date: Fri, 26 Mar 2021 09:07:53 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.7.1
MIME-Version: 1.0
In-Reply-To:
Content-Language: en-GB
X-BeenThere: dri-devel@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Direct Rendering Infrastructure - Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"

On 26/03/2021 02:04, Zhang, Jack (Jian) wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Steve,
>
> Thank you for your detailed comments.
>
> But currently the patch is not finalized.
> We found some potential race condition even with this patch. The solution is under discussion, and hopefully we can find an ideal one.
> After that, I will start to consider whether it will influence other DRM drivers (besides amdgpu).

No problem. Please keep me CC'd: the suggestion of using reference counts may be beneficial for Panfrost, as we already build a reference count on top of struct drm_sched_job. So there may be scope for cleaning up Panfrost afterwards even if your work doesn't directly affect it.

Thanks,

Steve

> Best,
> Jack
>
> -----Original Message-----
> From: Steven Price
> Sent: Monday, March 22, 2021 11:29 PM
> To: Zhang, Jack (Jian) ; dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian ; Grodzovsky, Andrey ; Liu, Monk ; Deng, Emily ; Rob Herring ; Tomeu Vizoso
> Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak
>
> On 15/03/2021 05:23, Zhang, Jack (Jian) wrote:
>> [AMD Public Use]
>>
>> Hi, Rob/Tomeu/Steven,
>>
>> Would you please help to review this patch for panfrost driver?
>>
>> Thanks,
>> Jack Zhang
>>
>> -----Original Message-----
>> From: Jack Zhang
>> Sent: Monday, March 15, 2021 1:21 PM
>> To: dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org;
>> Koenig, Christian ; Grodzovsky, Andrey
>> ; Liu, Monk ; Deng, Emily
>>
>> Cc: Zhang, Jack (Jian)
>> Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid
>> memleak
>>
>> re-insert Bailing jobs to avoid memory leak.
>>
>> V2: move re-insert step to drm/scheduler logic
>> V3: add panfrost's return value for bailing jobs in case it hits the
>> memleak issue.
>
> This commit message could do with some work - it's really hard to decipher what the actual problem you're solving is.
>
>>
>> Signed-off-by: Jack Zhang
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++--
>> drivers/gpu/drm/panfrost/panfrost_job.c | 4 ++--
>> drivers/gpu/drm/scheduler/sched_main.c | 8 +++++++-
>> include/drm/gpu_scheduler.h | 1 +
>> 5 files changed, 19 insertions(+), 6 deletions(-)
>>
> [...]
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c
>> b/drivers/gpu/drm/panfrost/panfrost_job.c
>> index 6003cfeb1322..e2cb4f32dae1 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>> @@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>>  	 * spurious. Bail out.
>>  	 */
>>  	if (dma_fence_is_signaled(job->done_fence))
>> -		return DRM_GPU_SCHED_STAT_NOMINAL;
>> +		return DRM_GPU_SCHED_STAT_BAILING;
>>
>>  	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>>  		js,
>> @@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat
>> panfrost_job_timedout(struct drm_sched_job
>>
>>  	/* Scheduler is already stopped, nothing to do. */
>>  	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
>> -		return DRM_GPU_SCHED_STAT_NOMINAL;
>> +		return DRM_GPU_SCHED_STAT_BAILING;
>>
>>  	/* Schedule a reset if there's no reset in progress. */
>>  	if (!atomic_xchg(&pfdev->reset.pending, 1))
>
> This looks correct to me - in these two cases drm_sched_stop() is not called on the sched_job, so it looks like currently the job will be leaked.
>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 92d8de24d0a1..a44f621fb5c4 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>  {
>>  	struct drm_gpu_scheduler *sched;
>>  	struct drm_sched_job *job;
>> +	int ret;
>>
>>  	sched = container_of(work, struct drm_gpu_scheduler,
>> work_tdr.work);
>>
>> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>  		list_del_init(&job->list);
>>  		spin_unlock(&sched->job_list_lock);
>>
>> -		job->sched->ops->timedout_job(job);
>> +		ret = job->sched->ops->timedout_job(job);
>>
>> +		if (ret == DRM_GPU_SCHED_STAT_BAILING) {
>> +			spin_lock(&sched->job_list_lock);
>> +			list_add(&job->node, &sched->ring_mirror_list);
>> +			spin_unlock(&sched->job_list_lock);
>> +		}
>
> I think we could really do with a comment somewhere explaining what "bailing" means in this context. For the Panfrost case we have two cases:
>
> * The GPU job actually finished while the timeout code was running (done_fence is signalled).
>
> * The GPU is already in the process of being reset (Panfrost has multiple queues, so most likely a bad job in another queue).
>
> I'm also not convinced that (for Panfrost) it makes sense to be adding the jobs back to the list. For the first case above clearly the job could just be freed (it's complete). The second case is more interesting and Panfrost currently doesn't handle this well. In theory the driver could try to rescue the job ('soft stop' in Mali language) so that it could be resubmitted. Panfrost doesn't currently support that, so attempting to resubmit the job is almost certainly going to fail.
>
> It's on my TODO list to look at improving Panfrost in this regard, but sadly still quite far down.
>
> Steve
>
>>  		/*
>>  		 * Guilty job did complete and hence needs to be manually removed
>>  		 * See drm_sched_stop doc.
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 4ea8606d91fe..8093ac2427ef 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -210,6 +210,7 @@ enum drm_gpu_sched_stat {
>>  	DRM_GPU_SCHED_STAT_NONE, /* Reserve 0 */
>>  	DRM_GPU_SCHED_STAT_NOMINAL,
>>  	DRM_GPU_SCHED_STAT_ENODEV,
>> +	DRM_GPU_SCHED_STAT_BAILING,
>>  };
>>
>>  /**
>>
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
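
The re-insert step reviewed in this thread can be modelled in user space. What follows is a minimal sketch in plain C, not kernel code: every toy_* name is hypothetical, a singly linked list stands in for the scheduler's job list, and locking is left out entirely. It only illustrates the control flow of the sched_main.c hunk: the timeout handler detaches the first job, asks the driver callback what happened, and puts the job back if the callback reports that it bailed out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel APIs. */
enum toy_sched_stat {
	TOY_STAT_NONE,		/* reserve 0, mirroring the enum in the patch */
	TOY_STAT_NOMINAL,
	TOY_STAT_ENODEV,
	TOY_STAT_BAILING,
};

struct toy_job {
	struct toy_job *next;
	int fence_signaled;	/* models "done_fence is signalled" */
};

struct toy_sched {
	struct toy_job *pending;	/* head of the tracked-job list */
	enum toy_sched_stat (*timedout_job)(struct toy_job *job);
};

/* Number of jobs the scheduler still tracks. */
size_t toy_pending_count(const struct toy_sched *s)
{
	size_t n = 0;
	const struct toy_job *j;

	for (j = s->pending; j; j = j->next)
		n++;
	return n;
}

/* Insert at the head of the list, as list_add() does. */
void toy_push(struct toy_sched *s, struct toy_job *job)
{
	assert(job->next == NULL);	/* job must not already be linked */
	job->next = s->pending;
	s->pending = job;
}

/* Driver callback modelled on the first Panfrost bail-out case. */
enum toy_sched_stat toy_timedout(struct toy_job *job)
{
	if (job->fence_signaled)
		return TOY_STAT_BAILING;	/* spurious timeout, bail out */
	/* Assumption: a real driver's reset path would handle the job here. */
	return TOY_STAT_NOMINAL;
}

/* Models the patched timeout handler: detach, call back, maybe re-insert. */
enum toy_sched_stat toy_handle_timeout(struct toy_sched *s)
{
	struct toy_job *job = s->pending;
	enum toy_sched_stat ret;

	if (!job)
		return TOY_STAT_NONE;

	/* Detach the first job before calling into the driver. */
	s->pending = job->next;
	job->next = NULL;

	ret = s->timedout_job(job);

	/* Without this step a bailing job would no longer be tracked. */
	if (ret == TOY_STAT_BAILING)
		toy_push(s, job);

	return ret;
}
```

With a signalled job at the head of a two-job list, toy_handle_timeout() returns TOY_STAT_BAILING and the list still holds both jobs. If the same job is not signalled, the handler returns TOY_STAT_NOMINAL and the toy scheduler no longer tracks it. Whether re-inserting is the right response for a job that has already completed is the open question raised in the review, and this sketch does not settle it.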