From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: "Rob Clark" <robdclark@gmail.com>,
	"Christian König" <christian.koenig@amd.com>
Cc: dri-devel@lists.freedesktop.org,
	Rob Clark <robdclark@chromium.org>,
	Luben Tuikov <luben.tuikov@amd.com>,
	Daniel Vetter <daniel@ffwll.ch>,
	Sumit Semwal <sumit.semwal@linaro.org>,
	open list <linux-kernel@vger.kernel.org>,
	"open list:DMA BUFFER SHARING FRAMEWORK" 
	<linux-media@vger.kernel.org>,
	"moderated list:DMA BUFFER SHARING FRAMEWORK" 
	<linaro-mm-sig@lists.linaro.org>
Subject: Re: [Linaro-mm-sig] Re: [RFC] drm/scheduler: Unwrap job dependencies
Date: Wed, 6 Dec 2023 10:04:48 +0100	[thread overview]
Message-ID: <1d336117-a94f-4b79-bc71-be9c24a0246a@gmail.com> (raw)
In-Reply-To: <CAF6AEGs5uh1sRDzz7xeDr5xZrXdtg7eoWJhPhRgqhcqAeTX1Jg@mail.gmail.com>



On 05.12.23 at 18:14, Rob Clark wrote:
> On Tue, Dec 5, 2023 at 8:56 AM Rob Clark <robdclark@gmail.com> wrote:
>> On Tue, Dec 5, 2023 at 7:58 AM Christian König <christian.koenig@amd.com> wrote:
>>> On 05.12.23 at 16:41, Rob Clark wrote:
>>>> On Mon, Dec 4, 2023 at 10:46 PM Christian König
>>>> <christian.koenig@amd.com> wrote:
>>>>> On 04.12.23 at 22:54, Rob Clark wrote:
>>>>>> On Thu, Mar 23, 2023 at 2:30 PM Rob Clark <robdclark@gmail.com> wrote:
>>>>>>> [SNIP]
>>>>>> So, this patch turns out to blow up spectacularly with dma_fence
>>>>>> refcnt underflows when I enable DRIVER_SYNCOBJ_TIMELINE .. I think,
>>>>>> because it starts unwrapping fence chains, possibly in parallel with
>>>>>> fence signaling on the retire path.  Is it supposed to be permissible
>>>>>> to unwrap a fence chain concurrently?
>>>>> The DMA-fence chain object and helper functions were designed so that
>>>>> concurrent accesses to all elements are always possible.
>>>>>
>>>>> See dma_fence_chain_walk() and dma_fence_chain_get_prev() for example.
>>>>> dma_fence_chain_walk() starts with a reference to the current fence (the
>>>>> anchor of the walk) and tries to grab an up-to-date reference on the
>>>>> previous fence in the chain. Only after that reference is successfully
>>>>> acquired do we drop the reference to the anchor where we started.
>>>>>
>>>>> The same applies to dma_fence_array_first() and dma_fence_array_next().
>>>>> Here we hold a reference to the array, which in turn holds references to
>>>>> each fence inside the array until the array itself is destroyed.
>>>>>
>>>>> When this blows up, we have somehow mixed up the references somewhere.
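
As a minimal sketch of that reference discipline (purely illustrative, not
taken from any driver; the helper name find_first_unsignaled() is made up),
walking a chain with the in-tree dma_fence_chain_for_each() macro looks
roughly like this:

#include <linux/dma-fence.h>
#include <linux/dma-fence-chain.h>

/*
 * Return the first unsignaled fence in a chain, or NULL.  The iterator
 * macro holds its own reference to the current chain element and drops it
 * when advancing, so we must take an extra reference on anything we want
 * to keep and drop the iterator's reference when breaking out early.
 */
static struct dma_fence *find_first_unsignaled(struct dma_fence *head)
{
	struct dma_fence *iter;

	dma_fence_chain_for_each(iter, head) {
		/* The fence contained in this chain node (or iter itself). */
		struct dma_fence *f = dma_fence_chain_contained(iter);

		if (!dma_fence_is_signaled(f)) {
			dma_fence_get(f);	/* our own reference */
			dma_fence_put(iter);	/* drop the loop's reference */
			return f;
		}
	}

	return NULL;
}
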
>>>> That's what it looked like to me, but I wanted to make sure I wasn't
>>>> overlooking something subtle.  And in this case, the fence actually
>>>> should be the syncobj timeline point fence, not the fence chain.
>>>> Virtgpu has essentially the same logic (there we really do want to
>>>> unwrap fences so we can pass host fences back to the host rather than
>>>> waiting in the guest); I'm not sure whether it would blow up in the same way.
>>> Well, do you have a backtrace of what exactly happens?
>>>
>>> Maybe we have some _put() before _get() or something like this.
>> I hacked up something to store the backtrace in dma_fence_release()
>> (and leak the block so the backtrace would still be around when
>> dma_fence_get/put was later called) and ended up with:
>>
>> [  152.811360] freed at:
>> [  152.813718]  dma_fence_release+0x30/0x134
>> [  152.817865]  dma_fence_put+0x38/0x98 [gpu_sched]
>> [  152.822657]  drm_sched_job_add_dependency+0x160/0x18c [gpu_sched]
>> [  152.828948]  drm_sched_job_add_syncobj_dependency+0x58/0x88 [gpu_sched]
>> [  152.835770]  msm_ioctl_gem_submit+0x580/0x1160 [msm]
>> [  152.841070]  drm_ioctl_kernel+0xec/0x16c
>> [  152.845132]  drm_ioctl+0x2e8/0x3f4
>> [  152.848646]  vfs_ioctl+0x30/0x50
>> [  152.851982]  __arm64_sys_ioctl+0x80/0xb4
>> [  152.856039]  invoke_syscall+0x8c/0x120
>> [  152.859919]  el0_svc_common.constprop.0+0xc0/0xdc
>> [  152.864777]  do_el0_svc+0x24/0x30
>> [  152.868207]  el0_svc+0x8c/0xd8
>> [  152.871365]  el0t_64_sync_handler+0x84/0x12c
>> [  152.875771]  el0t_64_sync+0x190/0x194
>>
>> I suppose that doesn't guarantee that this was the problematic put.
>> But dropping this patch to unwrap the fence makes the problem go
>> away..
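
The sort of hack described above could look roughly like this (struct and
function names are invented for illustration; only stack_trace_save() and
stack_trace_print() are real kernel interfaces):

#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/stacktrace.h>

#define FREE_TRACE_DEPTH 16

struct fence_free_trace {
	unsigned long entries[FREE_TRACE_DEPTH];
	unsigned int nr;
};

/*
 * Hooked into dma_fence_release(): record where the fence was freed and
 * deliberately leak the record so it is still around when a stale
 * dma_fence_get()/dma_fence_put() trips over the dead fence later.
 */
static void record_fence_free_trace(struct dma_fence *fence)
{
	struct fence_free_trace *t = kzalloc(sizeof(*t), GFP_ATOMIC);

	if (!t)
		return;
	t->nr = stack_trace_save(t->entries, FREE_TRACE_DEPTH, 1);
	/*
	 * Stash 't' somewhere keyed by the fence pointer and dump it with
	 * stack_trace_print(t->entries, t->nr, 2) from the refcount
	 * underflow path.
	 */
}
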
> Oh, hmm, _add_dependency() is consuming the fence reference

Yeah, I was just about to point that out as well :)

Should be trivial to fix,
Christian
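
For reference, a sketch of the kind of fix implied here (not the actual
follow-up patch): drm_sched_job_add_dependency() consumes one fence
reference per call, so each unwrapped fence needs its own dma_fence_get()
before being handed over, while the caller's reference to the container is
dropped exactly once. The helper name add_unwrapped_dependencies() is
invented for illustration:

#include <linux/dma-fence.h>
#include <linux/dma-fence-unwrap.h>
#include <drm/gpu_scheduler.h>

static int add_unwrapped_dependencies(struct drm_sched_job *job,
				      struct dma_fence *fence)
{
	struct dma_fence_unwrap iter;
	struct dma_fence *f;
	int ret = 0;

	dma_fence_unwrap_for_each(f, &iter, fence) {
		/* add_dependency() consumes a reference, so take one. */
		dma_fence_get(f);
		ret = drm_sched_job_add_dependency(job, f);
		if (ret)
			break;
	}

	/* We still own the caller's reference to the container itself. */
	dma_fence_put(fence);
	return ret;
}
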

>
> BR,
> -R
>
>> BR,
>> -R
>>
>>> Thanks,
>>> Christian.
>>>
>>>> BR,
>>>> -R
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> BR,
>>>>>> -R


Thread overview:
2023-03-22 22:44 [RFC] drm/scheduler: Unwrap job dependencies Rob Clark
2023-03-23  7:35 ` Christian König
2023-03-23 13:54   ` Rob Clark
2023-03-23 14:03     ` Christian König
2023-03-23 21:30       ` Rob Clark
2023-12-04 21:54         ` Rob Clark
2023-12-05  6:46           ` Christian König
2023-12-05 15:41             ` Rob Clark
2023-12-05 15:58               ` Christian König
2023-12-05 16:56                 ` Rob Clark
2023-12-05 17:14                   ` Rob Clark
2023-12-06  9:04                     ` Christian König [this message]
