dri-devel.lists.freedesktop.org archive mirror
From: "Christian König" <christian.koenig@amd.com>
To: Dave Airlie <airlied@gmail.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Arvind.Yadav@amd.com
Cc: Alex Deucher <alexander.deucher@amd.com>,
	LKML <linux-kernel@vger.kernel.org>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	Daniel Vetter <daniel.vetter@ffwll.ch>
Subject: Re: [git pull] drm for 6.1-rc1
Date: Fri, 7 Oct 2022 08:11:26 +0200
Message-ID: <d6f082eb-8948-8dde-6813-371cf6b1c7a3@amd.com>
In-Reply-To: <CAPM=9txE+0EH2Tv_0toDD52j0JO7iDZoJap6qmvMAnRaDRwJNg@mail.gmail.com>

On 07.10.22 at 04:45, Dave Airlie wrote:
> On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote:
>>>
>>> [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
>>> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>> As far as I can tell, that's the line
>>
>>          struct drm_gpu_scheduler *sched = s_fence->sched;
>>
>> where 's_fence' is NULL. The code is
>>
>>     0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
>>     5: 41 54                push   %r12
>>     7: 55                    push   %rbp
>>     8: 53                    push   %rbx
>>     9: 48 89 fb              mov    %rdi,%rbx
>>     c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp <-- trapping instruction
>>    13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp)
>>    1a: 48 8b 85 80 01 00 00 mov    0x180(%rbp),%rax
>>
>> and that next 'lock decl' instruction would have been the
>>
>>          atomic_dec(&sched->hw_rq_count);
>>
>> at the top of drm_sched_job_done().
>>
>> Now, as to *why* you'd have a NULL s_fence, it would seem that
>> drm_sched_job_cleanup() was called with an active job. Looking at that
>> code, it does
>>
>>          if (kref_read(&job->s_fence->finished.refcount)) {
>>                  /* drm_sched_job_arm() has been called */
>>                  dma_fence_put(&job->s_fence->finished);
>>          ...
>>
>> but then it does
>>
>>          job->s_fence = NULL;
>>
>> anyway, despite the job still being active. The logic of that kind of
>> "fake refcount" escapes me. The above looks fundamentally racy, not to
>> say pointless and wrong (a refcount is a _count_, not a flag, so there
>> could be multiple references to it, what says that you can just
>> decrement one of them and say "I'm done").
>>
>> Now, _why_ any of that happens, I have no idea. I'm just looking at
>> the immediate "that pointer is NULL" thing, and reacting to what looks
>> like a completely bogus refcount pattern.
>>
>> But that odd refcount pattern isn't new, so it's presumably some user
>> on the amd gpu side that changed.
>>
>> The problem hasn't happened again for me, but that's not saying a lot,
>> since it was very random to begin with.
> I chased down the culprit to a drm sched patch, I'll send you a pull
> with a revert in it.
>
> commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
> Author: Arvind Yadav <Arvind.Yadav@amd.com>
> Date:   Wed Sep 14 22:13:20 2022 +0530
>
>      drm/sched: Use parent fence instead of finished
>
>      Using the parent fence instead of the finished fence
>      to get the job status. This change is to avoid GPU
>      scheduler timeout error which can cause GPU reset.
>
>      Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
>      Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>      Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
>      Signed-off-by: Christian König <christian.koenig@amd.com>
>
> I'll let Arvind and Christian maybe work out what is going wrong there.

That's a known issue that Arvind has been investigating for a while.

Any idea how you triggered it on boot? We have only been able to trigger
it very sporadically.

Reverting the patch for now sounds like a good idea to me; it's only a
cleanup anyway.

Thanks,
Christian.

>
> Dave.
>
>>                   Linus
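
The refcount concern raised in the quoted analysis can be illustrated with a
small user-space model. This is only a sketch under stated assumptions:
model_fence, model_job, model_job_cleanup() and model_job_done() are
hypothetical stand-ins invented here, not the kernel's drm_sched structures
or functions, and the real code paths run concurrently rather than in
sequence.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted fence; not the kernel layout. */
struct model_fence {
	atomic_int refcount;	/* plays the role of the fence's kref */
	int hw_rq_count;	/* plays the role of sched->hw_rq_count */
};

/* Hypothetical stand-in for a scheduler job. */
struct model_job {
	struct model_fence *s_fence;
};

/*
 * Same shape as the quoted cleanup path: the refcount is tested as if it
 * were a flag, one reference is dropped, and the pointer is cleared no
 * matter how many other references are still outstanding.
 */
static void model_job_cleanup(struct model_job *job)
{
	if (atomic_load(&job->s_fence->refcount) != 0)
		atomic_fetch_sub(&job->s_fence->refcount, 1);
	job->s_fence = NULL;
}

/*
 * Completion path. The kernel code dereferences the fence pointer
 * unconditionally; this model reports the NULL case instead of crashing.
 */
static bool model_job_done(struct model_job *job)
{
	struct model_fence *fence = job->s_fence;

	if (fence == NULL)
		return false;	/* the point where the real code would oops */
	fence->hw_rq_count--;
	return true;
}
```

Called in the order done, cleanup, done on a fence that starts with two
references, the second model_job_done() call finds a NULL s_fence even
though one reference is still outstanding, which is the situation the oops
above suggests.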

