From: Daniel Vetter <daniel.vetter@ffwll.ch>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
dri-devel <dri-devel@lists.freedesktop.org>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Christian König" <christian.koenig@amd.com>
Subject: Re: [git pull] drm for 6.1-rc1
Date: Fri, 7 Oct 2022 10:16:30 +0200
Message-ID: <CAKMK7uF_fKs=Ge5b3sCxa3YgWFaJsLBdCQVj+fDn6ukh9GvKKA@mail.gmail.com>
In-Reply-To: <CAHk-=wgghR4N-4XWjoK18NDkvjBL7i00ab8+otQg955pNGG_dQ@mail.gmail.com>
On Fri, 7 Oct 2022 at 01:45, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> >
> > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>
> As far as I can tell, that's the line
>
> struct drm_gpu_scheduler *sched = s_fence->sched;
>
> where 's_fence' is NULL. The code is
>
> 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> 5: 41 54 push %r12
> 7: 55 push %rbp
> 8: 53 push %rbx
> 9: 48 89 fb mov %rdi,%rbx
> c:* 48 8b af 88 00 00 00 mov 0x88(%rdi),%rbp <-- trapping instruction
> 13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp)
> 1a: 48 8b 85 80 01 00 00 mov 0x180(%rbp),%rax
>
> and that next 'lock decl' instruction would have been the
>
> atomic_dec(&sched->hw_rq_count);
>
> at the top of drm_sched_job_done().
>
> Now, as to *why* you'd have a NULL s_fence, it would seem that
> drm_sched_job_cleanup() was called with an active job. Looking at that
> code, it does
>
>     if (kref_read(&job->s_fence->finished.refcount)) {
>             /* drm_sched_job_arm() has been called */
>             dma_fence_put(&job->s_fence->finished);
>     ...
>
> but then it does
>
> job->s_fence = NULL;
>
> anyway, despite the job still being active. The logic of that kind of
> "fake refcount" escapes me. The above looks fundamentally racy, not to
> say pointless and wrong (a refcount is a _count_, not a flag, so there
> could be multiple references to it, what says that you can just
> decrement one of them and say "I'm done").

Just figured I'd clarify this, because it's indeed a bit wtf and the
comment doesn't explain much. drm_sched_job_cleanup() can be called both
when a real job is being cleaned up (which holds a full reference on
job->s_fence and needs to drop it) and to simplify the error paths in
job construction (where the "is this refcount initialized already" check
signals what exactly needs to be cleaned up). So no race, because the
only time this check goes the other way is when job construction has
failed before the job struct is visible to any other thread.

But yeah, the comment could actually explain what's going on here :-)
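
To make the two callers concrete, here is a minimal userspace sketch of the
pattern, not the scheduler code itself. All names (struct fence, job_init,
job_arm, job_cleanup, fences_freed) are hypothetical stand-ins, and the
"never armed" branch that frees the fence directly is an assumption about
what the elided part of the quoted code does:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the scheduler fence; not the kernel type. */
struct fence {
	int refcount;		/* stays 0 until the job is armed */
};

struct job {
	struct fence *s_fence;
};

static int fences_freed;	/* bookkeeping for the demo only */

static void fence_put(struct fence *f)
{
	if (--f->refcount == 0) {
		free(f);
		fences_freed++;
	}
}

/* Construction step: allocates the fence but does not publish the job. */
static int job_init(struct job *job)
{
	job->s_fence = calloc(1, sizeof(*job->s_fence));
	return job->s_fence ? 0 : -1;
}

/* Point of no return: the job now holds a real reference on its fence. */
static void job_arm(struct job *job)
{
	job->s_fence->refcount = 1;
}

/*
 * Shared by normal teardown and by construction error paths.
 * A non-zero refcount means "armed", so drop the job's reference.
 * A zero refcount means construction failed before any other thread
 * could see the job, so the fence can be freed directly.
 */
static void job_cleanup(struct job *job)
{
	assert(job->s_fence != NULL);

	if (job->s_fence->refcount) {
		fence_put(job->s_fence);
	} else {
		free(job->s_fence);
		fences_freed++;
	}
	job->s_fence = NULL;
}
```

Calling job_init() followed by job_cleanup() models the error path, while
job_init(), job_arm(), job_cleanup() models tearing down a real job. In this
sketch the armed job holds the only reference, so both paths end up freeing
the fence; with additional reference holders, only the last put would.
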

And yeah, the patch Dave reverted screws up the cascade of references
that ensures this all stays alive until drm_sched_job_cleanup() is
called on active jobs, so the revert looks reasonable to me. Some KUnit
tests maybe, to exercise these corners? Not the first time pure
scheduler code blew up, so probably worth the effort.
-Daniel
>
> Now, _why_ any of that happens, I have no idea. I'm just looking at
> the immediate "that pointer is NULL" thing, and reacting to what looks
> like a completely bogus refcount pattern.
>
> But that odd refcount pattern isn't new, so it's presumably some user
> on the amd gpu side that changed.
>
> The problem hasn't happened again for me, but that's not saying a lot,
> since it was very random to begin with.
>
> Linus
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch