From: Daniel Vetter <daniel.vetter@ffwll.ch>
To: Linus Torvalds <torvalds@linux-foundation.org>,
Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
dri-devel <dri-devel@lists.freedesktop.org>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Christian König" <christian.koenig@amd.com>
Subject: Re: [git pull] drm for 6.1-rc1
Date: Fri, 7 Oct 2022 11:28:12 +0200
Message-ID: <CAKMK7uHsZejvVN1RcS23YsFhb4JvuScpHys17Vn+A7PirE+q1A@mail.gmail.com>
In-Reply-To: <CAKMK7uF_fKs=Ge5b3sCxa3YgWFaJsLBdCQVj+fDn6ukh9GvKKA@mail.gmail.com>
Forgot to add Andrey as scheduler maintainer.
-Daniel
On Fri, 7 Oct 2022 at 10:16, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Fri, 7 Oct 2022 at 01:45, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote:
> > >
> > >
> > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> >
> > As far as I can tell, that's the line
> >
> > struct drm_gpu_scheduler *sched = s_fence->sched;
> >
> > where 's_fence' is NULL. The code is
> >
> >    0:  0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> >    5:  41 54                   push   %r12
> >    7:  55                      push   %rbp
> >    8:  53                      push   %rbx
> >    9:  48 89 fb                mov    %rdi,%rbx
> >    c:* 48 8b af 88 00 00 00    mov    0x88(%rdi),%rbp   <-- trapping instruction
> >   13:  f0 ff 8d f0 00 00 00    lock decl 0xf0(%rbp)
> >   1a:  48 8b 85 80 01 00 00    mov    0x180(%rbp),%rax
> >
> > and that next 'lock decl' instruction would have been the
> >
> > atomic_dec(&sched->hw_rq_count);
> >
> > at the top of drm_sched_job_done().
> >
> > Now, as to *why* you'd have a NULL s_fence, it would seem that
> > drm_sched_job_cleanup() was called with an active job. Looking at that
> > code, it does
> >
> >         if (kref_read(&job->s_fence->finished.refcount)) {
> >                 /* drm_sched_job_arm() has been called */
> >                 dma_fence_put(&job->s_fence->finished);
> >         ...
> >
> > but then it does
> >
> > job->s_fence = NULL;
> >
> > anyway, despite the job still being active. The logic of that kind of
> > "fake refcount" escapes me. The above looks fundamentally racy, not to
> > say pointless and wrong (a refcount is a _count_, not a flag, so there
> > could be multiple references to it, what says that you can just
> > decrement one of them and say "I'm done").
>
> Just figured I'd clarify this, because it's indeed a bit wtf and the
> comment doesn't explain much. drm_sched_job_cleanup() can be called both
> when a real job is being cleaned up (which holds a full reference on
> job->s_fence and needs to drop it) and to simplify the error paths in job
> construction (where the "is this refcount initialized already" check
> signals what exactly needs to be cleaned up). So no race, because the
> only time this check goes the other way is when job construction has
> failed before the job struct is visible to any other thread.
>
> But yeah the comment could actually explain what's going on here :-)
>
> And yeah the patch Dave reverted screws up the cascade of references
> that ensures this all stays alive until drm_sched_job_cleanup is
> called on active jobs, so it all looks reasonable to me. Some KUnit tests
> maybe to exercise these corners? Not the first time pure scheduler
> code blew up, so probably worth the effort.
> -Daniel
>
> >
> > Now, _why_ any of that happens, I have no idea. I'm just looking at
> > the immediate "that pointer is NULL" thing, and reacting to what looks
> > like a completely bogus refcount pattern.
> >
> > But that odd refcount pattern isn't new, so it's presumably some user
> > on the amd gpu side that changed.
> >
> > The problem hasn't happened again for me, but that's not saying a lot,
> > since it was very random to begin with.
> >
> > Linus
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
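
For anyone following along, here is a minimal user-space sketch of the pattern discussed in the quoted reply: one cleanup function serving both the normal teardown of an armed job and the error path of a job that never got armed, with "is the refcount nonzero" used as the "was it armed" test. All toy_* names and both structs are hypothetical stand-ins, not the real drm_sched structures or API, and what the unarmed branch does (a plain free) is an assumption for illustration. The sketch is only safe for the reason given in the reply: an unarmed job is not yet visible to any other thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins, not the real kernel structures. */
struct toy_fence {
	int refcount;		/* stays 0 until the job is armed */
};

struct toy_job {
	struct toy_fence *fence;
};

/* Construction step 1: allocate the fence; refcount is still 0. */
static int toy_job_init(struct toy_job *job)
{
	job->fence = calloc(1, sizeof(*job->fence));
	return job->fence ? 0 : -1;
}

/* Construction step 2: arming initializes the refcount. From here on
 * the job may be visible to others and holds one full reference. */
static void toy_job_arm(struct toy_job *job)
{
	job->fence->refcount = 1;
}

/* Drop one reference and free the fence when the last one goes away. */
static void toy_fence_put(struct toy_fence *fence)
{
	if (--fence->refcount == 0)
		free(fence);
}

/* Shared cleanup. Returns true if the armed path was taken. */
static bool toy_job_cleanup(struct toy_job *job)
{
	bool armed;

	/* Calling cleanup twice would be a NULL dereference below. */
	assert(job->fence != NULL);
	armed = job->fence->refcount != 0;

	if (armed)
		toy_fence_put(job->fence);	/* drop the job's own reference */
	else
		free(job->fence);		/* never armed, nobody else has it */

	job->fence = NULL;
	return armed;
}
```

Calling toy_job_cleanup() right after toy_job_init() takes the error-path branch; calling it after toy_job_arm() takes the reference-dropping branch. In both cases job->fence ends up NULL, so anything that still dereferences the job's fence afterwards faults in the way the oops above shows.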