From: Dave Airlie <airlied@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>, Arvind.Yadav@amd.com
Cc: "Daniel Vetter" <daniel.vetter@ffwll.ch>,
	LKML <linux-kernel@vger.kernel.org>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>
Subject: Re: [git pull] drm for 6.1-rc1
Date: Fri, 7 Oct 2022 13:03:59 +1000
Message-ID: <CAPM=9tx+mxGphfr7TuUtXz_YgcFDrRi1oq0EhNU+UdmPPGDdUQ@mail.gmail.com>
In-Reply-To: <CAPM=9tyjMUxAQnJJBVnXXc0tQTjywiK8PLxbJ_Jz4T_pcEospA@mail.gmail.com>

On Fri, 7 Oct 2022 at 12:54, Dave Airlie <airlied@gmail.com> wrote:
>
> On Fri, 7 Oct 2022 at 12:45, Dave Airlie <airlied@gmail.com> wrote:
> >
> > On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@gmail.com> wrote:
> > > >
> > > >
> > > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> > >
> > > As far as I can tell, that's the line
> > >
> > >         struct drm_gpu_scheduler *sched = s_fence->sched;
> > >
> > > where 's_fence' is NULL. The code is
> > >
> > >    0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> > >    5: 41 54                push   %r12
> > >    7: 55                    push   %rbp
> > >    8: 53                    push   %rbx
> > >    9: 48 89 fb              mov    %rdi,%rbx
> > >    c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp <-- trapping instruction
> > >   13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp)
> > >   1a: 48 8b 85 80 01 00 00 mov    0x180(%rbp),%rax
> > >
> > > and that next 'lock decl' instruction would have been the
> > >
> > >         atomic_dec(&sched->hw_rq_count);
> > >
> > > at the top of drm_sched_job_done().
> > >
> > > Now, as to *why* you'd have a NULL s_fence, it would seem that
> > > drm_sched_job_cleanup() was called with an active job. Looking at that
> > > code, it does
> > >
> > >         if (kref_read(&job->s_fence->finished.refcount)) {
> > >                 /* drm_sched_job_arm() has been called */
> > >                 dma_fence_put(&job->s_fence->finished);
> > >         ...
> > >
> > > but then it does
> > >
> > >         job->s_fence = NULL;
> > >
> > > anyway, despite the job still being active. The logic of that kind of
> > > "fake refcount" escapes me. The above looks fundamentally racy, not to
> > > say pointless and wrong (a refcount is a _count_, not a flag, so there
> > > could be multiple references to it, what says that you can just
> > > decrement one of them and say "I'm done").
> > >
> > > Now, _why_ any of that happens, I have no idea. I'm just looking at
> > > the immediate "that pointer is NULL" thing, and reacting to what looks
> > > like a completely bogus refcount pattern.
> > >
> > > But that odd refcount pattern isn't new, so it's presumably some user
> > > on the amd gpu side that changed.
> > >
> > > The problem hasn't happened again for me, but that's not saying a lot,
> > > since it was very random to begin with.
> >
> > I chased the culprit down to a drm sched patch; I'll send you a pull
> > with a revert in it.
> >
> > commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
> > Author: Arvind Yadav <Arvind.Yadav@amd.com>
> > Date:   Wed Sep 14 22:13:20 2022 +0530
> >
> >     drm/sched: Use parent fence instead of finished
> >
> >     Using the parent fence instead of the finished fence
> >     to get the job status. This change is to avoid GPU
> >     scheduler timeout error which can cause GPU reset.
> >
> >     Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
> >     Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >     Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
> >     Signed-off-by: Christian König <christian.koenig@amd.com>
> >
> > I'll let Arvind and Christian maybe work out what is going wrong there.
>
> I do spy two changes queued for -next that might be relevant, so I
> might try just pulling those instead.
>
> I'll send a PR in the next hour once I test it.

Okay, sent. Let me know if you see any further problems.

Dave.

Thread overview: 43+ messages
2022-10-05  3:41 [git pull] drm for 6.1-rc1 Dave Airlie
2022-10-05 18:38 ` Linus Torvalds
2022-10-05 20:56   ` Dave Airlie
2022-10-05 18:40 ` pr-tracker-bot
2022-10-06 18:47 ` Linus Torvalds
2022-10-06 19:28   ` Alex Deucher
2022-10-06 19:47     ` Linus Torvalds
2022-10-06 20:14       ` Alex Deucher
2022-10-06 20:24         ` Dave Airlie
2022-10-06 21:41           ` Dave Airlie
2022-10-06 21:52             ` Dave Airlie
2022-10-06 23:45           ` Linus Torvalds
2022-10-07  2:45             ` Dave Airlie
2022-10-07  2:54               ` Dave Airlie
2022-10-07  3:03                 ` Dave Airlie [this message]
2022-10-07  6:11               ` Christian König
2022-10-07  8:16             ` Daniel Vetter
2022-10-07  9:28               ` Daniel Vetter
2022-10-06 19:29   ` Dave Airlie
2022-10-06 19:41     ` Linus Torvalds
2022-10-07  6:52 Bert Karwatzki
2022-10-07  7:07 Bert Karwatzki
2022-10-07  7:23 Bert Karwatzki
