From: Dave Airlie <airlied@gmail.com>
To: Alex Deucher <alexdeucher@gmail.com>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Daniel Vetter" <daniel.vetter@ffwll.ch>,
	LKML <linux-kernel@vger.kernel.org>,
	dri-devel <dri-devel@lists.freedesktop.org>
Subject: Re: [git pull] drm for 6.1-rc1
Date: Fri, 7 Oct 2022 07:52:35 +1000
Message-ID: <CAPM=9tx8tjzz5q4gkLbh=R+xO5x-8QQOB9E=GAXrV6=-r844-A@mail.gmail.com>
In-Reply-To: <CAPM=9tyL=J26aHdhSSK0jwYQLHBf8jjTMvJmj1cQheUF=wpd-Q@mail.gmail.com>

On Fri, 7 Oct 2022 at 07:41, Dave Airlie <airlied@gmail.com> wrote:
>
> On Fri, 7 Oct 2022 at 06:24, Dave Airlie <airlied@gmail.com> wrote:
> >
> > On Fri, 7 Oct 2022 at 06:14, Alex Deucher <alexdeucher@gmail.com> wrote:
> > >
> > > On Thu, Oct 6, 2022 at 3:48 PM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> > > > >
> > > > > Maybe you are seeing this which is an issue with GPU TLB flushes which
> > > > > is kind of sporadic:
> > > > > https://gitlab.freedesktop.org/drm/amd/-/issues/2113
> > > >
> > > > > Well, that seems to be 5.19, and while timing changes (or whatever
> > > > > other software updates) could have made it start triggering, this
> > > > > machine has been pretty solid otherwise.
> > > >
> > > > > Are you seeing any GPU page faults in your kernel log?
> > > >
> > > > Nothing even remotely like that "no-retry page fault" in that issue
> > > > report. Of course, if it happens just before the whole thing locks
> > > > up...
> > >
> > > Your chip is too old to support retry faults so it's likely you could
> > > be just seeing a GPU page fault followed by a hang.  Your chip also
> > > lacks a paging queue, so you would be affected by the TLB issue.
> >
> >
> > > Okay, I got my FIJI running Linus' tree with netconsole to blow up
> > > like this, running a Fedora 36 desktop, Steam, and Firefox, and then
> > > running poweroff over ssh.
> >
> > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > [ 1234.778782] #PF: supervisor read access in kernel mode
> > [ 1234.778787] #PF: error_code(0x0000) - not-present page
> > [ 1234.778791] PGD 0 P4D 0
> > [ 1234.778798] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
> > > [ 1234.778809] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
> > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> > > [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00 00 f0
> > [ 1234.778834] RSP: 0000:ffffabe680380de0 EFLAGS: 00010087
> > [ 1234.778839] RAX: ffffffffc04e9230 RBX: 0000000000000000 RCX: 0000000000000018
> > [ 1234.778897] RDX: 00000ba278e8977a RSI: ffff953fb288b460 RDI: 0000000000000000
> > [ 1234.778901] RBP: ffff953fb288b598 R08: 00000000000000e0 R09: ffff953fbd98b808
> > [ 1234.778905] R10: 0000000000000000 R11: ffffabe680380ff8 R12: ffffabe680380e00
> > [ 1234.778908] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff953fbd9ec458
> > > [ 1234.778912] FS:  00007f35e7008580(0000) GS:ffff95428ebc0000(0000) knlGS:0000000000000000
> > [ 1234.778916] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1234.778919] CR2: 0000000000000088 CR3: 000000010147c000 CR4: 00000000003506e0
> > [ 1234.778924] Call Trace:
> > [ 1234.778981]  <IRQ>
> > [ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
> > [ 1234.778999]  dma_fence_signal+0x2c/0x50
> > [ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
> > [ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
> > [ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
> > [ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
> > [ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
> > [ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
> > [ 1234.779946]  handle_irq_event+0x34/0x70
> > [ 1234.779949]  handle_edge_irq+0x9f/0x240
> > [ 1234.779954]  __common_interrupt+0x66/0x100
> > [ 1234.779960]  common_interrupt+0xa0/0xc0
> > [ 1234.779965]  </IRQ>
> > [ 1234.779968]  <TASK>
> > [ 1234.779971]  asm_common_interrupt+0x22/0x40
> > [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
> > > [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48 83 ea
> > [ 1234.779985] RSP: 0000:ffffabe680bcfd78 EFLAGS: 00000202
> >
> > > I'll see if I can dig anything up.
>
> I'm kicking the tires on the drm-next tree just prior to submission,
> and in an attempt to make myself look foolish and to tempt fate, it
> seems stable.

Yay, it worked: drm-next crashed. I will start reverting down the rabbit hole.

Dave.
