* [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot
@ 2020-01-24 23:24 bugzilla-daemon
  2020-01-25  0:45 ` [Bug 206299] " bugzilla-daemon
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-24 23:24 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

            Bug ID: 206299
           Summary: [nouveau/xen] RTX 20XX instant reboot
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.4.X
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: blocking
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@kernel-bugs.osdl.org
          Reporter: frederic.epitre@orange.fr
        Regression: No

Created attachment 286963
  --> https://bugzilla.kernel.org/attachment.cgi?id=286963&action=edit
Kernel log

Hi,
On several 4.19.X and 5.3.X kernels, as well as the latest 5.4, I'm having an
issue with an NVIDIA RTX 2080 Ti (also reported by another user with an RTX 2070,
https://groups.google.com/forum/#!msg/qubes-devel/ozOQrOHsUBQ/XtIQsGm3DgAJ)
that causes frequent instant reboots of the machine. Specifically, the
distribution is Qubes OS, so Xen is under the hood. On a plain Fedora 31 live CD
I have not managed to reproduce the crash, whereas it is easily reproducible in
Qubes (e.g. by rapidly and repeatedly resizing windows).

Thanks to the help of Marek Marczykowski-Górecki, I obtained the following
attached kernel log using netconsole.

Any help would be much appreciated.

Frédéric Pierret

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
@ 2020-01-25  0:45 ` bugzilla-daemon
  2020-01-25  9:51 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-25  0:45 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

Ilia Mirkin (imirkin@alum.mit.edu) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |imirkin@alum.mit.edu

--- Comment #1 from Ilia Mirkin (imirkin@alum.mit.edu) ---
Comment on attachment 286963
  --> https://bugzilla.kernel.org/attachment.cgi?id=286963
Kernel log

badf5040 = bad mmio read.

Could there be some PCI situation? Can you include a full boot log?


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
  2020-01-25  0:45 ` [Bug 206299] " bugzilla-daemon
@ 2020-01-25  9:51 ` bugzilla-daemon
  2020-01-25  9:51 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-25  9:51 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #2 from Frédéric Pierret (frederic.epitre@orange.fr) ---
Created attachment 286967
  --> https://bugzilla.kernel.org/attachment.cgi?id=286967&action=edit
kernel log (dmesg)


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
  2020-01-25  0:45 ` [Bug 206299] " bugzilla-daemon
  2020-01-25  9:51 ` bugzilla-daemon
@ 2020-01-25  9:51 ` bugzilla-daemon
  2020-01-26 15:02 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-25  9:51 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #3 from Frédéric Pierret (frederic.epitre@orange.fr) ---
Hi Ilia,
Thank you for your answer.

(In reply to Ilia Mirkin from comment #1)
> Comment on attachment 286963 [details]
> Kernel log
> 
> badf5040 = bad mmio read.
> 
> Could there be some PCI situation? Can you include a full boot log?

You'll find dmesg.log attached. By PCI situation, do you mean a hardware issue?
If so, the card works normally under Windows. For your information, the GPU
remains attached to dom0; it is not passed through to a domU.


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
                   ` (2 preceding siblings ...)
  2020-01-25  9:51 ` bugzilla-daemon
@ 2020-01-26 15:02 ` bugzilla-daemon
  2020-01-26 15:07 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-26 15:02 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #4 from Frédéric Pierret (frederic.epitre@orange.fr) ---
Hi,
While debugging, I found that the exception comes from
gv100_disp_intr_exc_other in gv100.c, because stat = 0x00001800.

I'm trying to figure out what is messed up in the 'disp' structure, going step
by step and starting with a search for NULL pointers. Any advice on how to
proceed?

Thank you.


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
                   ` (3 preceding siblings ...)
  2020-01-26 15:02 ` bugzilla-daemon
@ 2020-01-26 15:07 ` bugzilla-daemon
  2020-01-26 15:55 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-26 15:07 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #5 from Ilia Mirkin (imirkin@alum.mit.edu) ---
Your kernel log doesn't have anything too weird in it (which is good). However
I did see a similar type of error with someone using coreboot (admittedly, with
an MCP77 IGP). Are you using a non-original booting mechanism? Given that
there's signed firmware situations going on, we can't just re-POST the GPU
easily, unlike in the MCP77 case.

The mmio read failures may be a red herring -- basically we try to figure out
why the error happened, and get bad mmio reads in the process. Could just be
that the error handler hasn't been properly adjusted for Turing, and reads from
bad places.

I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have
something clever to say.


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
                   ` (4 preceding siblings ...)
  2020-01-26 15:07 ` bugzilla-daemon
@ 2020-01-26 15:55 ` bugzilla-daemon
  2020-01-26 20:20 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-26 15:55 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #6 from Frédéric Pierret (frederic.epitre@orange.fr) ---
(In reply to Ilia Mirkin from comment #5)
> Your kernel log doesn't have anything too weird in it (which is good).
> However I did see a similar type of error with someone using coreboot
> (admittedly, with an MCP77 IGP). Are you using a non-original booting
> mechanism? Given that there's signed firmware situations going on, we can't
> just re-POST the GPU easily, unlike in the MCP77 case.

I'm using the standard default BIOS (legacy mode).

> The mmio read failures may be a red herring -- basically we try to figure
> out why the error happened, and get bad mmio reads in the process. Could
> just be that the error handler hasn't been properly adjusted for Turing, and
> reads from bad places.
> 
> I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have
> something clever to say.

Hope so and thank you again for your feedback.


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
                   ` (5 preceding siblings ...)
  2020-01-26 15:55 ` bugzilla-daemon
@ 2020-01-26 20:20 ` bugzilla-daemon
  2020-01-26 21:45 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-26 20:20 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #7 from Frédéric Pierret (frederic.epitre@orange.fr) ---
Together with Marek, we think we have found the problem. In the
nv50_disp_chan_mthd function, the exact NULL pointer dereference is
mthd->data[0]->mthd. More precisely, mthd->data is not NULL, but mthd->data[0]
seems to be.

Trying to access mthd->data[0] we get:
  BUG: kernel NULL pointer dereference, address: 0000000000000010
while trying to access mthd->data[0]->mthd, we get:
  BUG: kernel NULL pointer dereference, address: 0000000000000020

So this is exactly the issue. Any idea why mthd->data would be valid but not
mthd->data[0]?


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
                   ` (6 preceding siblings ...)
  2020-01-26 20:20 ` bugzilla-daemon
@ 2020-01-26 21:45 ` bugzilla-daemon
  2020-01-26 22:02 ` bugzilla-daemon
  2020-01-28  8:36 ` bugzilla-daemon
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-26 21:45 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #8 from Frédéric Pierret (frederic.epitre@orange.fr) ---
We found more information!

The previous tests were done with these added lines:

--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -75,13 +75,25 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)
        if (debug > subdev->debug)
                return;

+       nvkm_warn(subdev, "mthd: %p", mthd);
+       nvkm_warn(subdev, "mthd->data: %p", mthd->data);
+       nvkm_warn(subdev, "&mthd->data[0]: %p", &mthd->data[0]);
+       nvkm_warn(subdev, "mthd->data[0].mthd: %p", mthd->data[0].mthd);
        for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {

which gave the following crash log:

[   45.513617] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[   45.513633] nouveau 0000:26:00.0: disp: mthd: 00000000dfa55708
[   45.513638] nouveau 0000:26:00.0: disp: mthd->data: 00000000858af80f
[   45.513641] nouveau 0000:26:00.0: disp: &mthd->data[0]: 00000000858af80f

But after replacing "%p" with "%lx", it turned out that mthd is NULL:

[   74.753207] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[   74.753223] nouveau 0000:26:00.0: disp: mthd: 0
[   74.753226] nouveau 0000:26:00.0: disp: mthd->data: 10
[   74.753231] nouveau 0000:26:00.0: disp: &mthd->data[0]: 10
[   74.753241] BUG: kernel NULL pointer dereference, address: 0000000000000020
[   74.753244] #PF: supervisor read access in kernel mode

That gives some hints!
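
The fault addresses in the log are what a NULL mthd would produce: with a NULL
base pointer, the address of a member is simply its offset inside the
structure. The sketch below is a userspace illustration only. struct chan_mthd
and struct mthd_entry are hypothetical stand-ins, their field layout is an
assumption made for this example rather than a copy of the driver headers, and
an LP64 ABI (such as x86-64 Linux) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-bit pointers (LP64), as on x86-64 Linux. */
static_assert(sizeof(void *) == 8, "this sketch assumes 64-bit pointers");

/* Hypothetical stand-in for one entry of the per-channel method table. */
struct mthd_entry {
	const char *name;  /* offset 0 */
	int nr;            /* offset 8, followed by 4 bytes of padding */
	const void *mthd;  /* offset 16 */
};

/* Hypothetical stand-in for the structure that 'mthd' points to. */
struct chan_mthd {
	const char *name;          /* offset 0 */
	uint32_t addr;             /* offset 8 */
	int32_t prev;              /* offset 12 */
	struct mthd_entry data[];  /* offset 16, flexible array member */
};

/* Address of &mthd->data[0] when mthd itself is NULL. */
size_t fault_addr_data0(void)
{
	return offsetof(struct chan_mthd, data);
}

/* Address read when loading mthd->data[0].mthd with mthd == NULL. */
size_t fault_addr_data0_mthd(void)
{
	return offsetof(struct chan_mthd, data) +
	       offsetof(struct mthd_entry, mthd);
}
```

Under these assumptions the two helpers return 0x10 and 0x20, the same values
as the "%lx" output for &mthd->data[0] and the address in the BUG line. It also
explains why mthd->data looked non-NULL: the array lives inside the structure,
so its address is an offset from mthd and not a separately stored pointer.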


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
                   ` (7 preceding siblings ...)
  2020-01-26 21:45 ` bugzilla-daemon
@ 2020-01-26 22:02 ` bugzilla-daemon
  2020-01-28  8:36 ` bugzilla-daemon
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-26 22:02 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #9 from Frédéric Pierret (frederic.epitre@orange.fr) ---
A rather simple and temporary fix we found is to add:

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
index bcf32d92ee5a..50e3539f33d2 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -74,6 +74,8 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)

        if (debug > subdev->debug)
                return;
+       if (!mthd)
+               return;

        for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {
                u32 base = chan->head * mthd->addr;

With that, the system remains stable.
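
The effect of the guard in the diff can be tried outside the kernel. This is a
minimal userspace sketch, not the driver code: struct chan_mthd,
struct mthd_entry and count_mthd_lists() are hypothetical names, and the table
is a fixed-size array ended by a NULL sentinel so that it can be initialised in
standard C.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry of a method table. */
struct mthd_entry {
	const char *name;
	const void *mthd;  /* NULL marks the end of the table */
};

/* Hypothetical stand-in for the per-channel method description. */
struct chan_mthd {
	const char *name;
	struct mthd_entry data[4];
};

/*
 * Walk the table the same way the loop in the diff does, but bail out
 * early when no table was provided at all. Returns the number of
 * entries seen before the sentinel, or 0 for a NULL table.
 */
int count_mthd_lists(const struct chan_mthd *mthd)
{
	int i;

	/* The guard: without it, mthd->data[0].mthd reads from a NULL base. */
	if (!mthd)
		return 0;

	for (i = 0; mthd->data[i].mthd != NULL; i++)
		;  /* nothing to do per entry in this sketch */

	return i;
}
```

Calling count_mthd_lists(NULL) returns 0 instead of faulting. Note that the
guard only hides the symptom: it does not say why no method table is set for
this channel in the first place.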


* [Bug 206299] [nouveau/xen] RTX 20XX instant reboot
  2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
                   ` (8 preceding siblings ...)
  2020-01-26 22:02 ` bugzilla-daemon
@ 2020-01-28  8:36 ` bugzilla-daemon
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2020-01-28  8:36 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #10 from Frédéric Pierret (frederic.epitre@orange.fr) ---
One last piece of information: each time I try to reproduce the freeze, thanks
to the fix, I now see a second message in the kernel log:

[  814.207723] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[  814.207749] nouveau 0000:26:00.0: bus: MMIO read of 00000000 FAULT at 611390 [ IBUS ]

These two lines always appear together.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-01-28  8:36 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-24 23:24 [Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot bugzilla-daemon
2020-01-25  0:45 ` [Bug 206299] " bugzilla-daemon
2020-01-25  9:51 ` bugzilla-daemon
2020-01-25  9:51 ` bugzilla-daemon
2020-01-26 15:02 ` bugzilla-daemon
2020-01-26 15:07 ` bugzilla-daemon
2020-01-26 15:55 ` bugzilla-daemon
2020-01-26 20:20 ` bugzilla-daemon
2020-01-26 21:45 ` bugzilla-daemon
2020-01-26 22:02 ` bugzilla-daemon
2020-01-28  8:36 ` bugzilla-daemon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.