nouveau.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Monty Montgomery <xiphmont@gmail.com>
To: nouveau@lists.freedesktop.org
Subject: [Nouveau] Panic report and patch against master (Quadro FX)
Date: Fri, 21 Apr 2023 13:36:58 -0400	[thread overview]
Message-ID: <CACrD=+-DUomkWxe0X5M5vMFS_JijjPGNqVXuq+qimie99vmwzw@mail.gmail.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 12629 bytes --]

Hiya folks,

The new nouveau event/notify code in 6.2.x panics early in device
setup here on my Quadro FX hardware, resulting in device setup failure
and a black screen.

The direct cause appears to be i2c aux IDs no longer being truncated
to 8 bits before being used as an index to event->ref[] by
nvkm_event_put, nvkm_event_get, and nvkm_event_send as of
773eb04d14a11552b2c3953097ed09cde2ab4831.

The now-eliminated nvkm_i2c_intr_ctor built an nvkm_notify struct that
stored 'index' as u8, truncating the integer ID to a usable index.
The new code no longer constructs a notification ahead of time, simply
passing the integer id of the i2c bus/aux/pad/etc directly to
nvkm_event_send as int.  The ID value routinely stores
offests/flags/markers(?) in the upper bytes, and when used directly as
an array index, it goes way off the end of event->ref[] and triggers
a panic.  There's no code to check/guard out-of-range access in ref
lookup, so I also added guard/trace code to 6.1 to verify this is a
new bug, and did not record any event->ref[] index bounds errors in
6.1.15.

I don't understand the actual device setup procedure for an nVidia
card, so this surface analysis may be woefully incomplete :-)

At the point of panic, we see the i2c code calling event code with an
index of 269 (0x010d).

Apr 17 01:07:53 boatanchor kernel: nouveau: DRM:00000006:8000000e:
ioctl: mthd size 8
Apr 17 01:07:53 boatanchor kernel: nouveau: DRM:00000006:8000000e:
ioctl: mthd vers 0 mthd 00
Apr 17 01:07:53 boatanchor kernel: nouveau 0000:01:00.0: i2c: event:
ntfy allow 00000003 on 269
Apr 17 01:07:53 boatanchor kernel: nouveau 0000:01:00.0: i2c: event:
ntfy state changed
Apr 17 01:07:53 boatanchor kernel: nouveau 0000:01:00.0: i2c: event:
incr 00000003 on 269
Apr 17 01:07:53 boatanchor kernel: nouveau 0000:01:00.0: i2c: event:
allowing 0 on 269
Apr 17 01:07:53 boatanchor kernel: BUG: kernel NULL pointer
dereference, address: 0000000000000000
Apr 17 01:07:53 boatanchor kernel: #PF: supervisor instruction fetch
in kernel mode
Apr 17 01:07:53 boatanchor kernel: #PF: error_code(0x0010) - not-present page
Apr 17 01:07:53 boatanchor kernel: PGD 0 P4D 0
Apr 17 01:07:53 boatanchor kernel: Oops: 0010 [#1] PREEMPT SMP PTI
Apr 17 01:07:53 boatanchor kernel: CPU: 1 PID: 427 Comm: (udev-worker)
Not tainted 6.2.10-200.fc37.x86_64+debug #1
Apr 17 01:07:53 boatanchor kernel: Hardware name: LENOVO
25003BU/25003BU, BIOS 6KET60WW (1.30 ) 10/24/2012
Apr 17 01:07:53 boatanchor kernel: RIP: 0010:0x0
Apr 17 01:07:53 boatanchor kernel: Code: Unable to access opcode bytes
at 0xffffffffffffffd6.
Apr 17 01:07:53 boatanchor kernel: RSP: 0018:ffff967ec1b9f8c0 EFLAGS: 00010086
Apr 17 01:07:53 boatanchor kernel: RAX: 0000000000000000 RBX:
0000000000000002 RCX: 0000000000000000
Apr 17 01:07:53 boatanchor kernel: RDX: 0000000000000000 RSI:
0000000000000001 RDI: ffff8b69120fc400
Apr 17 01:07:53 boatanchor kernel: RBP: ffff8b69120fc568 R08:
ffff8b69120fc400 R09: ffff967ec1b9f690
Apr 17 01:07:53 boatanchor kernel: R10: 0000000000000003 R11:
ffffffffb1564968 R12: ffff8b631362ca88
Apr 17 01:07:53 boatanchor kernel: R13: 000000000000010d R14:
0000000000000001 R15: 0000000000000001
Apr 17 01:07:53 boatanchor kernel: FS:  00007fbc0af0cb40(0000)
GS:ffff8b6a1e200000(0000) knlGS:0000000000000000
Apr 17 01:07:53 boatanchor kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Apr 17 01:07:53 boatanchor kernel: CR2: ffffffffffffffd6 CR3:
00000001133ea000 CR4: 00000000000006e0
Apr 17 01:07:53 boatanchor kernel: Call Trace:
Apr 17 01:07:53 boatanchor kernel:  <TASK>
Apr 17 01:07:53 boatanchor kernel:  nvkm_event_ntfy_state+0x184/0x250 [nouveau]
Apr 17 01:07:53 boatanchor kernel:  nvkm_event_ntfy_allow+0x5f/0xc0 [nouveau]
Apr 17 01:07:53 boatanchor kernel:  nvkm_uevent_mthd+0x49/0x70 [nouveau]
Apr 17 01:07:53 boatanchor kernel:  nvkm_ioctl+0x10a/0x240 [nouveau]
Apr 17 01:07:53 boatanchor kernel:  nvif_object_mthd+0xcb/0x200 [nouveau]
Apr 17 01:07:53 boatanchor kernel:  nvif_event_allow+0x26/0xa0 [nouveau]
Apr 17 01:07:53 boatanchor kernel:  nouveau_display_init+0x71/0x110 [nouveau]
Apr 17 01:07:53 boatanchor kernel:
nouveau_drm_device_init+0x1d8/0x9b0 [nouveau]
Apr 17 01:07:53 boatanchor kernel:  ? pci_bus_read_config_word+0x49/0x80
Apr 17 01:07:53 boatanchor kernel:  nouveau_drm_probe+0x128/0x280 [nouveau]
Apr 17 01:07:53 boatanchor kernel:  local_pci_probe+0x41/0x90
Apr 17 01:07:53 boatanchor kernel:  pci_device_probe+0xc3/0x230
Apr 17 01:07:53 boatanchor kernel:  really_probe+0x1b9/0x410
Apr 17 01:07:53 boatanchor kernel:  __driver_probe_device+0x78/0x170
Apr 17 01:07:53 boatanchor kernel:  driver_probe_device+0x1f/0x90
Apr 17 01:07:53 boatanchor kernel:  __driver_attach+0xd2/0x1c0
Apr 17 01:07:53 boatanchor kernel:  ? __pfx___driver_attach+0x10/0x10
Apr 17 01:07:53 boatanchor kernel:  bus_for_each_dev+0x8a/0xd0
Apr 17 01:07:53 boatanchor kernel:  bus_add_driver+0x141/0x230
Apr 17 01:07:53 boatanchor kernel:  driver_register+0x77/0x120
Apr 17 01:07:53 boatanchor kernel:  ? __pfx_init_module+0x10/0x10 [nouveau]
Apr 17 01:07:53 boatanchor kernel:  do_one_initcall+0x70/0x290
Apr 17 01:07:53 boatanchor kernel:  do_init_module+0x4a/0x220
Apr 17 01:07:53 boatanchor kernel:  __do_sys_init_module+0x192/0x1c0
Apr 17 01:07:53 boatanchor kernel:  do_syscall_64+0x5b/0x80
Apr 17 01:07:53 boatanchor kernel:  ? lock_release+0x15d/0x400
Apr 17 01:07:53 boatanchor kernel:  ? preempt_count_add+0x47/0xa0
Apr 17 01:07:53 boatanchor kernel:  ? __up_read+0x98/0x220
Apr 17 01:07:53 boatanchor kernel:  ? do_user_addr_fault+0x202/0x730
Apr 17 01:07:53 boatanchor kernel:  ? exc_page_fault+0xfc/0x200
Apr 17 01:07:53 boatanchor kernel:  ? lockdep_hardirqs_off+0x9c/0xe0
Apr 17 01:07:53 boatanchor kernel:  ? asm_exc_page_fault+0x22/0x30
Apr 17 01:07:53 boatanchor kernel:  ? lockdep_hardirqs_on+0x7d/0x100
Apr 17 01:07:53 boatanchor kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Apr 17 01:07:53 boatanchor kernel: RIP: 0033:0x7fbc0b97800e
Apr 17 01:07:53 boatanchor kernel: Code: 48 8b 0d 25 5e 0c 00 f7 d8 64
89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3
 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3
48 8b 0d f2 5d 0c 00 f7 d8 64 89 01 48
Apr 17 01:07:53 boatanchor kernel: RSP: 002b:00007fff5dc02b58 EFLAGS:
00000246 ORIG_RAX: 00000000000000af
Apr 17 01:07:53 boatanchor kernel: RAX: ffffffffffffffda RBX:
00005613c37bee10 RCX: 00007fbc0b97800e
Apr 17 01:07:53 boatanchor kernel: RDX: 00005613c36b7bb0 RSI:
00000000007258f6 RDI: 00005613c3fc1b70
Apr 17 01:07:53 boatanchor kernel: RBP: 00005613c36b7bb0 R08:
00005613c36af860 R09: 00007fff5dc002f6
Apr 17 01:07:53 boatanchor kernel: R10: 0000000000000005 R11:
0000000000000246 R12: 0000000000020000
Apr 17 01:07:53 boatanchor kernel: R13: 00005613c36ab4e0 R14:
0000000000000000 R15: 00005613c3689610
Apr 17 01:07:53 boatanchor kernel:  </TASK>
Apr 17 01:07:53 boatanchor kernel: Modules linked in: nouveau(+) raid0
wacom uas usb_storage crc32c_intel serio_raw e1000e sha5
12_ssse3 sdhci_pci cqhci sdhci mmc_core drm_ttm_helper firewire_ohci
ttm mxm_wmi firewire_core drm_display_helper crc_itu_t cec
 video wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables
dm_multipath fuse
Apr 17 01:07:53 boatanchor kernel: CR2: 0000000000000000

nvkm/core/event.c:nvkm_event_get() is inlined, that's where the panic
is happening.
Noting that 269 == 0x10d, which matches an i2c pad ID allocated
earlier during CCB
enumeration, and that earlier we logged only 16 CCBs:

Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 00:
type 05 drive 00 sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0100: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 0000: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 01:
type 05 drive 01 sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0101: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 0001: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 02:
type 05 drive 03 sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0102: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 0002: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 03:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0103: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 04:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0104: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 05:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0105: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 06:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0106: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 07:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0107: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 08:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0108: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 09:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0109: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0a:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010a: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0b:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010b: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0c:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010c: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0d:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010d: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0e:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010e: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0f:
type ff drive ff sense ff share ff auxch ff
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010f: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: dcb 08
drv 00 unknown
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 020d: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: aux 010d: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 010d: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: aux 010d: ctor
Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 010d: ctor

I added additional debugging logs to determine that the index_nr for
the relevant event is only 16, so an index of 269 to event->ref[] is
obviously wrong.

It appears that the simplest maybe-correct fix is to reinstate the u8
truncation in this argument path, eg:

diff -urp a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c
b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c  2023-04-06
06:12:48.000000000 -0400
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c  2023-04-21
09:07:59.864884896 -0400
@@ -86,7 +86,7 @@ nvkm_uconn_uevent(struct nvkm_object *ob
                        if (args->v0.types &
NVIF_CONN_EVENT_V0_UNPLUG) bits |= NVKM_I2C_UNPLUG;
                        if (args->v0.types & NVIF_CONN_EVENT_V0_IRQ
) bits |= NVKM_I2C_IRQ;

-                       return nvkm_uevent_add(uevent,
&device->i2c->event, outp->dp.aux->id, bits,
+                       return nvkm_uevent_add(uevent,
&device->i2c->event, (outp->dp.aux->id)&0xff, bits,
                                               nvkm_uconn_uevent_aux);
                }
        }

This assumes the truncation was itself correct/intentional, and not
due to some other
mistake which should also be corrected.  However, a slightly larger patch I've
attached (which contains the above fix + a little extra guard/logging)
has me back up
and running under 6.2.10.

Cheers,
Monty

[-- Attachment #2: linux-6.2.10-nouveau_event.patch --]
[-- Type: text/x-patch, Size: 4149 bytes --]

diff -urp a/drivers/gpu/drm/nouveau/nvkm/core/event.c b/drivers/gpu/drm/nouveau/nvkm/core/event.c
--- a/drivers/gpu/drm/nouveau/nvkm/core/event.c	2023-04-06 06:12:48.000000000 -0400
+++ b/drivers/gpu/drm/nouveau/nvkm/core/event.c	2023-04-19 13:11:51.510540894 -0400
@@ -28,13 +28,30 @@ nvkm_event_put(struct nvkm_event *event,
 	assert_spin_locked(&event->refs_lock);
 
 	nvkm_trace(event->subdev, "event: decr %08x on %d\n", types, index);
-
+	
+	if (index >= event->index_nr) {
+     	        nvkm_warn(event->subdev, "event: index out of range (%d >= %d)!\n", index, event->index_nr);
+		return;
+	}
+	
 	while (types) {
 		int type = __ffs(types); types &= ~(1 << type);
+		if (type >= event->types_nr) {
+		        nvkm_warn(event->subdev, "event: type out of range (%d >= %d)!\n", type, event->types_nr);
+			continue;
+		}
+		
 		if (--event->refs[index * event->types_nr + type] == 0) {
-			nvkm_trace(event->subdev, "event: blocking %d on %d\n", type, index);
-			if (event->func->fini)
-				event->func->fini(event, 1 << type, index);
+			nvkm_trace(event->subdev, "event: blocking %08x on %d\n", type, index);
+			if (event->func) {
+			        if (event->func->fini) {
+					event->func->fini(event, 1 << type, index);
+				} else {
+				        nvkm_debug(event->subdev, "event: no .fini, nothing to do\n");
+				}
+			} else {
+			        nvkm_warn(event->subdev, "event: missing .func entry!\n");
+			}
 		}
 	}
 }
@@ -46,12 +63,29 @@ nvkm_event_get(struct nvkm_event *event,
 
 	nvkm_trace(event->subdev, "event: incr %08x on %d\n", types, index);
 
+	if (index >= event->index_nr) {
+     	        nvkm_warn(event->subdev, "event: index out of range (%d >= %d)!\n", index, event->index_nr);
+		return;
+	}
+
 	while (types) {
 		int type = __ffs(types); types &= ~(1 << type);
+		if (type >= event->types_nr) {
+		        nvkm_warn(event->subdev, "event: type out of range (%d >= %d)!\n", type, event->types_nr);
+			continue;
+		}
 		if (++event->refs[index * event->types_nr + type] == 1) {
-			nvkm_trace(event->subdev, "event: allowing %d on %d\n", type, index);
-			if (event->func->init)
-				event->func->init(event, 1 << type, index);
+			nvkm_trace(event->subdev, "event: allowing %08x on %d\n", type, index);
+			if (event->func) {
+			        if (event->func->init) {
+				        event->func->init(event, 1 << type, index);
+				} else {
+				        nvkm_debug(event->subdev, "event: no .init, nothing to do\n");
+				}
+			} else {
+			        nvkm_warn(event->subdev, "event: missing .func entry!\n");
+			}
+
 		}
 	}
 }
@@ -146,7 +180,7 @@ void
 nvkm_event_ntfy_add(struct nvkm_event *event, int id, u32 bits, bool wait, nvkm_event_func func,
 		    struct nvkm_event_ntfy *ntfy)
 {
-	nvkm_trace(event->subdev, "event: ntfy add %08x on %d wait:%d\n", id, bits, wait);
+       nvkm_trace(event->subdev, "event: ntfy add %08x on %d wait:%d\n", bits, id, wait);
 
 	ntfy->event = event;
 	ntfy->id = id;
@@ -201,10 +235,14 @@ int
 __nvkm_event_init(const struct nvkm_event_func *func, struct nvkm_subdev *subdev,
 		  int types_nr, int index_nr, struct nvkm_event *event)
 {
+
 	event->refs = kzalloc(array3_size(index_nr, types_nr, sizeof(*event->refs)), GFP_KERNEL);
 	if (!event->refs)
 		return -ENOMEM;
 
+	nvkm_trace(subdev, "event: init for %d types on %d indices calling %p\n",
+		   types_nr, index_nr, func);
+
 	event->func = func;
 	event->subdev = subdev;
 	event->types_nr = types_nr;
diff -urp a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c	2023-04-06 06:12:48.000000000 -0400
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c	2023-04-21 09:07:59.864884896 -0400
@@ -86,7 +86,7 @@ nvkm_uconn_uevent(struct nvkm_object *ob
 			if (args->v0.types & NVIF_CONN_EVENT_V0_UNPLUG) bits |= NVKM_I2C_UNPLUG;
 			if (args->v0.types & NVIF_CONN_EVENT_V0_IRQ   ) bits |= NVKM_I2C_IRQ;
 
-			return nvkm_uevent_add(uevent, &device->i2c->event, outp->dp.aux->id, bits,
+			return nvkm_uevent_add(uevent, &device->i2c->event, (outp->dp.aux->id)&0xff, bits,
 					       nvkm_uconn_uevent_aux);
 		}
 	}

                 reply	other threads:[~2023-04-21 17:37 UTC|newest]

Thread overview: [no followups] expand[flat|nested]  mbox.gz  Atom feed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CACrD=+-DUomkWxe0X5M5vMFS_JijjPGNqVXuq+qimie99vmwzw@mail.gmail.com' \
    --to=xiphmont@gmail.com \
    --cc=nouveau@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).