[PATCH 3/3] drm/nouveau: Shut down GPU on kernel shutdown

From: Lyude Paul <lyude@redhat.com>
To: nouveau@lists.freedesktop.org
Cc: Karol Herbst <kherbst@redhat.com>,
	Ben Skeggs <bskeggs@redhat.com>, David Airlie <airlied@linux.ie>,
	dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: [PATCH 3/3] drm/nouveau: Shut down GPU on kernel shutdown
Date: Wed, 22 Aug 2018 21:40:08 -0400
Message-ID: <20180823014009.21532-4-lyude@redhat.com>
In-Reply-To: <20180823014009.21532-1-lyude@redhat.com>

A little while ago I sent some patches to try to fix issues with
initializing the GM107 GPU with nouveau on the ThinkPad P50. The issues
I was witnessing were rather bizarre: seemingly at random, initializing
the GPU would fail with failed mthds from disp that nouveau had not
actually kicked through the evo channel yet. Example:

    [    1.603467] nouveau 0000:01:00.0: disp: outp 02:0006:0f48: aux power -> demand
    [    1.603931] nouveau 0000:01:00.0: disp: outp 03:0002:0f48: no heads (0 3 2)
    [    1.604375] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: no heads (0 3 4)
    [    1.604858] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> always
    [    1.605354] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> demand
    [    1.605815] nouveau 0000:01:00.0: disp: outp 05:0002:0f81: no heads (0 3 2)
--->[    1.607289] nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000 00000002
--->[    1.608818] nouveau 0000:01:00.0: disp: chid 1 mthd 0000 data 00000400 00001000 00000002
--->[    1.609500] nouveau 0000:01:00.0: disp: chid 2 mthd 0000 data 00000400 00001000 00000002
    [    1.612392] [drm:drm_dp_dpcd_read [drm_kms_helper]] sor-0006-0f42: 0x00000 AUX -> (ret=  1) 12
    [    1.612774] [drm:drm_dp_dpcd_write [drm_kms_helper]] sor-0006-0f42: 0x00111 AUX <- (ret=  1) 00
    [    1.635748] [drm:drm_dp_dpcd_access [drm_kms_helper]] Too many retries, giving up. First error: -6
    [    1.635752] [drm:drm_dp_dpcd_read [drm_kms_helper]] sor-0006-0f48: 0x00000 AUX -> (ret= -6)
    [    1.658128] [drm:drm_dp_dpcd_access [drm_kms_helper]] Too many retries, giving up. First error: -6
    [    1.658131] [drm:drm_dp_dpcd_read [drm_kms_helper]] sor-0006-0f81: 0x00000 AUX -> (ret= -6)

These failures would also occur /before/ nouveau had actually pushed
anything to the evo channel. Later on, the rest of the GPU would start
failing like so:

[    3.851956] ------------[ cut here ]------------
[    3.851958] nouveau 0000:01:00.0: timeout
[    3.851995] WARNING: CPU: 0 PID: 62 at drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.c:1560 gf100_grctx_generate+0x89d/0x8b0 [nouveau]
[    3.851997] Modules linked in: serio_raw crc32c_intel xhci_pci i915(O+) xhci_hcd nouveau(O) video mxm_wmi wmi i2c_algo_bit drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops ttm(O) drm(O) i2c_core
[    3.852010] CPU: 0 PID: 62 Comm: kworker/0:2 Tainted: G           O      4.18.0-rc8Lyude-Test+ #7
[    3.852011] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W (1.51 ) 05/18/2018
[    3.852018] Workqueue: events output_poll_execute [drm_kms_helper]
[    3.852105] RIP: 0010:gf100_grctx_generate+0x89d/0x8b0 [nouveau]
[    3.852107] Code: ff 49 8b 7c 24 10 48 8b 5f 50 48 85 db 75 04 48 8b 5f 10 e8 25 5d 30 e1 48 89 da 48 c7 c7 4e e7 2a a0 48 89 c6 e8 65 c1 e9 e0 <0f> 0b bb f0 ff ff ff e9 68 f9 ff ff 0f 1f 80 00 00 00 00 0f 1f 44
[    3.852127] RSP: 0018:ffffc9000027b898 EFLAGS: 00010282
[    3.852128] RAX: 0000000000000000 RBX: ffff880876c20bd0 RCX: 0000000000000006
[    3.852130] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff88089b415570
[    3.852132] RBP: ffffc9000027b958 R08: 0000000000000000 R09: 0000000000000000
[    3.852133] R10: ffff880876685f00 R11: ffffffff8140cc60 R12: ffff8808716d2000
[    3.852135] R13: ffffc9000027b8d0 R14: ffffc9000027b8c8 R15: ffff88087165c000
[    3.852137] FS:  0000000000000000(0000) GS:ffff88089b400000(0000) knlGS:0000000000000000
[    3.852139] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.852140] CR2: 00005621d2c180b8 CR3: 000000000200a005 CR4: 00000000003606f0
[    3.852142] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    3.852144] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    3.852145] Call Trace:
[    3.852168]  ? nv04_timer_read+0x48/0x60 [nouveau]
[    3.852191]  gf100_gr_init_ctxctl+0x536/0xa40 [nouveau]
[    3.852212]  gf100_gr_init+0x563/0x590 [nouveau]
[    3.852234]  gf100_gr_init_+0x5b/0x60 [nouveau]
[    3.852255]  nvkm_gr_init+0x1d/0x20 [nouveau]
[    3.852267]  nvkm_engine_init+0xb9/0x1f0 [nouveau]
[    3.852280]  nvkm_subdev_init+0xbc/0x210 [nouveau]
[    3.852292]  nvkm_engine_ref.part.0+0x4a/0x70 [nouveau]
[    3.852304]  nvkm_engine_ref+0x13/0x20 [nouveau]
[    3.852316]  nvkm_ioctl_new+0x12c/0x260 [nouveau]
[    3.852337]  ? nvkm_fifo_chan_dtor+0x100/0x100 [nouveau]
[    3.852358]  ? gf100_fermi_mthd+0x100/0x100 [nouveau]
[    3.852371]  nvkm_ioctl+0xe2/0x180 [nouveau]
[    3.852392]  nvkm_client_ioctl+0x12/0x20 [nouveau]
[    3.852403]  nvif_object_ioctl+0x47/0x50 [nouveau]
[    3.852415]  nvif_object_init+0xc8/0x120 [nouveau]
[    3.852435]  nvc0_fbcon_accel_init+0x5b/0x950 [nouveau]
[    3.852455]  nouveau_fbcon_create+0x5bb/0x5e0 [nouveau]
[    3.852460]  ? drm_setup_crtcs+0x247/0xa60 [drm_kms_helper]
[    3.852464]  __drm_fb_helper_initial_config_and_unlock+0x1c0/0x410 [drm_kms_helper]
[    3.852468]  drm_fb_helper_hotplug_event.part.33+0xa9/0xb0 [drm_kms_helper]
[    3.852472]  drm_fb_helper_hotplug_event+0x1c/0x30 [drm_kms_helper]
[    3.852492]  nouveau_fbcon_output_poll_changed+0xb6/0x110 [nouveau]
[    3.852496]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
[    3.852500]  output_poll_execute+0x198/0x1c0 [drm_kms_helper]
[    3.852504]  process_one_work+0x1b2/0x370
[    3.852506]  worker_thread+0x37/0x3a0
[    3.852508]  kthread+0x120/0x140
[    3.852510]  ? wq_update_unbound_numa+0x10/0x10
[    3.852511]  ? kthread_create_worker_on_cpu+0x70/0x70
[    3.852514]  ret_from_fork+0x35/0x40
[    3.852516] ---[ end trace 583fe2d8feb59e4a ]---
[    3.852733] nouveau 0000:01:00.0: gr: failed to construct context
[    3.852737] nouveau 0000:01:00.0: gr: init failed, -16

Originally I had a good bit of trouble reproducing this issue at all.
I've since managed to figure out a reproducer that seems to work about
70% of the time:

- Boot the machine while docked, load nouveau
- Undock the machine, wait for nouveau to go into runtime suspend
- Reboot the machine. If done correctly, you should be able to see
  nouveau briefly resume itself before the shutdown finishes
- On the next boot, the following should happen (if it doesn't, go back
  to step one):
  - If nouveau isn't loaded within 10-20 seconds of booting, you will
    probably see an unclaimed interrupt warning.
  - Once nouveau is loaded, you'll see the symptoms I've described here

At first I assumed that the BIOS was probably trying to probe for
displays before loading the OS, leaving us with a GPU in a funky
sort-of-on state. I tried some solutions that involved shutting down the
various display channels that were left on, but eventually discovered
that we were starting off with more than just disp channels left on: the
entire gr was left on as well. Additionally, it was pointed out to me by
Ben Skeggs that the BIOS doesn't make any use of evo channels anyway.

After some investigation we found the real cause of the problem, and
unfortunately it's far worse than leaving a few channels on: the version
of the BIOS on this P50 appears to have a bug which makes it so that on
full system reboots, the dedicated GPU somehow does not always get power
cycled. In fact, it's even left in nearly the same state it was in
before we finished rebooting! How awful.

While it's quite clear there's one rather impressive BIOS bug going on
here that needs to get fixed, we can at least solve most of the symptoms
of this issue by making nouveau a little better about cleaning up after
itself on kernel shutdowns/reboots. This is something nouveau is going
to need to be able to do if it's ever going to be used for things like
PCI passthrough anyway, since we want to avoid passing the GPU around
from VM guest to VM host if it's still half-way initialized.
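
To illustrate the idea, here is a minimal userspace sketch of why reusing
the remove path as the shutdown hook leaves the device torn down at
reboot. This is only a model: struct model_gpu, struct model_driver and
every model_* function are hypothetical names made up for this example,
not the kernel's struct pci_driver or anything in nouveau. It assumes the
bus core simply calls the shutdown callback when one is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: hypothetical types that mimic the shape of a
 * driver callback table. Not the kernel's struct pci_driver. */
struct model_gpu {
	bool powered;      /* is the hardware still running? */
	bool initialized;  /* are engines/channels still set up? */
};

struct model_driver {
	int  (*probe)(struct model_gpu *gpu);
	void (*remove)(struct model_gpu *gpu);
	void (*shutdown)(struct model_gpu *gpu);  /* may be NULL */
};

/* Bring the modelled device up. */
static int model_probe(struct model_gpu *gpu)
{
	assert(gpu != NULL);
	gpu->powered = true;
	gpu->initialized = true;
	return 0;
}

/* Full teardown: the path we also want to run at shutdown. */
static void model_remove(struct model_gpu *gpu)
{
	assert(gpu != NULL);
	gpu->initialized = false;
	gpu->powered = false;
}

/* Assumed bus behaviour at reboot: call .shutdown only if it is set. */
static void model_bus_shutdown(const struct model_driver *drv,
			       struct model_gpu *gpu)
{
	if (drv->shutdown)
		drv->shutdown(gpu);
}

/* No shutdown hook: device state survives into the next boot if the
 * platform does not power cycle the device. */
static const struct model_driver without_hook = {
	.probe = model_probe,
	.remove = model_remove,
};

/* Shutdown hook pointing at the remove path: device is torn down. */
static const struct model_driver with_hook = {
	.probe = model_probe,
	.remove = model_remove,
	.shutdown = model_remove,
};
```

In this model, probing a device and then calling model_bus_shutdown()
with without_hook leaves both flags set, while doing the same with
with_hook clears them, which is the state we want to hand to the next
boot (or to a VM guest/host).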

Luckily I have some contacts at Lenovo, so I will be bringing this up
and referencing this patch to make sure that this gets fixed properly in
the P50 BIOS as well, especially since without a BIOS fix we can still
fail to reboot if both the card and the kernel end up in a bad state
that requires a full reboot. But until then,
this patch should make the problem significantly less noticeable.

For reference, the BIOS version on this P50 is version 1.52.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: Karol Herbst <kherbst@redhat.com>
[omitting Cc to stable. I'd /love/ to get this into a stable kernel, but
unfortunately there are too many large changes this depends on to do that]
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index b88b338dc79c..c37641496324 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -1151,6 +1151,7 @@ nouveau_drm_pci_driver = {
 	.id_table = nouveau_drm_pci_table,
 	.probe = nouveau_drm_probe,
 	.remove = nouveau_drm_remove,
+	.shutdown = nouveau_drm_remove,
 	.driver.pm = &nouveau_pm_ops,
 };
 
-- 
2.17.1