From: Lyude Paul <lyude@redhat.com>
To: Hans de Goede <hdegoede@redhat.com>,
	Ben Skeggs <bskeggs@redhat.com>,
	 Karol Herbst <kherbst@redhat.com>
Cc: "nouveau@lists.freedesktop.org" <nouveau@lists.freedesktop.org>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>
Subject: Re: [Nouveau] nouveau lockdep deadlock report with 5.18-rc6
Date: Tue, 17 May 2022 18:24:55 -0400
Message-ID: <1cfc459d038a3499ead4ce7c3619829263231a53.camel@redhat.com>
In-Reply-To: <ac39455b-b85c-4cf7-8cd0-089325c9514a@redhat.com>

Yeah, I saw this as well, will try to bisect soon.

On Tue, 2022-05-17 at 13:10 +0200, Hans de Goede wrote:
> Hi All,
> 
> I just noticed the below lockdep possible deadlock report with a 5.18-rc6
> kernel on a Dell Latitude E6430 laptop with the following nvidia GPU:
> 
> 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GF108GLM [NVS
> 5200M] [10de:0dfc] (rev a1)
> 01:00.1 Audio device [0403]: NVIDIA Corporation GF108 High Definition Audio
> Controller [10de:0bea] (rev a1)
> 
> This is with the laptop in Optimus mode, so with the Intel integrated
> gfx from the i5-3320M CPU driving the LCD panel and with nothing connected
> to the HDMI connector, which is always routed to the NVIDIA GPU on this
> laptop.
> 
> The lockdep possible deadlock warning seems to happen when the NVIDIA GPU
> is runtime suspended shortly after gdm has loaded:
> 
> [   24.859171] ======================================================
> [   24.859173] WARNING: possible circular locking dependency detected
> [   24.859175] 5.18.0-rc6+ #34 Tainted: G            E    
> [   24.859178] ------------------------------------------------------
> [   24.859179] kworker/1:1/46 is trying to acquire lock:
> [   24.859181] ffff92b0c0ee0518 (&cli->mutex){+.+.}-{3:3}, at:
> nouveau_vga_lastclose+0x910/0x1030 [nouveau]
> [   24.859231] 
>                but task is already holding lock:
> [   24.859233] ffff92b0c4bf35a0 (reservation_ww_class_mutex){+.+.}-{3:3},
> at: ttm_bo_wait+0x7d/0x140 [ttm]
> [   24.859243] 
>                which lock already depends on the new lock.
> 
> [   24.859244] 
>                the existing dependency chain (in reverse order) is:
> [   24.859246] 
>                -> #1 (reservation_ww_class_mutex){+.+.}-{3:3}:
> [   24.859249]        __ww_mutex_lock.constprop.0+0xb3/0xfb0
> [   24.859256]        ww_mutex_lock+0x38/0xa0
> [   24.859259]        nouveau_bo_pin+0x30/0x380 [nouveau]
> [   24.859297]        nouveau_channel_del+0x1d7/0x3e0 [nouveau]
> [   24.859328]        nouveau_channel_new+0x48/0x730 [nouveau]
> [   24.859358]        nouveau_abi16_ioctl_channel_alloc+0x113/0x360
> [nouveau]
> [   24.859389]        drm_ioctl_kernel+0xa1/0x150
> [   24.859392]        drm_ioctl+0x21c/0x410
> [   24.859395]        nouveau_drm_ioctl+0x56/0x1820 [nouveau]
> [   24.859431]        __x64_sys_ioctl+0x8d/0xc0
> [   24.859436]        do_syscall_64+0x5b/0x80
> [   24.859440]        entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   24.859443] 
>                -> #0 (&cli->mutex){+.+.}-{3:3}:
> [   24.859446]        __lock_acquire+0x12e2/0x1f90
> [   24.859450]        lock_acquire+0xad/0x290
> [   24.859453]        __mutex_lock+0x90/0x830
> [   24.859456]        nouveau_vga_lastclose+0x910/0x1030 [nouveau]
> [   24.859493]        ttm_bo_move_to_lru_tail+0x32c/0x980 [ttm]
> [   24.859498]        ttm_mem_evict_first+0x25c/0x4b0 [ttm]
> [   24.859503]        ttm_resource_manager_evict_all+0x93/0x1b0 [ttm]
> [   24.859509]        nouveau_debugfs_fini+0x161/0x260 [nouveau]
> [   24.859545]        nouveau_drm_ioctl+0xa4a/0x1820 [nouveau]
> [   24.859582]        pci_pm_runtime_suspend+0x5c/0x180
> [   24.859585]        __rpm_callback+0x48/0x1b0
> [   24.859589]        rpm_callback+0x5a/0x70
> [   24.859591]        rpm_suspend+0x10a/0x6f0
> [   24.859594]        pm_runtime_work+0xa0/0xb0
> [   24.859596]        process_one_work+0x254/0x560
> [   24.859601]        worker_thread+0x4f/0x390
> [   24.859604]        kthread+0xe6/0x110
> [   24.859607]        ret_from_fork+0x22/0x30
> [   24.859611] 
>                other info that might help us debug this:
> 
> [   24.859612]  Possible unsafe locking scenario:
> 
> [   24.859613]        CPU0                    CPU1
> [   24.859615]        ----                    ----
> [   24.859616]   lock(reservation_ww_class_mutex);
> [   24.859618]                                lock(&cli->mutex);
> [   24.859620]                                lock(reservation_ww_class_mutex);
> [   24.859622]   lock(&cli->mutex);
> [   24.859624] 
>                 *** DEADLOCK ***
> 
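To make the inversion concrete, here is a minimal userspace sketch (illustrative names, not nouveau code) of the two acquisition orders lockdep recorded above, with plain pthread mutexes standing in for &cli->mutex and reservation_ww_class_mutex. The real reservation lock is a ww_mutex with acquire-context/backoff semantics that this sketch deliberately ignores, and the comments only echo the call chains above rather than the exact nouveau call sites:

#include <pthread.h>
#include <stdio.h>

/* Stand-ins for the two locks in the report: "cli" for &cli->mutex,
 * "resv" for reservation_ww_class_mutex. */
static pthread_mutex_t cli  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t resv = PTHREAD_MUTEX_INITIALIZER;

/* Chain #1: the channel-alloc ioctl path takes cli first, then the
 * buffer reservation inside nouveau_bo_pin(). */
static void *ioctl_path(void *unused)
{
	pthread_mutex_lock(&cli);
	pthread_mutex_lock(&resv);
	pthread_mutex_unlock(&resv);
	pthread_mutex_unlock(&cli);
	return NULL;
}

/* Chain #0: runtime suspend evicts buffers with the reservation
 * already held, then the move path wants cli. */
static void *suspend_path(void *unused)
{
	pthread_mutex_lock(&resv);
	pthread_mutex_lock(&cli);
	pthread_mutex_unlock(&cli);
	pthread_mutex_unlock(&resv);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, ioctl_path, NULL);
	pthread_create(&b, NULL, suspend_path, NULL);
	/* If each thread wins its first lock() before the other reaches
	 * its second, both joins below block forever -- exactly the
	 * CPU0/CPU1 cross-order wait shown in the diagram above. */
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("lucky timing, no deadlock this run");
	return 0;
}

The collision only bites with unlucky timing, which is why the machine usually survives; lockdep complains on the first run regardless, because it records the lock ordering itself rather than waiting for the race to fire.
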
> [   24.859625] 3 locks held by kworker/1:1/46:
> [   24.859627]  #0: ffff92b0c0bb4338 ((wq_completion)pm){+.+.}-{0:0}, at:
> process_one_work+0x1d0/0x560
> [   24.859634]  #1: ffffa8ffc02dfe80 ((work_completion)(&dev->power.work)){+.+.}-{0:0}, at: process_one_work+0x1d0/0x560
> [   24.859641]  #2: ffff92b0c4bf35a0 (reservation_ww_class_mutex){+.+.}-
> {3:3}, at: ttm_bo_wait+0x7d/0x140 [ttm]
> [   24.859649] 
>                stack backtrace:
> [   24.859651] CPU: 1 PID: 46 Comm: kworker/1:1 Tainted: G            E    
> 5.18.0-rc6+ #34
> [   24.859654] Hardware name: Dell Inc. Latitude E6430/0H3MT5, BIOS A21
> 05/08/2017
> [   24.859656] Workqueue: pm pm_runtime_work
> [   24.859660] Call Trace:
> [   24.859662]  <TASK>
> [   24.859665]  dump_stack_lvl+0x5b/0x74
> [   24.859669]  check_noncircular+0xdf/0x100
> [   24.859672]  ? register_lock_class+0x38/0x470
> [   24.859678]  __lock_acquire+0x12e2/0x1f90
> [   24.859683]  lock_acquire+0xad/0x290
> [   24.859686]  ? nouveau_vga_lastclose+0x910/0x1030 [nouveau]
> [   24.859724]  ? lock_is_held_type+0xa6/0x120
> [   24.859730]  __mutex_lock+0x90/0x830
> [   24.859733]  ? nouveau_vga_lastclose+0x910/0x1030 [nouveau]
> [   24.859770]  ? nvif_vmm_map+0x114/0x130 [nouveau]
> [   24.859791]  ? nouveau_vga_lastclose+0x910/0x1030 [nouveau]
> [   24.859829]  ? nouveau_vga_lastclose+0x910/0x1030 [nouveau]
> [   24.859866]  nouveau_vga_lastclose+0x910/0x1030 [nouveau]
> [   24.859905]  ttm_bo_move_to_lru_tail+0x32c/0x980 [ttm]
> [   24.859912]  ttm_mem_evict_first+0x25c/0x4b0 [ttm]
> [   24.859919]  ? lock_release+0x20/0x2a0
> [   24.859923]  ttm_resource_manager_evict_all+0x93/0x1b0 [ttm]
> [   24.859930]  nouveau_debugfs_fini+0x161/0x260 [nouveau]
> [   24.859968]  nouveau_drm_ioctl+0xa4a/0x1820 [nouveau]
> [   24.860005]  pci_pm_runtime_suspend+0x5c/0x180
> [   24.860008]  ? pci_dev_put+0x20/0x20
> [   24.860011]  __rpm_callback+0x48/0x1b0
> [   24.860014]  ? pci_dev_put+0x20/0x20
> [   24.860018]  rpm_callback+0x5a/0x70
> [   24.860020]  ? pci_dev_put+0x20/0x20
> [   24.860023]  rpm_suspend+0x10a/0x6f0
> [   24.860025]  ? process_one_work+0x1d0/0x560
> [   24.860031]  pm_runtime_work+0xa0/0xb0
> [   24.860034]  process_one_work+0x254/0x560
> [   24.860039]  worker_thread+0x4f/0x390
> [   24.860043]  ? process_one_work+0x560/0x560
> [   24.860046]  kthread+0xe6/0x110
> [   24.860049]  ? kthread_complete_and_exit+0x20/0x20
> [   24.860053]  ret_from_fork+0x22/0x30
> [   24.860059]  </TASK>
> 
> Regards,
> 
> Hans
> 
> 

-- 
Cheers,
 Lyude Paul (she/her)
 Software Engineer at Red Hat


Thread overview:
2022-05-17 11:10 [Nouveau] nouveau lockdep deadlock report with 5.18-rc6 Hans de Goede
2022-05-17 22:24 ` Lyude Paul [this message]
2022-05-18 17:42 ` Lyude Paul
2022-05-20 11:46   ` Computer Enthusiastic
2022-05-23 19:59     ` Lyude Paul
