* [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs
@ 2018-08-01 21:14 Lyude Paul
  2018-08-01 21:14 ` [PATCH v4 1/8] drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement Lyude Paul
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Lyude Paul @ 2018-08-01 21:14 UTC (permalink / raw)
  To: nouveau
  Cc: David Airlie, linux-kernel, dri-devel, Ben Skeggs, Sean Paul,
	Maarten Lankhorst, Gustavo Padovan

This is the latest version of
https://patchwork.freedesktop.org/series/46815/

With a bunch of fixes to the new fb_helper code to prevent it from
breaking module loading/unloading with nouveau. Also, lots of
documentation fixes and one fix in response to a kbuild bot report.

Lyude Paul (8):
  drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement
  drm/nouveau: Enable polling even if we have runtime PM
  drm/fb_helper: Introduce suspend/resume_hotplug()
  drm/nouveau: Fix deadlock with fb_helper using new helpers
  drm/nouveau: Use pm_runtime_get_noresume() in connector_detect()
  drm/nouveau: Respond to HPDs by probing one conn at a time
  drm/nouveau: Fix deadlocks in nouveau_connector_detect()
  drm/nouveau: Call pm_runtime_get_noresume() from hpd handlers

 drivers/gpu/drm/drm_fb_helper.c             | 123 +++++++++++++++++++-
 drivers/gpu/drm/nouveau/nouveau_connector.c |  60 ++++++++--
 drivers/gpu/drm/nouveau/nouveau_connector.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_display.c   |   7 +-
 drivers/gpu/drm/nouveau/nouveau_drm.c       |  16 ++-
 drivers/gpu/drm/nouveau/nouveau_fbcon.c     |   1 +
 include/drm/drm_fb_helper.h                 |  22 ++++
 7 files changed, 213 insertions(+), 17 deletions(-)

-- 
2.17.1


* [PATCH v4 1/8] drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement
  2018-08-01 21:14 [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs Lyude Paul
@ 2018-08-01 21:14 ` Lyude Paul
  2018-08-02 19:10   ` Lukas Wunner
  2018-08-01 21:14 ` [PATCH v4 2/8] drm/nouveau: Enable polling even if we have runtime PM Lyude Paul
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Lyude Paul @ 2018-08-01 21:14 UTC (permalink / raw)
  To: nouveau
  Cc: Lukas Wunner, Peter Ujfalusi, stable, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

Turns out this part is my fault for not noticing when reviewing
9a2eba337cace ("drm/nouveau: Fix drm poll_helper handling"). Currently
we call drm_kms_helper_poll_enable() from nouveau_display_hpd_work().
This makes basically no sense, however, because it means we're calling
drm_kms_helper_poll_enable() every time we schedule the hotplug
detection work. It also goes against the advice in
drm_kms_helper_poll_enable()'s documentation:

 Note that calls to enable and disable polling must be strictly ordered,
 which is automatically the case when they're only called from
 suspend/resume callbacks.

Of course, hotplugs can't really be ordered. They could even happen
immediately after we called drm_kms_helper_poll_disable() in
nouveau_display_fini(), which can lead to all sorts of issues.

Additionally, enabling polling /after/ we call
drm_helper_hpd_irq_event() could also mean that we'd miss a hotplug
event anyway, since drm_helper_hpd_irq_event() wouldn't bother trying to
probe connectors so long as polling is disabled.

So, simply move this back into nouveau_display_init() again. The race
condition that both of these patches attempted to work around has
already been fixed properly in

  d61a5c106351 ("drm/nouveau: Fix deadlock on runtime suspend")
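
To make the ordering requirement concrete, the pairing this patch
restores looks roughly like this (an illustrative sketch with made-up
function names, not the literal nouveau code):

  /* Polling is enabled in the display init path and disabled in the
   * fini path, so the enable/disable calls stay strictly ordered.
   */
  int example_display_init(struct drm_device *dev)
  {
          drm_kms_helper_poll_enable(dev);  /* paired with disable below */
          /* ... enable hotplug interrupts, etc ... */
          return 0;
  }

  void example_display_fini(struct drm_device *dev)
  {
          /* ... disable hotplug interrupts, etc ... */
          drm_kms_helper_poll_disable(dev); /* ordered against the enable */
  }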

Fixes: 9a2eba337cace ("drm/nouveau: Fix drm poll_helper handling")
Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Peter Ujfalusi <peter.ujfalusi@ti.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/nouveau/nouveau_display.c | 7 +++++--
 drivers/gpu/drm/nouveau/nouveau_drm.c     | 1 -
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
index ec7861457b84..1d36ab5d4796 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -355,8 +355,6 @@ nouveau_display_hpd_work(struct work_struct *work)
 	pm_runtime_get_sync(drm->dev->dev);
 
 	drm_helper_hpd_irq_event(drm->dev);
-	/* enable polling for external displays */
-	drm_kms_helper_poll_enable(drm->dev);
 
 	pm_runtime_mark_last_busy(drm->dev->dev);
 	pm_runtime_put_sync(drm->dev->dev);
@@ -411,6 +409,11 @@ nouveau_display_init(struct drm_device *dev)
 	if (ret)
 		return ret;
 
+	/* enable connector detection and polling for connectors without HPD
+	 * support
+	 */
+	drm_kms_helper_poll_enable(dev);
+
 	/* enable hotplug interrupts */
 	drm_connector_list_iter_begin(dev, &conn_iter);
 	nouveau_for_each_non_mst_connector_iter(connector, &conn_iter) {
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index c7ec86d6c3c9..5fdc1fbe2ee5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -835,7 +835,6 @@ nouveau_pmops_runtime_suspend(struct device *dev)
 		return -EBUSY;
 	}
 
-	drm_kms_helper_poll_disable(drm_dev);
 	nouveau_switcheroo_optimus_dsm();
 	ret = nouveau_do_suspend(drm_dev, true);
 	pci_save_state(pdev);
-- 
2.17.1


* [PATCH v4 2/8] drm/nouveau: Enable polling even if we have runtime PM
  2018-08-01 21:14 [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs Lyude Paul
  2018-08-01 21:14 ` [PATCH v4 1/8] drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement Lyude Paul
@ 2018-08-01 21:14 ` Lyude Paul
  2018-08-02 19:14   ` Lukas Wunner
  2018-08-01 21:14 ` [PATCH v4 3/8] drm/fb_helper: Introduce suspend/resume_hotplug() Lyude Paul
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Lyude Paul @ 2018-08-01 21:14 UTC (permalink / raw)
  To: nouveau
  Cc: Lukas Wunner, Peter Ujfalusi, stable, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

Having runtime PM makes no difference to whether or not we want polling,
and it's now safe to just enable polling unconditionally in
nouveau_drm_load() thanks to d61a5c106351 ("drm/nouveau: Fix deadlock on
runtime suspend").

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Peter Ujfalusi <peter.ujfalusi@ti.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 5fdc1fbe2ee5..ee2546db09c9 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -592,10 +592,11 @@ nouveau_drm_load(struct drm_device *dev, unsigned long flags)
 		pm_runtime_allow(dev->dev);
 		pm_runtime_mark_last_busy(dev->dev);
 		pm_runtime_put(dev->dev);
-	} else {
-		/* enable polling for external displays */
-		drm_kms_helper_poll_enable(dev);
 	}
+
+	/* enable polling for connectors without hpd */
+	drm_kms_helper_poll_enable(dev);
+
 	return 0;
 
 fail_dispinit:
-- 
2.17.1


* [PATCH v4 3/8] drm/fb_helper: Introduce suspend/resume_hotplug()
  2018-08-01 21:14 [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs Lyude Paul
  2018-08-01 21:14 ` [PATCH v4 1/8] drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement Lyude Paul
  2018-08-01 21:14 ` [PATCH v4 2/8] drm/nouveau: Enable polling even if we have runtime PM Lyude Paul
@ 2018-08-01 21:14 ` Lyude Paul
  2018-08-01 21:14 ` [PATCH v4 4/8] drm/nouveau: Fix deadlock with fb_helper using new helpers Lyude Paul
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Lyude Paul @ 2018-08-01 21:14 UTC (permalink / raw)
  To: nouveau
  Cc: stable, Lukas Wunner, Karol Herbst, Gustavo Padovan,
	Maarten Lankhorst, Sean Paul, David Airlie, dri-devel,
	linux-kernel

I'm sure I don't need to tell you that fb_helper's locking is a mess.
That being said, fb_helper's locking mess can seriously complicate the
runtime suspend/resume operations of drivers because it can invoke
atomic commits and connector probing from anywhere that calls
drm_fb_helper_hotplug_event(). Since most drivers use
drm_fb_helper_output_poll_changed() as their output_poll_changed
handler, this can happen in every single context that can fire off a
hotplug event. An example:

[  246.669625] INFO: task kworker/4:0:37 blocked for more than 120 seconds.
[  246.673398]       Not tainted 4.18.0-rc5Lyude-Test+ #2
[  246.675271] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.676527] kworker/4:0     D    0    37      2 0x80000000
[  246.677580] Workqueue: events output_poll_execute [drm_kms_helper]
[  246.678704] Call Trace:
[  246.679753]  __schedule+0x322/0xaf0
[  246.680916]  schedule+0x33/0x90
[  246.681924]  schedule_preempt_disabled+0x15/0x20
[  246.683023]  __mutex_lock+0x569/0x9a0
[  246.684035]  ? kobject_uevent_env+0x117/0x7b0
[  246.685132]  ? drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
[  246.686179]  mutex_lock_nested+0x1b/0x20
[  246.687278]  ? mutex_lock_nested+0x1b/0x20
[  246.688307]  drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
[  246.689420]  drm_fb_helper_output_poll_changed+0x23/0x30 [drm_kms_helper]
[  246.690462]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
[  246.691570]  output_poll_execute+0x198/0x1c0 [drm_kms_helper]
[  246.692611]  process_one_work+0x231/0x620
[  246.693725]  worker_thread+0x214/0x3a0
[  246.694756]  kthread+0x12b/0x150
[  246.695856]  ? wq_pool_ids_show+0x140/0x140
[  246.696888]  ? kthread_create_worker_on_cpu+0x70/0x70
[  246.697998]  ret_from_fork+0x3a/0x50
[  246.699034] INFO: task kworker/0:1:60 blocked for more than 120 seconds.
[  246.700153]       Not tainted 4.18.0-rc5Lyude-Test+ #2
[  246.701182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.702278] kworker/0:1     D    0    60      2 0x80000000
[  246.703293] Workqueue: pm pm_runtime_work
[  246.704393] Call Trace:
[  246.705403]  __schedule+0x322/0xaf0
[  246.706439]  ? wait_for_completion+0x104/0x190
[  246.707393]  schedule+0x33/0x90
[  246.708375]  schedule_timeout+0x3a5/0x590
[  246.709289]  ? mark_held_locks+0x58/0x80
[  246.710208]  ? _raw_spin_unlock_irq+0x2c/0x40
[  246.711222]  ? wait_for_completion+0x104/0x190
[  246.712134]  ? trace_hardirqs_on_caller+0xf4/0x190
[  246.713094]  ? wait_for_completion+0x104/0x190
[  246.713964]  wait_for_completion+0x12c/0x190
[  246.714895]  ? wake_up_q+0x80/0x80
[  246.715727]  ? get_work_pool+0x90/0x90
[  246.716649]  flush_work+0x1c9/0x280
[  246.717483]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  246.718442]  __cancel_work_timer+0x146/0x1d0
[  246.719247]  cancel_delayed_work_sync+0x13/0x20
[  246.720043]  drm_kms_helper_poll_disable+0x1f/0x30 [drm_kms_helper]
[  246.721123]  nouveau_pmops_runtime_suspend+0x3d/0xb0 [nouveau]
[  246.721897]  pci_pm_runtime_suspend+0x6b/0x190
[  246.722825]  ? pci_has_legacy_pm_support+0x70/0x70
[  246.723737]  __rpm_callback+0x7a/0x1d0
[  246.724721]  ? pci_has_legacy_pm_support+0x70/0x70
[  246.725607]  rpm_callback+0x24/0x80
[  246.726553]  ? pci_has_legacy_pm_support+0x70/0x70
[  246.727376]  rpm_suspend+0x142/0x6b0
[  246.728185]  pm_runtime_work+0x97/0xc0
[  246.728938]  process_one_work+0x231/0x620
[  246.729796]  worker_thread+0x44/0x3a0
[  246.730614]  kthread+0x12b/0x150
[  246.731395]  ? wq_pool_ids_show+0x140/0x140
[  246.732202]  ? kthread_create_worker_on_cpu+0x70/0x70
[  246.732878]  ret_from_fork+0x3a/0x50
[  246.733768] INFO: task kworker/4:2:422 blocked for more than 120 seconds.
[  246.734587]       Not tainted 4.18.0-rc5Lyude-Test+ #2
[  246.735393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.736113] kworker/4:2     D    0   422      2 0x80000080
[  246.736789] Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
[  246.737665] Call Trace:
[  246.738490]  __schedule+0x322/0xaf0
[  246.739250]  schedule+0x33/0x90
[  246.739908]  rpm_resume+0x19c/0x850
[  246.740750]  ? finish_wait+0x90/0x90
[  246.741541]  __pm_runtime_resume+0x4e/0x90
[  246.742370]  nv50_disp_atomic_commit+0x31/0x210 [nouveau]
[  246.743124]  drm_atomic_commit+0x4a/0x50 [drm]
[  246.743775]  restore_fbdev_mode_atomic+0x1c8/0x240 [drm_kms_helper]
[  246.744603]  restore_fbdev_mode+0x31/0x140 [drm_kms_helper]
[  246.745373]  drm_fb_helper_restore_fbdev_mode_unlocked+0x54/0xb0 [drm_kms_helper]
[  246.746220]  drm_fb_helper_set_par+0x2d/0x50 [drm_kms_helper]
[  246.746884]  drm_fb_helper_hotplug_event.part.28+0x96/0xb0 [drm_kms_helper]
[  246.747675]  drm_fb_helper_output_poll_changed+0x23/0x30 [drm_kms_helper]
[  246.748544]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
[  246.749439]  nv50_mstm_hotplug+0x15/0x20 [nouveau]
[  246.750111]  drm_dp_send_link_address+0x177/0x1c0 [drm_kms_helper]
[  246.750764]  drm_dp_check_and_send_link_address+0xa8/0xd0 [drm_kms_helper]
[  246.751602]  drm_dp_mst_link_probe_work+0x51/0x90 [drm_kms_helper]
[  246.752314]  process_one_work+0x231/0x620
[  246.752979]  worker_thread+0x44/0x3a0
[  246.753838]  kthread+0x12b/0x150
[  246.754619]  ? wq_pool_ids_show+0x140/0x140
[  246.755386]  ? kthread_create_worker_on_cpu+0x70/0x70
[  246.756162]  ret_from_fork+0x3a/0x50
[  246.756847]
	   Showing all locks held in the system:
[  246.758261] 3 locks held by kworker/4:0/37:
[  246.759016]  #0: 00000000f8df4d2d ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
[  246.759856]  #1: 00000000e6065461 ((work_completion)(&(&dev->mode_config.output_poll_work)->work)){+.+.}, at: process_one_work+0x1b3/0x620
[  246.760670]  #2: 00000000cb66735f (&helper->lock){+.+.}, at: drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
[  246.761516] 2 locks held by kworker/0:1/60:
[  246.762274]  #0: 00000000fff6be0f ((wq_completion)"pm"){+.+.}, at: process_one_work+0x1b3/0x620
[  246.762982]  #1: 000000005ab44fb4 ((work_completion)(&dev->power.work)){+.+.}, at: process_one_work+0x1b3/0x620
[  246.763890] 1 lock held by khungtaskd/64:
[  246.764664]  #0: 000000008cb8b5c3 (rcu_read_lock){....}, at: debug_show_all_locks+0x23/0x185
[  246.765588] 5 locks held by kworker/4:2/422:
[  246.766440]  #0: 00000000232f0959 ((wq_completion)"events_long"){+.+.}, at: process_one_work+0x1b3/0x620
[  246.767390]  #1: 00000000bb59b134 ((work_completion)(&mgr->work)){+.+.}, at: process_one_work+0x1b3/0x620
[  246.768154]  #2: 00000000cb66735f (&helper->lock){+.+.}, at: drm_fb_helper_restore_fbdev_mode_unlocked+0x4c/0xb0 [drm_kms_helper]
[  246.768966]  #3: 000000004c8f0b6b (crtc_ww_class_acquire){+.+.}, at: restore_fbdev_mode_atomic+0x4b/0x240 [drm_kms_helper]
[  246.769921]  #4: 000000004c34a296 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_backoff+0x8a/0x1b0 [drm]
[  246.770839] 1 lock held by dmesg/1038:
[  246.771739] 2 locks held by zsh/1172:
[  246.772650]  #0: 00000000836d0438 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
[  246.773680]  #1: 000000001f4f4d48 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0xc1/0x870

[  246.775522] =============================================

Because of this, there's an unreasonable number of places where DRM
drivers would need to insert special handling to avoid attempting to
resume the device from all of these contexts that can deadlock. It's
difficult to even try synchronizing with fb_helper in these contexts,
since any of them could introduce a deadlock by waiting to acquire the
top-level fb_helper mutex while it's being held by another thread that
might potentially call down to pm_runtime_get_sync().

Luckily, there's no actual reason we need to allow fb_helper to handle
hotplugging at all while runtime suspending a device. If a hotplug
happens during a runtime suspend operation, there's no reason the driver
can't just re-enable fbcon's hotplug handling afterwards and bring it up
to speed with any hotplug events it may have missed by calling
drm_fb_helper_hotplug_event().

So, let's make this easy and just add helpers to handle disabling and
enabling fb_helper connector probing without having to potentially wait
on fb_helper to finish its work. This will let us fix the runtime
suspend/resume deadlocks that we've been experiencing with nouveau,
along with being able to fix some of the incorrect runtime PM core
interaction that other DRM drivers currently perform to work around
these issues.
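
To illustrate the intended usage, here is a rough sketch of how a
driver's runtime PM callbacks would use the new helpers (the callback
names here are illustrative; the actual nouveau wiring comes in a later
patch in this series):

  static int example_runtime_suspend(struct device *dev)
  {
          struct drm_device *drm_dev = dev_get_drvdata(dev);

          /* Inhibit fb_helper's hotplug handling without blocking. If
           * fb_helper is busy, abort and let the RPM core retry later.
           */
          if (!drm_fb_helper_suspend_hotplug(drm_dev->fb_helper))
                  return -EBUSY;

          /* ... disable HPD interrupts and power down the device ... */
          return 0;
  }

  static int example_runtime_resume(struct device *dev)
  {
          struct drm_device *drm_dev = dev_get_drvdata(dev);

          /* ... power up the device and re-enable HPD interrupts ... */

          /* Uninhibit fb_helper; this also fires off any hotplug event
           * that was delayed while we were inhibited.
           */
          drm_fb_helper_resume_hotplug(drm_dev->fb_helper);
          return 0;
  }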

Changes since v3:
- Actually check if fb_helper is NULL in both new helpers
- Actually check drm_fbdev_emulation in both new helpers
- Don't fire off a fb_helper hotplug unconditionally; only do it if
  the following conditions are true (as otherwise, calling this in the
  wrong spot will cause Bad Things to happen):
  - fb_helper hotplug handling was actually inhibited previously
  - fb_helper actually has a delayed hotplug pending
  - fb_helper is actually bound
  - fb_helper is actually initialized
- Add __must_check to drm_fb_helper_suspend_hotplug(). There's no
  situation where a driver would actually want to use this without
  checking the return value, so enforce that
- Rewrite and clarify the documentation for both helpers.
- Make sure to return true in the drm_fb_helper_suspend_hotplug() stub
  that's provided in drm_fb_helper.h when CONFIG_DRM_FBDEV_EMULATION
  isn't enabled
- Actually grab the toplevel fb_helper lock in
  drm_fb_helper_resume_hotplug(), since it's possible other activity
  (such as a hotplug) could be going on at the same time the driver
  calls drm_fb_helper_resume_hotplug(). We need this to check whether or
  not drm_fb_helper_hotplug_event() needs to be called anyway

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Karol Herbst <karolherbst@gmail.com>
---
 drivers/gpu/drm/drm_fb_helper.c | 123 +++++++++++++++++++++++++++++++-
 include/drm/drm_fb_helper.h     |  22 ++++++
 2 files changed, 144 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index 2ee1eaa66188..b5f1dee0c3a0 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -84,6 +84,11 @@ static DEFINE_MUTEX(kernel_fb_helper_lock);
  * For suspend/resume consider using drm_mode_config_helper_suspend() and
  * drm_mode_config_helper_resume() which takes care of fbdev as well.
  *
+ * For runtime suspend and runtime resume, drivers which need to disable
+ * normal hotplug handling should consider using
+ * drm_fb_helper_suspend_hotplug() and drm_fb_helper_resume_hotplug() to
+ * avoid deadlocking with fb_helper's hotplug handling.
+ *
  * All other functions exported by the fb helper library can be used to
  * implement the fbdev driver interface by the driver.
  *
@@ -2733,6 +2738,118 @@ int drm_fb_helper_initial_config(struct drm_fb_helper *fb_helper, int bpp_sel)
 }
 EXPORT_SYMBOL(drm_fb_helper_initial_config);
 
+/**
+ * drm_fb_helper_resume_hotplug - Uninhibit fb_helper hotplug handling
+ * @fb_helper: driver-allocated fbdev helper, can be NULL
+ *
+ * Uninhibit fb_helper's hotplug handling after it was previously inhibited by
+ * a call to drm_fb_helper_suspend_hotplug(). Unlike
+ * drm_fb_helper_suspend_hotplug(), this function will wait on
+ * fb_helper->lock.
+ *
+ * This helper will take care of handling any hotplug events that happened
+ * while fb_helper's hotplug handling was suspended. Since this possibly
+ * implies a call to drm_fb_helper_hotplug_event(), care must be taken when
+ * calling this function as it may initiate a modeset.
+ *
+ * Please note that this function is different from
+ * drm_fb_helper_set_suspend(). It does not resume fb_helper; it only allows
+ * fb_helper to probe connectors in response to changes in the device's
+ * connector configuration if this functionality was previously disabled by
+ * drm_fb_helper_suspend_hotplug(). Generally, a driver will only want to call
+ * this in its runtime resume callbacks.
+ *
+ * Drivers calling drm_fb_helper_suspend_hotplug() must make sure to call this
+ * somewhere in their runtime resume callbacks.
+ *
+ * See also: drm_fb_helper_suspend_hotplug()
+ */
+void
+drm_fb_helper_resume_hotplug(struct drm_fb_helper *fb_helper)
+{
+	bool changed;
+
+	if (!drm_fbdev_emulation || !fb_helper)
+		return;
+
+	mutex_lock(&fb_helper->lock);
+
+	changed = !fb_helper->deferred_setup &&
+		  fb_helper->fb &&
+		  drm_fb_helper_is_bound(fb_helper) &&
+		  fb_helper->hotplug_suspended &&
+		  fb_helper->delayed_hotplug;
+	if (changed)
+		fb_helper->delayed_hotplug = false;
+
+	fb_helper->hotplug_suspended = false;
+
+	mutex_unlock(&fb_helper->lock);
+
+	if (changed)
+		drm_fb_helper_hotplug_event(fb_helper);
+}
+EXPORT_SYMBOL(drm_fb_helper_resume_hotplug);
+
+/**
+ * drm_fb_helper_suspend_hotplug - Attempt to temporarily suspend fb_helper's
+ *                                 hotplug handling
+ * @fb_helper: driver-allocated fbdev helper, can be NULL
+ *
+ * Temporarily inhibit fb_helper from responding to connector changes without
+ * blocking on fb_helper->lock, if possible. This can be called by a DRM
+ * driver early on in its runtime suspend callback to both check whether or
+ * not fb_helper is still busy, and prevent hotplugs that might occur part-way
+ * through the runtime suspend process from being handled by fb_helper until
+ * drm_fb_helper_resume_hotplug() is called. This dramatically simplifies the
+ * runtime suspend process, as it eliminates the possibility that fb_helper
+ * might try to perform a modeset halfway through the runtime suspend process
+ * in response to a connector hotplug, something which will almost certainly
+ * lead to deadlocking for drivers that need to disable normal hotplug
+ * handling in their runtime suspend handlers.
+ *
+ * Calls to this function should be put at the very start of a driver's
+ * runtime suspend operation if desired. The driver is then responsible for
+ * re-enabling fb_helper hotplug handling when normal hotplug detection
+ * becomes available on the device again by calling
+ * drm_fb_helper_resume_hotplug(). Usually, a driver will want to re-enable
+ * fb_helper hotplug handling once the hotplug detection capabilities of its
+ * devices have returned to normal (e.g. when the device is runtime resumed,
+ * or after the runtime suspend process was aborted for some reason).
+ *
+ * Please note that this function is different from
+ * drm_fb_helper_set_suspend(), in that it does not actually suspend
+ * fb_helper. It only prevents fb_helper from responding to connector hotplugs
+ * on its own. Generally, a driver will only want to call this in its
+ * runtime suspend callback.
+ *
+ * See also: drm_fb_helper_resume_hotplug()
+ *
+ * RETURNS:
+ * True if hotplug handling was disabled successfully, or fb_helper wasn't
+ * actually initialized/enabled yet. False if grabbing &fb_helper->lock would
+ * have meant blocking on fb_helper. When this function returns false, it
+ * usually means that fb_helper is still busy doing something such as
+ * probing connectors or performing a modeset. Drivers should treat this the
+ * same way they would any other activity on the device, and abort the runtime
+ * suspend process as early as possible in response.
+ */
+bool __must_check
+drm_fb_helper_suspend_hotplug(struct drm_fb_helper *fb_helper)
+{
+	if (!drm_fbdev_emulation || !fb_helper)
+		return true;
+
+	if (!mutex_trylock(&fb_helper->lock))
+		return false;
+
+	fb_helper->hotplug_suspended = true;
+	mutex_unlock(&fb_helper->lock);
+
+	return true;
+}
+EXPORT_SYMBOL(drm_fb_helper_suspend_hotplug);
+
 /**
  * drm_fb_helper_hotplug_event - respond to a hotplug notification by
  *                               probing all the outputs attached to the fb
@@ -2751,6 +2868,9 @@ EXPORT_SYMBOL(drm_fb_helper_initial_config);
  * for a race-free fbcon setup and will make sure that the fbdev emulation will
  * not miss any hotplug events.
  *
+ * See also: drm_fb_helper_suspend_hotplug()
+ * See also: drm_fb_helper_resume_hotplug()
+ *
  * RETURNS:
  * 0 on success and a non-zero error code otherwise.
  */
@@ -2768,7 +2888,8 @@ int drm_fb_helper_hotplug_event(struct drm_fb_helper *fb_helper)
 		return err;
 	}
 
-	if (!fb_helper->fb || !drm_fb_helper_is_bound(fb_helper)) {
+	if (!fb_helper->fb || !drm_fb_helper_is_bound(fb_helper) ||
+	    fb_helper->hotplug_suspended) {
 		fb_helper->delayed_hotplug = true;
 		mutex_unlock(&fb_helper->lock);
 		return err;
diff --git a/include/drm/drm_fb_helper.h b/include/drm/drm_fb_helper.h
index b069433e7fc1..9c6e4ceff3af 100644
--- a/include/drm/drm_fb_helper.h
+++ b/include/drm/drm_fb_helper.h
@@ -232,6 +232,14 @@ struct drm_fb_helper {
 	 * See also: @deferred_setup
 	 */
 	int preferred_bpp;
+
+	/**
+	 * @hotplug_suspended:
+	 *
+	 * Whether or not we can currently handle hotplug events, or if we
+	 * need to wait for the DRM device to uninhibit us.
+	 */
+	bool hotplug_suspended;
 };
 
 /**
@@ -330,6 +338,11 @@ void drm_fb_helper_fbdev_teardown(struct drm_device *dev);
 
 void drm_fb_helper_lastclose(struct drm_device *dev);
 void drm_fb_helper_output_poll_changed(struct drm_device *dev);
+
+void drm_fb_helper_resume_hotplug(struct drm_fb_helper *fb_helper);
+bool __must_check
+drm_fb_helper_suspend_hotplug(struct drm_fb_helper *fb_helper);
+
 #else
 static inline void drm_fb_helper_prepare(struct drm_device *dev,
 					struct drm_fb_helper *helper,
@@ -564,6 +577,15 @@ static inline void drm_fb_helper_output_poll_changed(struct drm_device *dev)
 {
 }
 
+static inline void
+drm_fb_helper_resume_hotplug(struct drm_fb_helper *fb_helper)
+{
+}
+static inline bool __must_check
+drm_fb_helper_suspend_hotplug(struct drm_fb_helper *fb_helper)
+{
+	return true;
+}
 #endif
 
 static inline int
-- 
2.17.1


* [PATCH v4 4/8] drm/nouveau: Fix deadlock with fb_helper using new helpers
  2018-08-01 21:14 [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs Lyude Paul
                   ` (2 preceding siblings ...)
  2018-08-01 21:14 ` [PATCH v4 3/8] drm/fb_helper: Introduce suspend/resume_hotplug() Lyude Paul
@ 2018-08-01 21:14 ` Lyude Paul
  2018-08-01 21:14 ` [PATCH v4 5/8] drm/nouveau: Use pm_runtime_get_noresume() in connector_detect() Lyude Paul
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Lyude Paul @ 2018-08-01 21:14 UTC (permalink / raw)
  To: nouveau
  Cc: stable, Lukas Wunner, Karol Herbst, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

This removes the potential for deadlocking with fb_helper entirely, by
preventing it from handling hotplugs as early as possible in the runtime
suspend process. If that turns out not to be possible because some
fb_helper action was queued up before we got a chance to disable hotplug
handling, we simply return -EBUSY so that the runtime PM core attempts
to autosuspend the device again once fb_helper isn't doing anything.

This fixes one of the issues causing deadlocks on runtime suspend/resume
with nouveau on my P50.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Karol Herbst <karolherbst@gmail.com>
---
 drivers/gpu/drm/nouveau/nouveau_drm.c   | 8 ++++++++
 drivers/gpu/drm/nouveau/nouveau_fbcon.c | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index ee2546db09c9..d47cb5b2af98 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -836,6 +836,14 @@ nouveau_pmops_runtime_suspend(struct device *dev)
 		return -EBUSY;
 	}
 
+	/* There's no way for us to stop fb_helper work in reaction to
+	 * hotplugs later in the RPM process. First off, we don't want to;
+	 * fb_helper should be able to keep the GPU awake. Second, it is
+	 * capable of grabbing basically any lock in existence.
+	 */
+	if (!drm_fb_helper_suspend_hotplug(drm_dev->fb_helper))
+		return -EBUSY;
+
 	nouveau_switcheroo_optimus_dsm();
 	ret = nouveau_do_suspend(drm_dev, true);
 	pci_save_state(pdev);
diff --git a/drivers/gpu/drm/nouveau/nouveau_fbcon.c b/drivers/gpu/drm/nouveau/nouveau_fbcon.c
index 85c1f10bc2b6..963ba630fd04 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fbcon.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fbcon.c
@@ -466,6 +466,7 @@ nouveau_fbcon_set_suspend_work(struct work_struct *work)
 	console_unlock();
 
 	if (state == FBINFO_STATE_RUNNING) {
+		drm_fb_helper_resume_hotplug(drm->dev->fb_helper);
 		pm_runtime_mark_last_busy(drm->dev->dev);
 		pm_runtime_put_sync(drm->dev->dev);
 	}
-- 
2.17.1


* [PATCH v4 5/8] drm/nouveau: Use pm_runtime_get_noresume() in connector_detect()
  2018-08-01 21:14 [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs Lyude Paul
                   ` (3 preceding siblings ...)
  2018-08-01 21:14 ` [PATCH v4 4/8] drm/nouveau: Fix deadlock with fb_helper using new helpers Lyude Paul
@ 2018-08-01 21:14 ` Lyude Paul
  2018-08-01 21:14 ` [PATCH v4 6/8] drm/nouveau: Respond to HPDs by probing one conn at a time Lyude Paul
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Lyude Paul @ 2018-08-01 21:14 UTC (permalink / raw)
  To: nouveau
  Cc: stable, Lukas Wunner, Karol Herbst, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

It's true that we can't resume the device from poll workers in
nouveau_connector_detect(). We can, however, prevent the autosuspend
timer from elapsing immediately (if it hasn't already) without risking
any sort of deadlock with the runtime suspend/resume operations. So, do
that instead of avoiding grabbing a power reference entirely.
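
In other words, the detection path now follows this pattern (condensed
from the diff below):

  if (drm_kms_helper_is_poll_worker()) {
          /* Device is already awake; just keep autosuspend from
           * kicking in mid-probe. */
          pm_runtime_get_noresume(dev->dev);
  } else {
          /* Safe to resume the device from every other context. */
          ret = pm_runtime_get_sync(dev->dev);
          if (ret < 0 && ret != -EACCES)
                  return conn_status;
  }

  /* ... probe the connector ... */

  pm_runtime_mark_last_busy(dev->dev);
  pm_runtime_put_autosuspend(dev->dev);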

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Karol Herbst <karolherbst@gmail.com>
---
 drivers/gpu/drm/nouveau/nouveau_connector.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
index 2a45b4c2ceb0..010d6db14cba 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.c
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
@@ -572,12 +572,16 @@ nouveau_connector_detect(struct drm_connector *connector, bool force)
 		nv_connector->edid = NULL;
 	}
 
-	/* Outputs are only polled while runtime active, so acquiring a
-	 * runtime PM ref here is unnecessary (and would deadlock upon
-	 * runtime suspend because it waits for polling to finish).
+	/* Outputs are only polled while runtime active, so resuming the
+	 * device here is unnecessary (and would deadlock upon runtime suspend
+	 * because it waits for polling to finish). We do, however, want to
+	 * prevent the autosuspend timer from elapsing during this operation
+	 * if possible.
 	 */
-	if (!drm_kms_helper_is_poll_worker()) {
-		ret = pm_runtime_get_sync(connector->dev->dev);
+	if (drm_kms_helper_is_poll_worker()) {
+		pm_runtime_get_noresume(dev->dev);
+	} else {
+		ret = pm_runtime_get_sync(dev->dev);
 		if (ret < 0 && ret != -EACCES)
 			return conn_status;
 	}
@@ -655,10 +659,8 @@ nouveau_connector_detect(struct drm_connector *connector, bool force)
 
  out:
 
-	if (!drm_kms_helper_is_poll_worker()) {
-		pm_runtime_mark_last_busy(connector->dev->dev);
-		pm_runtime_put_autosuspend(connector->dev->dev);
-	}
+	pm_runtime_mark_last_busy(dev->dev);
+	pm_runtime_put_autosuspend(dev->dev);
 
 	return conn_status;
 }
-- 
2.17.1


* [PATCH v4 6/8] drm/nouveau: Respond to HPDs by probing one conn at a time
  2018-08-01 21:14 [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs Lyude Paul
                   ` (4 preceding siblings ...)
  2018-08-01 21:14 ` [PATCH v4 5/8] drm/nouveau: Use pm_runtime_get_noresume() in connector_detect() Lyude Paul
@ 2018-08-01 21:14 ` Lyude Paul
  2018-08-01 21:14 ` [PATCH v4 7/8] drm/nouveau: Fix deadlocks in nouveau_connector_detect() Lyude Paul
  2018-08-01 21:14 ` [PATCH v4 8/8] drm/nouveau: Call pm_runtime_get_noresume() from hpd handlers Lyude Paul
  7 siblings, 0 replies; 16+ messages in thread
From: Lyude Paul @ 2018-08-01 21:14 UTC (permalink / raw)
  To: nouveau
  Cc: stable, Lukas Wunner, Karol Herbst, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

There isn't actually any reason we need to call
drm_helper_hpd_irq_event() from our hotplug handler, as we already know
which connector the hotplug event was fired for. We're also going to
need to avoid needlessly probing all connectors from hotplug handlers
anyway, so that we can track when nouveau_connector_detect() is being
called from the context of its own connector's hotplug handler in order
to fix the next deadlocking issue.

This is (slightly) faster anyway!
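
For context, drm_helper_hpd_irq_event() re-probes every HPD-capable
connector, roughly like this (a condensed sketch of the helper's loop;
check_connector() is a stand-in for the status comparison it performs
inline):

  drm_connector_list_iter_begin(dev, &conn_iter);
  drm_for_each_connector_iter(connector, &conn_iter) {
          /* Only connectors with HPD support are considered... */
          if (connector->polled & DRM_CONNECTOR_POLL_HPD)
                  /* ...but every one of them gets re-detected. */
                  changed |= check_connector(connector);
  }
  drm_connector_list_iter_end(&conn_iter);

  if (changed)
          drm_kms_helper_hotplug_event(dev);

Probing just the single connector that fired skips all of that.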

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Karol Herbst <karolherbst@gmail.com>
---
 drivers/gpu/drm/nouveau/nouveau_connector.c | 28 ++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
index 010d6db14cba..9714e09f17db 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.c
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
@@ -1114,6 +1114,32 @@ nouveau_connector_funcs_lvds = {
 	.atomic_get_property = nouveau_conn_atomic_get_property,
 };
 
+static void
+nouveau_connector_hotplug_probe(struct nouveau_connector *nv_conn)
+{
+	struct drm_modeset_acquire_ctx ctx;
+	struct drm_connector *conn = &nv_conn->base;
+	enum drm_connector_status old_status;
+	struct drm_device *dev = conn->dev;
+	bool changed;
+
+	mutex_lock(&dev->mode_config.mutex);
+
+	drm_modeset_acquire_init(&ctx, 0);
+	drm_modeset_lock(&dev->mode_config.connection_mutex, &ctx);
+
+	old_status = conn->status;
+	conn->status = drm_helper_probe_detect(conn, &ctx, true);
+	changed = old_status != conn->status;
+
+	drm_modeset_drop_locks(&ctx);
+	drm_modeset_acquire_fini(&ctx);
+	mutex_unlock(&dev->mode_config.mutex);
+
+	if (changed)
+		drm_kms_helper_hotplug_event(dev);
+}
+
 static int
 nouveau_connector_hotplug(struct nvif_notify *notify)
 {
@@ -1138,7 +1164,7 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
 				nv50_mstm_remove(nv_encoder->dp.mstm);
 		}
 
-		drm_helper_hpd_irq_event(connector->dev);
+		nouveau_connector_hotplug_probe(nv_connector);
 	}
 
 	return NVIF_NOTIFY_KEEP;
-- 
2.17.1


* [PATCH v4 7/8] drm/nouveau: Fix deadlocks in nouveau_connector_detect()
  2018-08-01 21:14 [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs Lyude Paul
                   ` (5 preceding siblings ...)
  2018-08-01 21:14 ` [PATCH v4 6/8] drm/nouveau: Respond to HPDs by probing one conn at a time Lyude Paul
@ 2018-08-01 21:14 ` Lyude Paul
  2018-08-06  9:15   ` Daniel Vetter
  2018-08-01 21:14 ` [PATCH v4 8/8] drm/nouveau: Call pm_runtime_get_noresume() from hpd handlers Lyude Paul
  7 siblings, 1 reply; 16+ messages in thread
From: Lyude Paul @ 2018-08-01 21:14 UTC (permalink / raw)
  To: nouveau
  Cc: stable, Lukas Wunner, Karol Herbst, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

When we disable hotplugging on the GPU, we need to be able to
synchronize with each connector's hotplug interrupt handler before the
interrupt is finally disabled. This can be a problem, however, since
nouveau_connector_detect() currently grabs a runtime power reference
when handling connector probing. This will deadlock the runtime suspend
handler like so:

[  861.480896] INFO: task kworker/0:2:61 blocked for more than 120 seconds.
[  861.483290]       Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
[  861.485158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  861.486332] kworker/0:2     D    0    61      2 0x80000000
[  861.487044] Workqueue: events nouveau_display_hpd_work [nouveau]
[  861.487737] Call Trace:
[  861.488394]  __schedule+0x322/0xaf0
[  861.489070]  schedule+0x33/0x90
[  861.489744]  rpm_resume+0x19c/0x850
[  861.490392]  ? finish_wait+0x90/0x90
[  861.491068]  __pm_runtime_resume+0x4e/0x90
[  861.491753]  nouveau_display_hpd_work+0x22/0x60 [nouveau]
[  861.492416]  process_one_work+0x231/0x620
[  861.493068]  worker_thread+0x44/0x3a0
[  861.493722]  kthread+0x12b/0x150
[  861.494342]  ? wq_pool_ids_show+0x140/0x140
[  861.494991]  ? kthread_create_worker_on_cpu+0x70/0x70
[  861.495648]  ret_from_fork+0x3a/0x50
[  861.496304] INFO: task kworker/6:2:320 blocked for more than 120 seconds.
[  861.496968]       Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
[  861.497654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  861.498341] kworker/6:2     D    0   320      2 0x80000080
[  861.499045] Workqueue: pm pm_runtime_work
[  861.499739] Call Trace:
[  861.500428]  __schedule+0x322/0xaf0
[  861.501134]  ? wait_for_completion+0x104/0x190
[  861.501851]  schedule+0x33/0x90
[  861.502564]  schedule_timeout+0x3a5/0x590
[  861.503284]  ? mark_held_locks+0x58/0x80
[  861.503988]  ? _raw_spin_unlock_irq+0x2c/0x40
[  861.504710]  ? wait_for_completion+0x104/0x190
[  861.505417]  ? trace_hardirqs_on_caller+0xf4/0x190
[  861.506136]  ? wait_for_completion+0x104/0x190
[  861.506845]  wait_for_completion+0x12c/0x190
[  861.507555]  ? wake_up_q+0x80/0x80
[  861.508268]  flush_work+0x1c9/0x280
[  861.508990]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  861.509735]  nvif_notify_put+0xb1/0xc0 [nouveau]
[  861.510482]  nouveau_display_fini+0xbd/0x170 [nouveau]
[  861.511241]  nouveau_display_suspend+0x67/0x120 [nouveau]
[  861.511969]  nouveau_do_suspend+0x5e/0x2d0 [nouveau]
[  861.512715]  nouveau_pmops_runtime_suspend+0x47/0xb0 [nouveau]
[  861.513435]  pci_pm_runtime_suspend+0x6b/0x180
[  861.514165]  ? pci_has_legacy_pm_support+0x70/0x70
[  861.514897]  __rpm_callback+0x7a/0x1d0
[  861.515618]  ? pci_has_legacy_pm_support+0x70/0x70
[  861.516313]  rpm_callback+0x24/0x80
[  861.517027]  ? pci_has_legacy_pm_support+0x70/0x70
[  861.517741]  rpm_suspend+0x142/0x6b0
[  861.518449]  pm_runtime_work+0x97/0xc0
[  861.519144]  process_one_work+0x231/0x620
[  861.519831]  worker_thread+0x44/0x3a0
[  861.520522]  kthread+0x12b/0x150
[  861.521220]  ? wq_pool_ids_show+0x140/0x140
[  861.521925]  ? kthread_create_worker_on_cpu+0x70/0x70
[  861.522622]  ret_from_fork+0x3a/0x50
[  861.523299] INFO: task kworker/6:0:1329 blocked for more than 120 seconds.
[  861.523977]       Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
[  861.524644] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  861.525349] kworker/6:0     D    0  1329      2 0x80000000
[  861.526073] Workqueue: events nvif_notify_work [nouveau]
[  861.526751] Call Trace:
[  861.527411]  __schedule+0x322/0xaf0
[  861.528089]  schedule+0x33/0x90
[  861.528758]  rpm_resume+0x19c/0x850
[  861.529399]  ? finish_wait+0x90/0x90
[  861.530073]  __pm_runtime_resume+0x4e/0x90
[  861.530798]  nouveau_connector_detect+0x7e/0x510 [nouveau]
[  861.531459]  ? ww_mutex_lock+0x47/0x80
[  861.532097]  ? ww_mutex_lock+0x47/0x80
[  861.532819]  ? drm_modeset_lock+0x88/0x130 [drm]
[  861.533481]  drm_helper_probe_detect_ctx+0xa0/0x100 [drm_kms_helper]
[  861.534127]  drm_helper_hpd_irq_event+0xa4/0x120 [drm_kms_helper]
[  861.534940]  nouveau_connector_hotplug+0x98/0x120 [nouveau]
[  861.535556]  nvif_notify_work+0x2d/0xb0 [nouveau]
[  861.536221]  process_one_work+0x231/0x620
[  861.536994]  worker_thread+0x44/0x3a0
[  861.537757]  kthread+0x12b/0x150
[  861.538463]  ? wq_pool_ids_show+0x140/0x140
[  861.539102]  ? kthread_create_worker_on_cpu+0x70/0x70
[  861.539815]  ret_from_fork+0x3a/0x50
[  861.540521]
               Showing all locks held in the system:
[  861.541696] 2 locks held by kworker/0:2/61:
[  861.542406]  #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
[  861.543071]  #1: 0000000076868126 ((work_completion)(&drm->hpd_work)){+.+.}, at: process_one_work+0x1b3/0x620
[  861.543814] 1 lock held by khungtaskd/64:
[  861.544535]  #0: 0000000059db4b53 (rcu_read_lock){....}, at: debug_show_all_locks+0x23/0x185
[  861.545160] 3 locks held by kworker/6:2/320:
[  861.545896]  #0: 00000000d9e1bc59 ((wq_completion)"pm"){+.+.}, at: process_one_work+0x1b3/0x620
[  861.546702]  #1: 00000000c9f92d84 ((work_completion)(&dev->power.work)){+.+.}, at: process_one_work+0x1b3/0x620
[  861.547443]  #2: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at: nouveau_display_fini+0x96/0x170 [nouveau]
[  861.548146] 1 lock held by dmesg/983:
[  861.548889] 2 locks held by zsh/1250:
[  861.549605]  #0: 00000000348e3cf6 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
[  861.550393]  #1: 000000007009a7a8 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0xc1/0x870
[  861.551122] 6 locks held by kworker/6:0/1329:
[  861.551957]  #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
[  861.552765]  #1: 00000000ddb499ad ((work_completion)(&notify->work)#2){+.+.}, at: process_one_work+0x1b3/0x620
[  861.553582]  #2: 000000006e013cbe (&dev->mode_config.mutex){+.+.}, at: drm_helper_hpd_irq_event+0x6c/0x120 [drm_kms_helper]
[  861.554357]  #3: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at: drm_helper_hpd_irq_event+0x78/0x120 [drm_kms_helper]
[  861.555227]  #4: 0000000044f294d9 (crtc_ww_class_acquire){+.+.}, at: drm_helper_probe_detect_ctx+0x3d/0x100 [drm_kms_helper]
[  861.556133]  #5: 00000000db193642 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_lock+0x4b/0x130 [drm]

[  861.557864] =============================================

[  861.559507] NMI backtrace for cpu 2
[  861.560363] CPU: 2 PID: 64 Comm: khungtaskd Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
[  861.561197] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W (1.51 ) 05/18/2018
[  861.561948] Call Trace:
[  861.562757]  dump_stack+0x8e/0xd3
[  861.563516]  nmi_cpu_backtrace.cold.3+0x14/0x5a
[  861.564269]  ? lapic_can_unplug_cpu.cold.27+0x42/0x42
[  861.565029]  nmi_trigger_cpumask_backtrace+0xa1/0xae
[  861.565789]  arch_trigger_cpumask_backtrace+0x19/0x20
[  861.566558]  watchdog+0x316/0x580
[  861.567355]  kthread+0x12b/0x150
[  861.568114]  ? reset_hung_task_detector+0x20/0x20
[  861.568863]  ? kthread_create_worker_on_cpu+0x70/0x70
[  861.569598]  ret_from_fork+0x3a/0x50
[  861.570370] Sending NMI from CPU 2 to CPUs 0-1,3-7:
[  861.571426] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x7f/0x120
[  861.571429] NMI backtrace for cpu 7 skipped: idling at intel_idle+0x7f/0x120
[  861.571432] NMI backtrace for cpu 3 skipped: idling at intel_idle+0x7f/0x120
[  861.571464] NMI backtrace for cpu 5 skipped: idling at intel_idle+0x7f/0x120
[  861.571467] NMI backtrace for cpu 0 skipped: idling at intel_idle+0x7f/0x120
[  861.571469] NMI backtrace for cpu 4 skipped: idling at intel_idle+0x7f/0x120
[  861.571472] NMI backtrace for cpu 1 skipped: idling at intel_idle+0x7f/0x120
[  861.572428] Kernel panic - not syncing: hung_task: blocked tasks

So, fix this with a new trick: store the currently executing task_struct
in the nouveau_connector structure, then avoid attempting to runtime
resume the device when we know that we're just running from the context
of our own hotplug interrupt handler. Since HPD interrupts are only
enabled while the device is runtime active, this should be totally safe.
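
The trick boils down to this (a condensed sketch of the pattern the
diff below implements across the two call sites):

  /* In the hotplug handler: mark which task is running it. */
  nv_connector->hpd_task = current;
  /* ... handle the event, possibly calling nouveau_connector_detect() ... */
  nv_connector->hpd_task = NULL;

  /* In nouveau_connector_detect(): recognize the re-entry. */
  if (drm_kms_helper_is_poll_worker() ||
      nv_connector->hpd_task == current)
          pm_runtime_get_noresume(dev->dev);   /* device already awake */
  else
          ret = pm_runtime_get_sync(dev->dev); /* may resume; safe here */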

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Karol Herbst <karolherbst@gmail.com>
---
 drivers/gpu/drm/nouveau/nouveau_connector.c | 16 ++++++++++------
 drivers/gpu/drm/nouveau/nouveau_connector.h |  1 +
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
index 9714e09f17db..8409c3f2c3a1 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.c
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
@@ -572,13 +572,14 @@ nouveau_connector_detect(struct drm_connector *connector, bool force)
 		nv_connector->edid = NULL;
 	}
 
-	/* Outputs are only polled while runtime active, so resuming the
-	 * device here is unnecessary (and would deadlock upon runtime suspend
-	 * because it waits for polling to finish). We do, however, want to
-	 * prevent the autosuspend timer from elapsing during this operation
-	 * if possible.
+	/* Output polling and HPD only happen while we're runtime active, so
+	 * resuming the device here is unnecessary (and would deadlock upon
+	 * runtime suspend because it waits for polling to finish). We do,
+	 * however, want to prevent the autosuspend timer from elapsing during
+	 * this operation if possible.
 	 */
-	if (drm_kms_helper_is_poll_worker()) {
+	if (drm_kms_helper_is_poll_worker() ||
+	    nv_connector->hpd_task == current) {
 		pm_runtime_get_noresume(dev->dev);
 	} else {
 		ret = pm_runtime_get_sync(dev->dev);
@@ -1151,6 +1152,8 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
 	const char *name = connector->name;
 	struct nouveau_encoder *nv_encoder;
 
+	nv_connector->hpd_task = current;
+
 	if (rep->mask & NVIF_NOTIFY_CONN_V0_IRQ) {
 		NV_DEBUG(drm, "service %s\n", name);
 		if ((nv_encoder = find_encoder(connector, DCB_OUTPUT_DP)))
@@ -1167,6 +1170,7 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
 		nouveau_connector_hotplug_probe(nv_connector);
 	}
 
+	nv_connector->hpd_task = NULL;
 	return NVIF_NOTIFY_KEEP;
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.h b/drivers/gpu/drm/nouveau/nouveau_connector.h
index 2d9d35a146a4..1964e682ba13 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.h
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.h
@@ -45,6 +45,7 @@ struct nouveau_connector {
 	u8 *dcb;
 
 	struct nvif_notify hpd;
+	struct task_struct *hpd_task;
 
 	struct drm_dp_aux aux;
 
-- 
2.17.1


* [PATCH v4 8/8] drm/nouveau: Call pm_runtime_get_noresume() from hpd handlers
  2018-08-01 21:14 [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs Lyude Paul
                   ` (6 preceding siblings ...)
  2018-08-01 21:14 ` [PATCH v4 7/8] drm/nouveau: Fix deadlocks in nouveau_connector_detect() Lyude Paul
@ 2018-08-01 21:14 ` Lyude Paul
  2018-08-02 19:40   ` Lukas Wunner
  7 siblings, 1 reply; 16+ messages in thread
From: Lyude Paul @ 2018-08-01 21:14 UTC (permalink / raw)
  To: nouveau
  Cc: stable, Lukas Wunner, Karol Herbst, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

We can't and don't need to try resuming the device from our hotplug
handlers, but hotplug events are generally something we'd like to keep
the device awake for whenever possible. So, grab a PM ref safely in our
hotplug handlers using pm_runtime_get_noresume() and mark the device as
busy once we're finished.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Karol Herbst <karolherbst@gmail.com>
---
 drivers/gpu/drm/nouveau/nouveau_connector.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
index 8409c3f2c3a1..5a8e8c1ad647 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.c
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
@@ -1152,6 +1152,11 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
 	const char *name = connector->name;
 	struct nouveau_encoder *nv_encoder;
 
+	/* Resuming the device here isn't possible, but the suspend PM ops
+	 * will wait for us to finish our work before disabling us, so this
+	 * should be enough.
+	 */
+	pm_runtime_get_noresume(drm->dev->dev);
 	nv_connector->hpd_task = current;
 
 	if (rep->mask & NVIF_NOTIFY_CONN_V0_IRQ) {
@@ -1171,6 +1176,9 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
 	}
 
 	nv_connector->hpd_task = NULL;
+
+	pm_runtime_mark_last_busy(drm->dev->dev);
+	pm_runtime_put_autosuspend(drm->dev->dev);
 	return NVIF_NOTIFY_KEEP;
 }
 
-- 
2.17.1


* Re: [PATCH v4 1/8] drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement
  2018-08-01 21:14 ` [PATCH v4 1/8] drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement Lyude Paul
@ 2018-08-02 19:10   ` Lukas Wunner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukas Wunner @ 2018-08-02 19:10 UTC (permalink / raw)
  To: Lyude Paul
  Cc: nouveau, Peter Ujfalusi, stable, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

Hi Lyude,

thanks a lot for your persistence in attacking these runtime PM
issues in nouveau.  v4 of this series looks very good overall.
I only have comments on patches 1, 2 and 8.  On this patch 1
it's merely a nit:

On Wed, Aug 01, 2018 at 05:14:51PM -0400, Lyude Paul wrote:
> --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> @@ -835,7 +835,6 @@ nouveau_pmops_runtime_suspend(struct device *dev)
>  		return -EBUSY;
>  	}
>  
> -	drm_kms_helper_poll_disable(drm_dev);
>  	nouveau_switcheroo_optimus_dsm();
>  	ret = nouveau_do_suspend(drm_dev, true);
>  	pci_save_state(pdev);

This third hunk in the patch lacks an explanation in the commit message.
I think its rationale is that drm_kms_helper_poll_disable() is already
invoked in nouveau_display_fini(), which gets called on ->runtime_suspend,
hence need not be invoked once more in nouveau_pmops_runtime_suspend().

Thanks,

Lukas

* Re: [PATCH v4 2/8] drm/nouveau: Enable polling even if we have runtime PM
  2018-08-01 21:14 ` [PATCH v4 2/8] drm/nouveau: Enable polling even if we have runtime PM Lyude Paul
@ 2018-08-02 19:14   ` Lukas Wunner
  2018-08-02 19:37     ` Lyude Paul
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Wunner @ 2018-08-02 19:14 UTC (permalink / raw)
  To: Lyude Paul
  Cc: nouveau, Peter Ujfalusi, stable, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

On Wed, Aug 01, 2018 at 05:14:52PM -0400, Lyude Paul wrote:
> --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> @@ -592,10 +592,11 @@ nouveau_drm_load(struct drm_device *dev, unsigned long flags)
>  		pm_runtime_allow(dev->dev);
>  		pm_runtime_mark_last_busy(dev->dev);
>  		pm_runtime_put(dev->dev);
> -	} else {
> -		/* enable polling for external displays */
> -		drm_kms_helper_poll_enable(dev);
>  	}
> +
> +	/* enable polling for connectors without hpd */
> +	drm_kms_helper_poll_enable(dev);
> +

I'm wondering why drm_kms_helper_poll_enable() is called here at all.
Does the invocation in nouveau_display_init() not suffice?  Can there
be a situation when nouveau_display_init() is not called despite there
being connectors that need to be polled?

Thanks,

Lukas

* Re: [PATCH v4 2/8] drm/nouveau: Enable polling even if we have runtime PM
  2018-08-02 19:14   ` Lukas Wunner
@ 2018-08-02 19:37     ` Lyude Paul
  0 siblings, 0 replies; 16+ messages in thread
From: Lyude Paul @ 2018-08-02 19:37 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: nouveau, Peter Ujfalusi, stable, Ben Skeggs, David Airlie,
	dri-devel, linux-kernel

oooh, thanks for pointing that out. I should have grepped for this function
before doing that commit. There also appears to be another duplicate
drm_kms_helper_poll_enable() call in nouveau_vga.c which looks equally
suspicious and useless. I will fix this up and also remove that other
duplicate call in the next version of this series.

On Thu, 2018-08-02 at 21:14 +0200, Lukas Wunner wrote:
> On Wed, Aug 01, 2018 at 05:14:52PM -0400, Lyude Paul wrote:
> > --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> > @@ -592,10 +592,11 @@ nouveau_drm_load(struct drm_device *dev, unsigned long flags)
> >  		pm_runtime_allow(dev->dev);
> >  		pm_runtime_mark_last_busy(dev->dev);
> >  		pm_runtime_put(dev->dev);
> > -	} else {
> > -		/* enable polling for external displays */
> > -		drm_kms_helper_poll_enable(dev);
> >  	}
> > +
> > +	/* enable polling for connectors without hpd */
> > +	drm_kms_helper_poll_enable(dev);
> > +
> 
> I'm wondering why drm_kms_helper_poll_enable() is called here at all.
> Does the invocation in nouveau_display_init() not suffice?  Can there
> be a situation when nouveau_display_init() is not called despite there
> being connectors that need to be polled?
> 
> Thanks,
> 
> Lukas
-- 
Cheers,
	Lyude Paul


* Re: [PATCH v4 8/8] drm/nouveau: Call pm_runtime_get_noresume() from hpd handlers
  2018-08-01 21:14 ` [PATCH v4 8/8] drm/nouveau: Call pm_runtime_get_noresume() from hpd handlers Lyude Paul
@ 2018-08-02 19:40   ` Lukas Wunner
  2018-08-02 20:00     ` Lyude Paul
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Wunner @ 2018-08-02 19:40 UTC (permalink / raw)
  To: Lyude Paul
  Cc: nouveau, Karol Herbst, Ben Skeggs, David Airlie, dri-devel, linux-kernel

On Wed, Aug 01, 2018 at 05:14:58PM -0400, Lyude Paul wrote:
> We can't and don't need to try resuming the device from our hotplug
> handlers, but hotplug events are generally something we'd like to keep
> the device awake for whenever possible. So, grab a PM ref safely in our
> hotplug handlers using pm_runtime_get_noresume() and mark the device as
> busy once we're finished.

Patch looks fine in principle, but doesn't seem to be sufficient to fix
the following race:

1. runtime_suspend commences
2. user plugs in a display before the runtime_suspend worker disables
   hotplug interrupts in nouveau_display_fini()
3. hotplug is handled, display is lit up
4. runtime_suspend worker waits for hotplug handler to finish
5. GPU is runtime suspended and the newly plugged in display goes black

The call to pm_runtime_mark_last_busy() has no effect in this situation
because rpm_suspend() doesn't look at the last_busy variable after the
call to rpm_callback().  What's necessary here is that runtime_suspend is
aborted and -EBUSY returned.
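
Something like this hypothetical sketch is what I have in mind (the
hpd_in_flight counter doesn't exist in this series; it's only meant to
illustrate the idea, and it glosses over the check vs. increment race):

  /* In the hotplug handler: */
  atomic_inc(&drm->hpd_in_flight);
  /* ... light up the display ... */
  atomic_dec(&drm->hpd_in_flight);

  /* In nouveau_pmops_runtime_suspend(): */
  if (atomic_read(&drm->hpd_in_flight))
          return -EBUSY;  /* abort; the RPM core will retry later */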

Thanks,

Lukas

> 
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Cc: stable@vger.kernel.org
> Cc: Lukas Wunner <lukas@wunner.de>
> Cc: Karol Herbst <karolherbst@gmail.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_connector.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
> index 8409c3f2c3a1..5a8e8c1ad647 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_connector.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
> @@ -1152,6 +1152,11 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
>  	const char *name = connector->name;
>  	struct nouveau_encoder *nv_encoder;
>  
> +	/* Resuming the device here isn't possible, but the suspend PM ops
> +	 * will wait for us to finish our work before disabling us, so this
> +	 * should be enough.
> +	 */
> +	pm_runtime_get_noresume(drm->dev->dev);
>  	nv_connector->hpd_task = current;
>  
>  	if (rep->mask & NVIF_NOTIFY_CONN_V0_IRQ) {
> @@ -1171,6 +1176,9 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
>  	}
>  
>  	nv_connector->hpd_task = NULL;
> +
> +	pm_runtime_mark_last_busy(drm->dev->dev);
> +	pm_runtime_put_autosuspend(drm->dev->dev);
>  	return NVIF_NOTIFY_KEEP;
>  }
>  
> -- 
> 2.17.1
> 

* Re: [PATCH v4 8/8] drm/nouveau: Call pm_runtime_get_noresume() from hpd handlers
  2018-08-02 19:40   ` Lukas Wunner
@ 2018-08-02 20:00     ` Lyude Paul
  0 siblings, 0 replies; 16+ messages in thread
From: Lyude Paul @ 2018-08-02 20:00 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: nouveau, Karol Herbst, Ben Skeggs, David Airlie, dri-devel, linux-kernel

On Thu, 2018-08-02 at 21:40 +0200, Lukas Wunner wrote:
> On Wed, Aug 01, 2018 at 05:14:58PM -0400, Lyude Paul wrote:
> > We can't and don't need to try resuming the device from our hotplug
> > handlers, but hotplug events are generally something we'd like to keep
> > the device awake for whenever possible. So, grab a PM ref safely in our
> > hotplug handlers using pm_runtime_get_noresume() and mark the device as
> > busy once we're finished.
> 
> Patch looks fine in principle, but doesn't seem to be sufficient to fix
> the following race:
> 
> 1. runtime_suspend commences
> 2. user plugs in a display before the runtime_suspend worker disables
>    hotplug interrupts in nouveau_display_fini()
> 3. hotplug is handled, display is lit up
> 4. runtime_suspend worker waits for hotplug handler to finish
> 5. GPU is runtime suspended and the newly plugged in display goes black
> 
> The call to pm_runtime_mark_last_busy() has no effect in this situation
> because rpm_suspend() doesn't look at the last_busy variable after the
> call to rpm_callback().  What's necessary here is that runtime_suspend is
> aborted and -EBUSY returned.

So: that wasn't actually what this patch was meant to fix. It's just meant
to make it a little more likely that the GPU won't fall asleep while doing
connector probing, since there's no risk in handling it here.

That being said, I think there are some parts you're missing here. Mainly
that regardless of whether or not the GPU ends up getting runtime suspended
here, a hotplug event has still been propagated to userspace. If fbcon isn't
the one in charge in that scenario (e.g. we have gnome-shell, X, etc. running)
then whoever is in charge can still respond to the hotplug as usual, and the
probing from userspace will wake the GPU back up again, since
nouveau_connector_detect() will grab a runtime PM ref, as it's not being
called from the hpd context. I don't think this is the case yet for MST
connectors, since they don't use nouveau_connector_detect(); I'll see if I
can fix that as well in the next patch series.

Now, what you pointed out made me realize I'm not sure that would apply when
fbcon does happen to be the one in control, since fb_helper will have its
hotplug handling suspended at this point, which means it won't respond to
the connector change until the GPU is runtime resumed by something else.
That definitely should get fixed; a rough sketch of the idea follows.

Also: let me know if any of this doesn't sound correct; there's so much to
this that I wouldn't be surprised if I missed something else.
> 
> Thanks,
> 
> Lukas
> 
> > 
> > Signed-off-by: Lyude Paul <lyude@redhat.com>
> > Cc: stable@vger.kernel.org
> > Cc: Lukas Wunner <lukas@wunner.de>
> > Cc: Karol Herbst <karolherbst@gmail.com>
> > ---
> >  drivers/gpu/drm/nouveau/nouveau_connector.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c
> > b/drivers/gpu/drm/nouveau/nouveau_connector.c
> > index 8409c3f2c3a1..5a8e8c1ad647 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_connector.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
> > @@ -1152,6 +1152,11 @@ nouveau_connector_hotplug(struct nvif_notify
> > *notify)
> >  	const char *name = connector->name;
> >  	struct nouveau_encoder *nv_encoder;
> >  
> > +	/* Resuming the device here isn't possible; but the suspend PM ops
> > +	 * will wait for us to finish our work before disabling us so this
> > +	 * should be enough
> > +	 */
> > +	pm_runtime_get_noresume(drm->dev->dev);
> >  	nv_connector->hpd_task = current;
> >  
> >  	if (rep->mask & NVIF_NOTIFY_CONN_V0_IRQ) {
> > @@ -1171,6 +1176,9 @@ nouveau_connector_hotplug(struct nvif_notify
> > *notify)
> >  	}
> >  
> >  	nv_connector->hpd_task = NULL;
> > +
> > +	pm_runtime_mark_last_busy(drm->dev->dev);
> > +	pm_runtime_put_autosuspend(drm->dev->dev);
> >  	return NVIF_NOTIFY_KEEP;
> >  }
> >  
> > -- 
> > 2.17.1
> > 
-- 
Cheers,
	Lyude Paul


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 7/8] drm/nouveau: Fix deadlocks in nouveau_connector_detect()
  2018-08-01 21:14 ` [PATCH v4 7/8] drm/nouveau: Fix deadlocks in nouveau_connector_detect() Lyude Paul
@ 2018-08-06  9:15   ` Daniel Vetter
  2018-08-06 19:29     ` Lyude Paul
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2018-08-06  9:15 UTC (permalink / raw)
  To: Lyude Paul
  Cc: nouveau, David Airlie, Karol Herbst, dri-devel, linux-kernel,
	stable, Ben Skeggs

On Wed, Aug 01, 2018 at 05:14:57PM -0400, Lyude Paul wrote:
> When we disable hotplugging on the GPU, we need to be able to
> synchronize with each connector's hotplug interrupt handler before the
> interrupt is finally disabled. This can be a problem however, since
> nouveau_connector_detect() currently grabs a runtime power reference
> when handling connector probing. This will deadlock the runtime suspend
> handler like so:
> 
> [  861.480896] INFO: task kworker/0:2:61 blocked for more than 120 seconds.
> [  861.483290]       Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
> [  861.485158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  861.486332] kworker/0:2     D    0    61      2 0x80000000
> [  861.487044] Workqueue: events nouveau_display_hpd_work [nouveau]
> [  861.487737] Call Trace:
> [  861.488394]  __schedule+0x322/0xaf0
> [  861.489070]  schedule+0x33/0x90
> [  861.489744]  rpm_resume+0x19c/0x850
> [  861.490392]  ? finish_wait+0x90/0x90
> [  861.491068]  __pm_runtime_resume+0x4e/0x90
> [  861.491753]  nouveau_display_hpd_work+0x22/0x60 [nouveau]
> [  861.492416]  process_one_work+0x231/0x620
> [  861.493068]  worker_thread+0x44/0x3a0
> [  861.493722]  kthread+0x12b/0x150
> [  861.494342]  ? wq_pool_ids_show+0x140/0x140
> [  861.494991]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  861.495648]  ret_from_fork+0x3a/0x50
> [  861.496304] INFO: task kworker/6:2:320 blocked for more than 120 seconds.
> [  861.496968]       Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
> [  861.497654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  861.498341] kworker/6:2     D    0   320      2 0x80000080
> [  861.499045] Workqueue: pm pm_runtime_work
> [  861.499739] Call Trace:
> [  861.500428]  __schedule+0x322/0xaf0
> [  861.501134]  ? wait_for_completion+0x104/0x190
> [  861.501851]  schedule+0x33/0x90
> [  861.502564]  schedule_timeout+0x3a5/0x590
> [  861.503284]  ? mark_held_locks+0x58/0x80
> [  861.503988]  ? _raw_spin_unlock_irq+0x2c/0x40
> [  861.504710]  ? wait_for_completion+0x104/0x190
> [  861.505417]  ? trace_hardirqs_on_caller+0xf4/0x190
> [  861.506136]  ? wait_for_completion+0x104/0x190
> [  861.506845]  wait_for_completion+0x12c/0x190
> [  861.507555]  ? wake_up_q+0x80/0x80
> [  861.508268]  flush_work+0x1c9/0x280
> [  861.508990]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> [  861.509735]  nvif_notify_put+0xb1/0xc0 [nouveau]
> [  861.510482]  nouveau_display_fini+0xbd/0x170 [nouveau]
> [  861.511241]  nouveau_display_suspend+0x67/0x120 [nouveau]
> [  861.511969]  nouveau_do_suspend+0x5e/0x2d0 [nouveau]
> [  861.512715]  nouveau_pmops_runtime_suspend+0x47/0xb0 [nouveau]
> [  861.513435]  pci_pm_runtime_suspend+0x6b/0x180
> [  861.514165]  ? pci_has_legacy_pm_support+0x70/0x70
> [  861.514897]  __rpm_callback+0x7a/0x1d0
> [  861.515618]  ? pci_has_legacy_pm_support+0x70/0x70
> [  861.516313]  rpm_callback+0x24/0x80
> [  861.517027]  ? pci_has_legacy_pm_support+0x70/0x70
> [  861.517741]  rpm_suspend+0x142/0x6b0
> [  861.518449]  pm_runtime_work+0x97/0xc0
> [  861.519144]  process_one_work+0x231/0x620
> [  861.519831]  worker_thread+0x44/0x3a0
> [  861.520522]  kthread+0x12b/0x150
> [  861.521220]  ? wq_pool_ids_show+0x140/0x140
> [  861.521925]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  861.522622]  ret_from_fork+0x3a/0x50
> [  861.523299] INFO: task kworker/6:0:1329 blocked for more than 120 seconds.
> [  861.523977]       Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
> [  861.524644] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  861.525349] kworker/6:0     D    0  1329      2 0x80000000
> [  861.526073] Workqueue: events nvif_notify_work [nouveau]
> [  861.526751] Call Trace:
> [  861.527411]  __schedule+0x322/0xaf0
> [  861.528089]  schedule+0x33/0x90
> [  861.528758]  rpm_resume+0x19c/0x850
> [  861.529399]  ? finish_wait+0x90/0x90
> [  861.530073]  __pm_runtime_resume+0x4e/0x90
> [  861.530798]  nouveau_connector_detect+0x7e/0x510 [nouveau]
> [  861.531459]  ? ww_mutex_lock+0x47/0x80
> [  861.532097]  ? ww_mutex_lock+0x47/0x80
> [  861.532819]  ? drm_modeset_lock+0x88/0x130 [drm]
> [  861.533481]  drm_helper_probe_detect_ctx+0xa0/0x100 [drm_kms_helper]
> [  861.534127]  drm_helper_hpd_irq_event+0xa4/0x120 [drm_kms_helper]
> [  861.534940]  nouveau_connector_hotplug+0x98/0x120 [nouveau]
> [  861.535556]  nvif_notify_work+0x2d/0xb0 [nouveau]
> [  861.536221]  process_one_work+0x231/0x620
> [  861.536994]  worker_thread+0x44/0x3a0
> [  861.537757]  kthread+0x12b/0x150
> [  861.538463]  ? wq_pool_ids_show+0x140/0x140
> [  861.539102]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  861.539815]  ret_from_fork+0x3a/0x50
> [  861.540521]
>                Showing all locks held in the system:
> [  861.541696] 2 locks held by kworker/0:2/61:
> [  861.542406]  #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
> [  861.543071]  #1: 0000000076868126 ((work_completion)(&drm->hpd_work)){+.+.}, at: process_one_work+0x1b3/0x620
> [  861.543814] 1 lock held by khungtaskd/64:
> [  861.544535]  #0: 0000000059db4b53 (rcu_read_lock){....}, at: debug_show_all_locks+0x23/0x185
> [  861.545160] 3 locks held by kworker/6:2/320:
> [  861.545896]  #0: 00000000d9e1bc59 ((wq_completion)"pm"){+.+.}, at: process_one_work+0x1b3/0x620
> [  861.546702]  #1: 00000000c9f92d84 ((work_completion)(&dev->power.work)){+.+.}, at: process_one_work+0x1b3/0x620
> [  861.547443]  #2: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at: nouveau_display_fini+0x96/0x170 [nouveau]
> [  861.548146] 1 lock held by dmesg/983:
> [  861.548889] 2 locks held by zsh/1250:
> [  861.549605]  #0: 00000000348e3cf6 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> [  861.550393]  #1: 000000007009a7a8 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0xc1/0x870
> [  861.551122] 6 locks held by kworker/6:0/1329:
> [  861.551957]  #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
> [  861.552765]  #1: 00000000ddb499ad ((work_completion)(&notify->work)#2){+.+.}, at: process_one_work+0x1b3/0x620
> [  861.553582]  #2: 000000006e013cbe (&dev->mode_config.mutex){+.+.}, at: drm_helper_hpd_irq_event+0x6c/0x120 [drm_kms_helper]
> [  861.554357]  #3: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at: drm_helper_hpd_irq_event+0x78/0x120 [drm_kms_helper]
> [  861.555227]  #4: 0000000044f294d9 (crtc_ww_class_acquire){+.+.}, at: drm_helper_probe_detect_ctx+0x3d/0x100 [drm_kms_helper]
> [  861.556133]  #5: 00000000db193642 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_lock+0x4b/0x130 [drm]
> 
> [  861.557864] =============================================
> 
> [  861.559507] NMI backtrace for cpu 2
> [  861.560363] CPU: 2 PID: 64 Comm: khungtaskd Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
> [  861.561197] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W (1.51 ) 05/18/2018
> [  861.561948] Call Trace:
> [  861.562757]  dump_stack+0x8e/0xd3
> [  861.563516]  nmi_cpu_backtrace.cold.3+0x14/0x5a
> [  861.564269]  ? lapic_can_unplug_cpu.cold.27+0x42/0x42
> [  861.565029]  nmi_trigger_cpumask_backtrace+0xa1/0xae
> [  861.565789]  arch_trigger_cpumask_backtrace+0x19/0x20
> [  861.566558]  watchdog+0x316/0x580
> [  861.567355]  kthread+0x12b/0x150
> [  861.568114]  ? reset_hung_task_detector+0x20/0x20
> [  861.568863]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  861.569598]  ret_from_fork+0x3a/0x50
> [  861.570370] Sending NMI from CPU 2 to CPUs 0-1,3-7:
> [  861.571426] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x7f/0x120
> [  861.571429] NMI backtrace for cpu 7 skipped: idling at intel_idle+0x7f/0x120
> [  861.571432] NMI backtrace for cpu 3 skipped: idling at intel_idle+0x7f/0x120
> [  861.571464] NMI backtrace for cpu 5 skipped: idling at intel_idle+0x7f/0x120
> [  861.571467] NMI backtrace for cpu 0 skipped: idling at intel_idle+0x7f/0x120
> [  861.571469] NMI backtrace for cpu 4 skipped: idling at intel_idle+0x7f/0x120
> [  861.571472] NMI backtrace for cpu 1 skipped: idling at intel_idle+0x7f/0x120
> [  861.572428] Kernel panic - not syncing: hung_task: blocked tasks
> 
> So: fix this with a new trick; store the current task_struct that's
> executing in the nouveau_connector structure, then avoid attempting to
> runtime resume the device when we know that we're just running from the
> context of our hotplug interrupt handler. Since hpd interrupts are only
> enabled while the device is runtime active, this should be totally safe.
> 
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Cc: stable@vger.kernel.org
> Cc: Lukas Wunner <lukas@wunner.de>
> Cc: Karol Herbst <karolherbst@gmail.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_connector.c | 16 ++++++++++------
>  drivers/gpu/drm/nouveau/nouveau_connector.h |  1 +
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
> index 9714e09f17db..8409c3f2c3a1 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_connector.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
> @@ -572,13 +572,14 @@ nouveau_connector_detect(struct drm_connector *connector, bool force)
>  		nv_connector->edid = NULL;
>  	}
>  
> -	/* Outputs are only polled while runtime active, so resuming the
> -	 * device here is unnecessary (and would deadlock upon runtime suspend
> -	 * because it waits for polling to finish). We do however, want to
> -	 * prevent the autosuspend timer from elapsing during this operation
> -	 * if possible.
> +	/* Output polling and HPD only happens while we're runtime active, so
> +	 * resuming the device here is unnecessary (and would deadlock upon
> +	 * runtime suspend because it waits for polling to finish). We do
> +	 * however, want to prevent the autosuspend timer from elapsing during
> +	 * this operation if possible.
>  	 */
> -	if (drm_kms_helper_is_poll_worker()) {
> +	if (drm_kms_helper_is_poll_worker() ||

Uh, this entirely slipped through my radar. Looks very much like a hack.
If you want to avoid doing runtime pm from the poll worker, imo the right
thing to do would be to use pm_runtime_get_if_in_use() and skip the entire
detection (returning the current state) if you can't get a runtime pm
reference. You could also use the force parameter to figure out whether
userspace is asking for this information or not.
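
Roughly what I have in mind, as an untested sketch (the cached-status
fallback is illustrative, not nouveau's actual code):

#include <linux/pm_runtime.h>
#include <drm/drm_connector.h>

static enum drm_connector_status
example_detect(struct drm_connector *connector, bool force)
{
        struct device *dev = connector->dev->dev;
        enum drm_connector_status status;

        /* Probe only if the device is runtime active *and* someone
         * already holds a reference; otherwise return the cached state
         * instead of waking the GPU (or deadlocking against a runtime
         * suspend that's already in progress). An explicit userspace
         * probe (force == true) could be special-cased here.
         */
        if (pm_runtime_get_if_in_use(dev) <= 0)
                return connector->status;

        status = connector_status_unknown; /* ... real probing here ... */

        pm_runtime_mark_last_busy(dev);
        pm_runtime_put_autosuspend(dev);
        return status;
}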

Otoh it's a bit nasty to stop the polling (if the hw really requires it)
just because the display is suspended, which makes waking up the machine
just a notch more annoying. E.g. in i915 we keep polling (and use some ACPI
methods if available for hpd detection, to waste less power on rpm
wakeups).
-Daniel

> +	    nv_connector->hpd_task == current) {
>  		pm_runtime_get_noresume(dev->dev);
>  	} else {
>  		ret = pm_runtime_get_sync(dev->dev);
> @@ -1151,6 +1152,8 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
>  	const char *name = connector->name;
>  	struct nouveau_encoder *nv_encoder;
>  
> +	nv_connector->hpd_task = current;
> +
>  	if (rep->mask & NVIF_NOTIFY_CONN_V0_IRQ) {
>  		NV_DEBUG(drm, "service %s\n", name);
>  		if ((nv_encoder = find_encoder(connector, DCB_OUTPUT_DP)))
> @@ -1167,6 +1170,7 @@ nouveau_connector_hotplug(struct nvif_notify *notify)
>  		nouveau_connector_hotplug_probe(nv_connector);
>  	}
>  
> +	nv_connector->hpd_task = NULL;
>  	return NVIF_NOTIFY_KEEP;
>  }
>  
> diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.h b/drivers/gpu/drm/nouveau/nouveau_connector.h
> index 2d9d35a146a4..1964e682ba13 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_connector.h
> +++ b/drivers/gpu/drm/nouveau/nouveau_connector.h
> @@ -45,6 +45,7 @@ struct nouveau_connector {
>  	u8 *dcb;
>  
>  	struct nvif_notify hpd;
> +	struct task_struct *hpd_task;
>  
>  	struct drm_dp_aux aux;
>  
> -- 
> 2.17.1
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 7/8] drm/nouveau: Fix deadlocks in nouveau_connector_detect()
  2018-08-06  9:15   ` Daniel Vetter
@ 2018-08-06 19:29     ` Lyude Paul
  0 siblings, 0 replies; 16+ messages in thread
From: Lyude Paul @ 2018-08-06 19:29 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: nouveau, David Airlie, Karol Herbst, dri-devel, linux-kernel,
	stable, Ben Skeggs

On Mon, 2018-08-06 at 11:15 +0200, Daniel Vetter wrote:
> On Wed, Aug 01, 2018 at 05:14:57PM -0400, Lyude Paul wrote:
> > When we disable hotplugging on the GPU, we need to be able to
> > synchronize with each connector's hotplug interrupt handler before the
> > interrupt is finally disabled. This can be a problem however, since
> > nouveau_connector_detect() currently grabs a runtime power reference
> > when handling connector probing. This will deadlock the runtime suspend
> > handler like so:
> > 
> > [  861.480896] INFO: task kworker/0:2:61 blocked for more than 120
> > seconds.
> > [  861.483290]       Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
> > [  861.485158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [  861.486332] kworker/0:2     D    0    61      2 0x80000000
> > [  861.487044] Workqueue: events nouveau_display_hpd_work [nouveau]
> > [  861.487737] Call Trace:
> > [  861.488394]  __schedule+0x322/0xaf0
> > [  861.489070]  schedule+0x33/0x90
> > [  861.489744]  rpm_resume+0x19c/0x850
> > [  861.490392]  ? finish_wait+0x90/0x90
> > [  861.491068]  __pm_runtime_resume+0x4e/0x90
> > [  861.491753]  nouveau_display_hpd_work+0x22/0x60 [nouveau]
> > [  861.492416]  process_one_work+0x231/0x620
> > [  861.493068]  worker_thread+0x44/0x3a0
> > [  861.493722]  kthread+0x12b/0x150
> > [  861.494342]  ? wq_pool_ids_show+0x140/0x140
> > [  861.494991]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [  861.495648]  ret_from_fork+0x3a/0x50
> > [  861.496304] INFO: task kworker/6:2:320 blocked for more than 120
> > seconds.
> > [  861.496968]       Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
> > [  861.497654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [  861.498341] kworker/6:2     D    0   320      2 0x80000080
> > [  861.499045] Workqueue: pm pm_runtime_work
> > [  861.499739] Call Trace:
> > [  861.500428]  __schedule+0x322/0xaf0
> > [  861.501134]  ? wait_for_completion+0x104/0x190
> > [  861.501851]  schedule+0x33/0x90
> > [  861.502564]  schedule_timeout+0x3a5/0x590
> > [  861.503284]  ? mark_held_locks+0x58/0x80
> > [  861.503988]  ? _raw_spin_unlock_irq+0x2c/0x40
> > [  861.504710]  ? wait_for_completion+0x104/0x190
> > [  861.505417]  ? trace_hardirqs_on_caller+0xf4/0x190
> > [  861.506136]  ? wait_for_completion+0x104/0x190
> > [  861.506845]  wait_for_completion+0x12c/0x190
> > [  861.507555]  ? wake_up_q+0x80/0x80
> > [  861.508268]  flush_work+0x1c9/0x280
> > [  861.508990]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> > [  861.509735]  nvif_notify_put+0xb1/0xc0 [nouveau]
> > [  861.510482]  nouveau_display_fini+0xbd/0x170 [nouveau]
> > [  861.511241]  nouveau_display_suspend+0x67/0x120 [nouveau]
> > [  861.511969]  nouveau_do_suspend+0x5e/0x2d0 [nouveau]
> > [  861.512715]  nouveau_pmops_runtime_suspend+0x47/0xb0 [nouveau]
> > [  861.513435]  pci_pm_runtime_suspend+0x6b/0x180
> > [  861.514165]  ? pci_has_legacy_pm_support+0x70/0x70
> > [  861.514897]  __rpm_callback+0x7a/0x1d0
> > [  861.515618]  ? pci_has_legacy_pm_support+0x70/0x70
> > [  861.516313]  rpm_callback+0x24/0x80
> > [  861.517027]  ? pci_has_legacy_pm_support+0x70/0x70
> > [  861.517741]  rpm_suspend+0x142/0x6b0
> > [  861.518449]  pm_runtime_work+0x97/0xc0
> > [  861.519144]  process_one_work+0x231/0x620
> > [  861.519831]  worker_thread+0x44/0x3a0
> > [  861.520522]  kthread+0x12b/0x150
> > [  861.521220]  ? wq_pool_ids_show+0x140/0x140
> > [  861.521925]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [  861.522622]  ret_from_fork+0x3a/0x50
> > [  861.523299] INFO: task kworker/6:0:1329 blocked for more than 120
> > seconds.
> > [  861.523977]       Tainted: G           O      4.18.0-rc6Lyude-Test+ #1
> > [  861.524644] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [  861.525349] kworker/6:0     D    0  1329      2 0x80000000
> > [  861.526073] Workqueue: events nvif_notify_work [nouveau]
> > [  861.526751] Call Trace:
> > [  861.527411]  __schedule+0x322/0xaf0
> > [  861.528089]  schedule+0x33/0x90
> > [  861.528758]  rpm_resume+0x19c/0x850
> > [  861.529399]  ? finish_wait+0x90/0x90
> > [  861.530073]  __pm_runtime_resume+0x4e/0x90
> > [  861.530798]  nouveau_connector_detect+0x7e/0x510 [nouveau]
> > [  861.531459]  ? ww_mutex_lock+0x47/0x80
> > [  861.532097]  ? ww_mutex_lock+0x47/0x80
> > [  861.532819]  ? drm_modeset_lock+0x88/0x130 [drm]
> > [  861.533481]  drm_helper_probe_detect_ctx+0xa0/0x100 [drm_kms_helper]
> > [  861.534127]  drm_helper_hpd_irq_event+0xa4/0x120 [drm_kms_helper]
> > [  861.534940]  nouveau_connector_hotplug+0x98/0x120 [nouveau]
> > [  861.535556]  nvif_notify_work+0x2d/0xb0 [nouveau]
> > [  861.536221]  process_one_work+0x231/0x620
> > [  861.536994]  worker_thread+0x44/0x3a0
> > [  861.537757]  kthread+0x12b/0x150
> > [  861.538463]  ? wq_pool_ids_show+0x140/0x140
> > [  861.539102]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [  861.539815]  ret_from_fork+0x3a/0x50
> > [  861.540521]
> >                Showing all locks held in the system:
> > [  861.541696] 2 locks held by kworker/0:2/61:
> > [  861.542406]  #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at:
> > process_one_work+0x1b3/0x620
> > [  861.543071]  #1: 0000000076868126 ((work_completion)(&drm-
> > >hpd_work)){+.+.}, at: process_one_work+0x1b3/0x620
> > [  861.543814] 1 lock held by khungtaskd/64:
> > [  861.544535]  #0: 0000000059db4b53 (rcu_read_lock){....}, at:
> > debug_show_all_locks+0x23/0x185
> > [  861.545160] 3 locks held by kworker/6:2/320:
> > [  861.545896]  #0: 00000000d9e1bc59 ((wq_completion)"pm"){+.+.}, at:
> > process_one_work+0x1b3/0x620
> > [  861.546702]  #1: 00000000c9f92d84 ((work_completion)(&dev-
> > >power.work)){+.+.}, at: process_one_work+0x1b3/0x620
> > [  861.547443]  #2: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at:
> > nouveau_display_fini+0x96/0x170 [nouveau]
> > [  861.548146] 1 lock held by dmesg/983:
> > [  861.548889] 2 locks held by zsh/1250:
> > [  861.549605]  #0: 00000000348e3cf6 (&tty->ldisc_sem){++++}, at:
> > ldsem_down_read+0x37/0x40
> > [  861.550393]  #1: 000000007009a7a8 (&ldata->atomic_read_lock){+.+.}, at:
> > n_tty_read+0xc1/0x870
> > [  861.551122] 6 locks held by kworker/6:0/1329:
> > [  861.551957]  #0: 000000002dbf8af5 ((wq_completion)"events"){+.+.}, at:
> > process_one_work+0x1b3/0x620
> > [  861.552765]  #1: 00000000ddb499ad ((work_completion)(&notify-
> > >work)#2){+.+.}, at: process_one_work+0x1b3/0x620
> > [  861.553582]  #2: 000000006e013cbe (&dev->mode_config.mutex){+.+.}, at:
> > drm_helper_hpd_irq_event+0x6c/0x120 [drm_kms_helper]
> > [  861.554357]  #3: 000000004afc5de1 (drm_connector_list_iter){.+.+}, at:
> > drm_helper_hpd_irq_event+0x78/0x120 [drm_kms_helper]
> > [  861.555227]  #4: 0000000044f294d9 (crtc_ww_class_acquire){+.+.}, at:
> > drm_helper_probe_detect_ctx+0x3d/0x100 [drm_kms_helper]
> > [  861.556133]  #5: 00000000db193642 (crtc_ww_class_mutex){+.+.}, at:
> > drm_modeset_lock+0x4b/0x130 [drm]
> > 
> > [  861.557864] =============================================
> > 
> > [  861.559507] NMI backtrace for cpu 2
> > [  861.560363] CPU: 2 PID: 64 Comm: khungtaskd Tainted:
> > G           O      4.18.0-rc6Lyude-Test+ #1
> > [  861.561197] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W
> > (1.51 ) 05/18/2018
> > [  861.561948] Call Trace:
> > [  861.562757]  dump_stack+0x8e/0xd3
> > [  861.563516]  nmi_cpu_backtrace.cold.3+0x14/0x5a
> > [  861.564269]  ? lapic_can_unplug_cpu.cold.27+0x42/0x42
> > [  861.565029]  nmi_trigger_cpumask_backtrace+0xa1/0xae
> > [  861.565789]  arch_trigger_cpumask_backtrace+0x19/0x20
> > [  861.566558]  watchdog+0x316/0x580
> > [  861.567355]  kthread+0x12b/0x150
> > [  861.568114]  ? reset_hung_task_detector+0x20/0x20
> > [  861.568863]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [  861.569598]  ret_from_fork+0x3a/0x50
> > [  861.570370] Sending NMI from CPU 2 to CPUs 0-1,3-7:
> > [  861.571426] NMI backtrace for cpu 6 skipped: idling at
> > intel_idle+0x7f/0x120
> > [  861.571429] NMI backtrace for cpu 7 skipped: idling at
> > intel_idle+0x7f/0x120
> > [  861.571432] NMI backtrace for cpu 3 skipped: idling at
> > intel_idle+0x7f/0x120
> > [  861.571464] NMI backtrace for cpu 5 skipped: idling at
> > intel_idle+0x7f/0x120
> > [  861.571467] NMI backtrace for cpu 0 skipped: idling at
> > intel_idle+0x7f/0x120
> > [  861.571469] NMI backtrace for cpu 4 skipped: idling at
> > intel_idle+0x7f/0x120
> > [  861.571472] NMI backtrace for cpu 1 skipped: idling at
> > intel_idle+0x7f/0x120
> > [  861.572428] Kernel panic - not syncing: hung_task: blocked tasks
> > 
> > So: fix this with a new trick; store the current task_struct that's
> > executing in the nouveau_connector structure, then avoid attempting to
> > runtime resume the device when we know that we're just running from the
> > context of our hotplug interrupt handler. Since hpd interrupts are only
> > enabled while the device is runtime active, this should be totally safe.
> > 
> > Signed-off-by: Lyude Paul <lyude@redhat.com>
> > Cc: stable@vger.kernel.org
> > Cc: Lukas Wunner <lukas@wunner.de>
> > Cc: Karol Herbst <karolherbst@gmail.com>
> > ---
> >  drivers/gpu/drm/nouveau/nouveau_connector.c | 16 ++++++++++------
> >  drivers/gpu/drm/nouveau/nouveau_connector.h |  1 +
> >  2 files changed, 11 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c
> > b/drivers/gpu/drm/nouveau/nouveau_connector.c
> > index 9714e09f17db..8409c3f2c3a1 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_connector.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
> > @@ -572,13 +572,14 @@ nouveau_connector_detect(struct drm_connector
> > *connector, bool force)
> >  		nv_connector->edid = NULL;
> >  	}
> >  
> > -	/* Outputs are only polled while runtime active, so resuming the
> > -	 * device here is unnecessary (and would deadlock upon runtime suspend
> > -	 * because it waits for polling to finish). We do however, want to
> > -	 * prevent the autosuspend timer from elapsing during this operation
> > -	 * if possible.
> > +	/* Output polling and HPD only happens while we're runtime active, so
> > +	 * resuming the device here is unnecessary (and would deadlock upon
> > +	 * runtime suspend because it waits for polling to finish). We do
> > +	 * however, want to prevent the autosuspend timer from elapsing during
> > +	 * this operation if possible.
> >  	 */
> > -	if (drm_kms_helper_is_poll_worker()) {
> > +	if (drm_kms_helper_is_poll_worker() ||
> 
> Uh, this entirely slipped through my radar. Looks very much like a hack.
> If you want to avoid doing runtime pm from the poll worker, imo the right
> thing to do would be to use pm_runtime_get_if_in_use() and skip the entire
> detection (returning the current state) if you can't get a runtime pm
> reference. You could also use the force parameter to figure out whether
> userspace is asking for this information or not.
Nope! While you're definitely right that we should actually avoid disabling
output polling during runtime suspend (see my response to your response on
patch 3), this part is still needed, and it also matches what i915 currently
does in its irq handlers (see: all the disable_rpm_wakeref_asserts() calls in
i915_hpd.c). pm_runtime_get_if_in_use() won't work because (taken from
Documentation/power/runtime_pm.txt):

  int pm_runtime_get_if_in_use(struct device *dev);
    - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the
      runtime PM status is RPM_ACTIVE and the runtime PM usage counter is
      nonzero, increment the counter and return 1; otherwise return 0 without
      changing the counter

Which means that whenever the usage count is zero (e.g. at all points where
the GPU is awake but isn't in use and is just waiting for the autosuspend
delay to elapse), we'd skip connector detection entirely unless a screen is
active or some other kind of activity is holding a runtime power reference.

Additionally: nouveau_connector_detect() is what sees whether or not the
connector status has changed in order to determine whether we want to
propagate the event to sysfs. If we skip running the function at any point,
we're just dropping hotplug events.
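
To make that window concrete, a hedged illustration of the sequence (not
real code, just the ordering):

/*
 * t0: the last user of the GPU drops its ref -> usage count hits 0;
 *     the device stays RPM_ACTIVE while the autosuspend delay counts
 *     down
 * t1: the user plugs in a display -> hpd fires, detect() runs
 * t2: pm_runtime_get_if_in_use() sees count == 0 -> returns 0
 * t3: detection is skipped, so the connector change never reaches
 *     userspace, i.e. the hotplug event is silently dropped
 */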
> 
> Otoh it's a bit nasty to stop the polling (if the hw really requires it)
> just because the display is suspended, which makes waking up the machine
> just a notch more annoying. E.g. in i915 we keep polling (and use some ACPI
> methods if available for hpd detection, to waste less power on rpm
> wakeups).
> -Daniel
> 
> > +	    nv_connector->hpd_task == current) {
> >  		pm_runtime_get_noresume(dev->dev);
> >  	} else {
> >  		ret = pm_runtime_get_sync(dev->dev);
> > @@ -1151,6 +1152,8 @@ nouveau_connector_hotplug(struct nvif_notify
> > *notify)
> >  	const char *name = connector->name;
> >  	struct nouveau_encoder *nv_encoder;
> >  
> > +	nv_connector->hpd_task = current;
> > +
> >  	if (rep->mask & NVIF_NOTIFY_CONN_V0_IRQ) {
> >  		NV_DEBUG(drm, "service %s\n", name);
> >  		if ((nv_encoder = find_encoder(connector, DCB_OUTPUT_DP)))
> > @@ -1167,6 +1170,7 @@ nouveau_connector_hotplug(struct nvif_notify
> > *notify)
> >  		nouveau_connector_hotplug_probe(nv_connector);
> >  	}
> >  
> > +	nv_connector->hpd_task = NULL;
> >  	return NVIF_NOTIFY_KEEP;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.h
> > b/drivers/gpu/drm/nouveau/nouveau_connector.h
> > index 2d9d35a146a4..1964e682ba13 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_connector.h
> > +++ b/drivers/gpu/drm/nouveau/nouveau_connector.h
> > @@ -45,6 +45,7 @@ struct nouveau_connector {
> >  	u8 *dcb;
> >  
> >  	struct nvif_notify hpd;
> > +	struct task_struct *hpd_task;
> >  
> >  	struct drm_dp_aux aux;
> >  
> > -- 
> > 2.17.1
> > 
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
> 
-- 
Cheers,
	Lyude Paul


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-08-06 19:29 UTC | newest]

Thread overview: 16+ messages
2018-08-01 21:14 [PATCH v4 0/8] Fix connector probing deadlocks from RPM bugs Lyude Paul
2018-08-01 21:14 ` [PATCH v4 1/8] drm/nouveau: Fix bogus drm_kms_helper_poll_enable() placement Lyude Paul
2018-08-02 19:10   ` Lukas Wunner
2018-08-01 21:14 ` [PATCH v4 2/8] drm/nouveau: Enable polling even if we have runtime PM Lyude Paul
2018-08-02 19:14   ` Lukas Wunner
2018-08-02 19:37     ` Lyude Paul
2018-08-01 21:14 ` [PATCH v4 3/8] drm/fb_helper: Introduce suspend/resume_hotplug() Lyude Paul
2018-08-01 21:14 ` [PATCH v4 4/8] drm/nouveau: Fix deadlock with fb_helper using new helpers Lyude Paul
2018-08-01 21:14 ` [PATCH v4 5/8] drm/nouveau: Use pm_runtime_get_noresume() in connector_detect() Lyude Paul
2018-08-01 21:14 ` [PATCH v4 6/8] drm/nouveau: Respond to HPDs by probing one conn at a time Lyude Paul
2018-08-01 21:14 ` [PATCH v4 7/8] drm/nouveau: Fix deadlocks in nouveau_connector_detect() Lyude Paul
2018-08-06  9:15   ` Daniel Vetter
2018-08-06 19:29     ` Lyude Paul
2018-08-01 21:14 ` [PATCH v4 8/8] drm/nouveau: Call pm_runtime_get_noresume() from hpd handlers Lyude Paul
2018-08-02 19:40   ` Lukas Wunner
2018-08-02 20:00     ` Lyude Paul
