linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Fix connector probing deadlocks from RPM bugs
@ 2018-07-18 20:56 Lyude Paul
  2018-07-18 20:56 ` [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm() Lyude Paul
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Lyude Paul @ 2018-07-18 20:56 UTC (permalink / raw)
  To: nouveau
  Cc: David Airlie, Daniel Vetter, Sean Paul, Maarten Lankhorst,
	linux-kernel, dri-devel, Ben Skeggs, Gustavo Padovan,
	Ville Syrjälä,
	Lyude Paul

This is a trimmed down version of

https://patchwork.freedesktop.org/series/46637/

with all of the review comments addressed.

The last version of this series had fixes for the i2c and DP aux busses
to ensure that the GPU would be turned on whenever anything tried to
access the i2c/aux busses. Unfortunately: one of the fixes apparently
contained some very incorrect usage of Linux's runtime PM interface in
order to prevent runtime suspend/resume requests coming from within
nouveau's suspend/resume callbacks from deadlocking the system.

Turns out: fixing the i2c/dp aux problem is a lot harder then I
originally anticipated. We either need to come up with helpers for DRM
that work around the PM core, which is really not ideal, or come up with
a new feature for the RPM core that allows us to inhibit suspend/resume
callbacks. The former seems to be somewhat trivial to implement, but
coming up with a decent interface for that is going to take a bit more
time and I'd much rather have issues causing deadlocks get fixed first.
This means that drm_dp_aux_dev is going to be broken on nouveau for
laptops with hybrid GPUs using RPM, but it was already broken before
anyhow.

So: this just contains the seriously important fixes that will stop
people's machines from crashing very hard. Hopefully I can come up with
a solution to the i2c/aux problem soon after fixing some other glaring
MST bugs nouveau currently has.

Lyude Paul (2):
  drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm()
  drm/probe_helper: Add
    drm_helper_probe_single_connector_modes_with_rpm()

 drivers/gpu/drm/drm_fb_helper.c             | 23 +++++++++++++++
 drivers/gpu/drm/drm_probe_helper.c          | 31 +++++++++++++++++++++
 drivers/gpu/drm/nouveau/dispnv50/disp.c     |  4 +--
 drivers/gpu/drm/nouveau/nouveau_connector.c |  4 +--
 include/drm/drm_crtc_helper.h               |  7 +++--
 include/drm/drm_fb_helper.h                 |  5 ++++
 6 files changed, 67 insertions(+), 7 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm()
  2018-07-18 20:56 [PATCH 0/2] Fix connector probing deadlocks from RPM bugs Lyude Paul
@ 2018-07-18 20:56 ` Lyude Paul
  2018-07-19  7:49   ` Lukas Wunner
  2018-07-18 20:56 ` [PATCH 2/2] drm/probe_helper: Add drm_helper_probe_single_connector_modes_with_rpm() Lyude Paul
  2018-08-06  8:31 ` [PATCH 0/2] Fix connector probing deadlocks from RPM bugs Daniel Vetter
  2 siblings, 1 reply; 10+ messages in thread
From: Lyude Paul @ 2018-07-18 20:56 UTC (permalink / raw)
  To: nouveau
  Cc: Lukas Wunner, stable, Gustavo Padovan, Maarten Lankhorst,
	Sean Paul, David Airlie, Ben Skeggs, Daniel Vetter,
	Ville Syrjälä,
	dri-devel, linux-kernel

When DP MST hubs get confused, they can occasionally stop responding for
a good bit of time up until the point where the DRM driver manages to
do the right DPCD accesses to get it to start responding again. In a
worst case scenario however, this process can take upwards of 10+
seconds.

Currently we use the default output_poll_changed handler
drm_fb_helper_output_poll_changed() to handle output polling, which
doesn't happen to grab any power references on the device when polling.
If we're unlucky enough to have a hub (such as Lenovo's infamous laptop
docks for the P5x/P7x series) that's easily startled and confused, this
can lead to a pretty nasty deadlock situation that looks like this:

- Hotplug event from hub happens, we enter
  drm_fb_helper_output_poll_changed() and start communicating with the
  hub
- While we're in drm_fb_helper_output_poll_changed() and attempting to
  communicate with the hub, we end up confusing it and cause it to stop
  responding for at least 10 seconds
- After 5 seconds of being in drm_fb_helper_output_poll_changed(), the
  pm core attempts to put the GPU into autosuspend, which ends up
  calling drm_kms_helper_poll_disable()
- While the runtime PM core is waiting in drm_kms_helper_poll_disable()
  for the output poll to finish, we end up finally detecting an MST
  display
- We notice the new display and tries to enable it, which triggers
  an atomic commit which triggers a call to pm_runtime_get_sync()
- the output poll thread deadlocks the pm core waiting for the pm core
  to finish the autosuspend request while the pm core waits for the
  output poll thread to finish

Sample:
[  246.669625] INFO: task kworker/4:0:37 blocked for more than 120 seconds.
[  246.673398]       Not tainted 4.18.0-rc5Lyude-Test+ #2
[  246.675271] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.676527] kworker/4:0     D    0    37      2 0x80000000
[  246.677580] Workqueue: events output_poll_execute [drm_kms_helper]
[  246.678704] Call Trace:
[  246.679753]  __schedule+0x322/0xaf0
[  246.680916]  schedule+0x33/0x90
[  246.681924]  schedule_preempt_disabled+0x15/0x20
[  246.683023]  __mutex_lock+0x569/0x9a0
[  246.684035]  ? kobject_uevent_env+0x117/0x7b0
[  246.685132]  ? drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
[  246.686179]  mutex_lock_nested+0x1b/0x20
[  246.687278]  ? mutex_lock_nested+0x1b/0x20
[  246.688307]  drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
[  246.689420]  drm_fb_helper_output_poll_changed+0x23/0x30 [drm_kms_helper]
[  246.690462]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
[  246.691570]  output_poll_execute+0x198/0x1c0 [drm_kms_helper]
[  246.692611]  process_one_work+0x231/0x620
[  246.693725]  worker_thread+0x214/0x3a0
[  246.694756]  kthread+0x12b/0x150
[  246.695856]  ? wq_pool_ids_show+0x140/0x140
[  246.696888]  ? kthread_create_worker_on_cpu+0x70/0x70
[  246.697998]  ret_from_fork+0x3a/0x50
[  246.699034] INFO: task kworker/0:1:60 blocked for more than 120 seconds.
[  246.700153]       Not tainted 4.18.0-rc5Lyude-Test+ #2
[  246.701182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.702278] kworker/0:1     D    0    60      2 0x80000000
[  246.703293] Workqueue: pm pm_runtime_work
[  246.704393] Call Trace:
[  246.705403]  __schedule+0x322/0xaf0
[  246.706439]  ? wait_for_completion+0x104/0x190
[  246.707393]  schedule+0x33/0x90
[  246.708375]  schedule_timeout+0x3a5/0x590
[  246.709289]  ? mark_held_locks+0x58/0x80
[  246.710208]  ? _raw_spin_unlock_irq+0x2c/0x40
[  246.711222]  ? wait_for_completion+0x104/0x190
[  246.712134]  ? trace_hardirqs_on_caller+0xf4/0x190
[  246.713094]  ? wait_for_completion+0x104/0x190
[  246.713964]  wait_for_completion+0x12c/0x190
[  246.714895]  ? wake_up_q+0x80/0x80
[  246.715727]  ? get_work_pool+0x90/0x90
[  246.716649]  flush_work+0x1c9/0x280
[  246.717483]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  246.718442]  __cancel_work_timer+0x146/0x1d0
[  246.719247]  cancel_delayed_work_sync+0x13/0x20
[  246.720043]  drm_kms_helper_poll_disable+0x1f/0x30 [drm_kms_helper]
[  246.721123]  nouveau_pmops_runtime_suspend+0x3d/0xb0 [nouveau]
[  246.721897]  pci_pm_runtime_suspend+0x6b/0x190
[  246.722825]  ? pci_has_legacy_pm_support+0x70/0x70
[  246.723737]  __rpm_callback+0x7a/0x1d0
[  246.724721]  ? pci_has_legacy_pm_support+0x70/0x70
[  246.725607]  rpm_callback+0x24/0x80
[  246.726553]  ? pci_has_legacy_pm_support+0x70/0x70
[  246.727376]  rpm_suspend+0x142/0x6b0
[  246.728185]  pm_runtime_work+0x97/0xc0
[  246.728938]  process_one_work+0x231/0x620
[  246.729796]  worker_thread+0x44/0x3a0
[  246.730614]  kthread+0x12b/0x150
[  246.731395]  ? wq_pool_ids_show+0x140/0x140
[  246.732202]  ? kthread_create_worker_on_cpu+0x70/0x70
[  246.732878]  ret_from_fork+0x3a/0x50
[  246.733768] INFO: task kworker/4:2:422 blocked for more than 120 seconds.
[  246.734587]       Not tainted 4.18.0-rc5Lyude-Test+ #2
[  246.735393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.736113] kworker/4:2     D    0   422      2 0x80000080
[  246.736789] Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
[  246.737665] Call Trace:
[  246.738490]  __schedule+0x322/0xaf0
[  246.739250]  schedule+0x33/0x90
[  246.739908]  rpm_resume+0x19c/0x850
[  246.740750]  ? finish_wait+0x90/0x90
[  246.741541]  __pm_runtime_resume+0x4e/0x90
[  246.742370]  nv50_disp_atomic_commit+0x31/0x210 [nouveau]
[  246.743124]  drm_atomic_commit+0x4a/0x50 [drm]
[  246.743775]  restore_fbdev_mode_atomic+0x1c8/0x240 [drm_kms_helper]
[  246.744603]  restore_fbdev_mode+0x31/0x140 [drm_kms_helper]
[  246.745373]  drm_fb_helper_restore_fbdev_mode_unlocked+0x54/0xb0 [drm_kms_helper]
[  246.746220]  drm_fb_helper_set_par+0x2d/0x50 [drm_kms_helper]
[  246.746884]  drm_fb_helper_hotplug_event.part.28+0x96/0xb0 [drm_kms_helper]
[  246.747675]  drm_fb_helper_output_poll_changed+0x23/0x30 [drm_kms_helper]
[  246.748544]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
[  246.749439]  nv50_mstm_hotplug+0x15/0x20 [nouveau]
[  246.750111]  drm_dp_send_link_address+0x177/0x1c0 [drm_kms_helper]
[  246.750764]  drm_dp_check_and_send_link_address+0xa8/0xd0 [drm_kms_helper]
[  246.751602]  drm_dp_mst_link_probe_work+0x51/0x90 [drm_kms_helper]
[  246.752314]  process_one_work+0x231/0x620
[  246.752979]  worker_thread+0x44/0x3a0
[  246.753838]  kthread+0x12b/0x150
[  246.754619]  ? wq_pool_ids_show+0x140/0x140
[  246.755386]  ? kthread_create_worker_on_cpu+0x70/0x70
[  246.756162]  ret_from_fork+0x3a/0x50
[  246.756847]
               Showing all locks held in the system:
[  246.758261] 3 locks held by kworker/4:0/37:
[  246.759016]  #0: 00000000f8df4d2d ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
[  246.759856]  #1: 00000000e6065461 ((work_completion)(&(&dev->mode_config.output_poll_work)->work)){+.+.}, at: process_one_work+0x1b3/0x620
[  246.760670]  #2: 00000000cb66735f (&helper->lock){+.+.}, at: drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
[  246.761516] 2 locks held by kworker/0:1/60:
[  246.762274]  #0: 00000000fff6be0f ((wq_completion)"pm"){+.+.}, at: process_one_work+0x1b3/0x620
[  246.762982]  #1: 000000005ab44fb4 ((work_completion)(&dev->power.work)){+.+.}, at: process_one_work+0x1b3/0x620
[  246.763890] 1 lock held by khungtaskd/64:
[  246.764664]  #0: 000000008cb8b5c3 (rcu_read_lock){....}, at: debug_show_all_locks+0x23/0x185
[  246.765588] 5 locks held by kworker/4:2/422:
[  246.766440]  #0: 00000000232f0959 ((wq_completion)"events_long"){+.+.}, at: process_one_work+0x1b3/0x620
[  246.767390]  #1: 00000000bb59b134 ((work_completion)(&mgr->work)){+.+.}, at: process_one_work+0x1b3/0x620
[  246.768154]  #2: 00000000cb66735f (&helper->lock){+.+.}, at: drm_fb_helper_restore_fbdev_mode_unlocked+0x4c/0xb0 [drm_kms_helper]
[  246.768966]  #3: 000000004c8f0b6b (crtc_ww_class_acquire){+.+.}, at: restore_fbdev_mode_atomic+0x4b/0x240 [drm_kms_helper]
[  246.769921]  #4: 000000004c34a296 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_backoff+0x8a/0x1b0 [drm]
[  246.770839] 1 lock held by dmesg/1038:
[  246.771739] 2 locks held by zsh/1172:
[  246.772650]  #0: 00000000836d0438 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
[  246.773680]  #1: 000000001f4f4d48 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0xc1/0x870

[  246.775522] =============================================

So, to fix this (and any other possible deadlock issues like this that
could occur in the output_poll_changed patch) we make sure that
nouveau's output_poll_changed functions grab a runtime power ref before
sending any hotplug events, and hold it until we're finished. We
introduce this through adding a generic DRM helper which other drivers
may reuse.

This fixes deadlock issues when in fbcon with nouveau on my P50, and
should fix it for everyone else's as well!

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Karol Herbst <karolherbst@gmail.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org
---
Changes since v1:
 - Add a generic helper for DRM to handle this

 drivers/gpu/drm/drm_fb_helper.c         | 23 +++++++++++++++++++++++
 drivers/gpu/drm/nouveau/dispnv50/disp.c |  2 +-
 include/drm/drm_fb_helper.h             |  5 +++++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index 2ee1eaa66188..1ab2f3646526 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -34,6 +34,7 @@
 #include <linux/sysrq.h>
 #include <linux/slab.h>
 #include <linux/module.h>
+#include <linux/pm_runtime.h>
 #include <drm/drmP.h>
 #include <drm/drm_crtc.h>
 #include <drm/drm_fb_helper.h>
@@ -2928,6 +2929,28 @@ void drm_fb_helper_output_poll_changed(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_fb_helper_output_poll_changed);
 
+/**
+ * drm_fb_helper_output_poll_changed_with_rpm - DRM mode
+ *                                              config \.output_poll_changed
+ *                                              helper for fbdev emulation
+ * @dev: DRM device
+ *
+ * Same as drm_fb_helper_output_poll_changed, except that it makes sure that
+ * the device is active by synchronously grabbing a runtime power reference
+ * while probing.
+ */
+void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device *dev)
+{
+	int ret;
+
+	ret = pm_runtime_get_sync(dev->dev);
+	if (WARN_ON(ret < 0 && ret != -EACCES))
+		return;
+	drm_fb_helper_hotplug_event(dev->fb_helper);
+	pm_runtime_put(dev->dev);
+}
+EXPORT_SYMBOL(drm_fb_helper_output_poll_changed_with_rpm);
+
 /* The Kconfig DRM_KMS_HELPER selects FRAMEBUFFER_CONSOLE (if !EXPERT)
  * but the module doesn't depend on any fb console symbols.  At least
  * attempt to load fbcon to avoid leaving the system without a usable console.
diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c b/drivers/gpu/drm/nouveau/dispnv50/disp.c
index eb3e41a78806..fa3ab618a0f9 100644
--- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
@@ -2015,7 +2015,7 @@ nv50_disp_atomic_state_alloc(struct drm_device *dev)
 static const struct drm_mode_config_funcs
 nv50_disp_func = {
 	.fb_create = nouveau_user_framebuffer_create,
-	.output_poll_changed = drm_fb_helper_output_poll_changed,
+	.output_poll_changed = drm_fb_helper_output_poll_changed_with_rpm,
 	.atomic_check = nv50_disp_atomic_check,
 	.atomic_commit = nv50_disp_atomic_commit,
 	.atomic_state_alloc = nv50_disp_atomic_state_alloc,
diff --git a/include/drm/drm_fb_helper.h b/include/drm/drm_fb_helper.h
index b069433e7fc1..ca809bfbaebb 100644
--- a/include/drm/drm_fb_helper.h
+++ b/include/drm/drm_fb_helper.h
@@ -330,6 +330,7 @@ void drm_fb_helper_fbdev_teardown(struct drm_device *dev);
 
 void drm_fb_helper_lastclose(struct drm_device *dev);
 void drm_fb_helper_output_poll_changed(struct drm_device *dev);
+void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device *dev);
 #else
 static inline void drm_fb_helper_prepare(struct drm_device *dev,
 					struct drm_fb_helper *helper,
@@ -564,6 +565,10 @@ static inline void drm_fb_helper_output_poll_changed(struct drm_device *dev)
 {
 }
 
+static inline void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device *dev)
+{
+}
+
 #endif
 
 static inline int
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] drm/probe_helper: Add drm_helper_probe_single_connector_modes_with_rpm()
  2018-07-18 20:56 [PATCH 0/2] Fix connector probing deadlocks from RPM bugs Lyude Paul
  2018-07-18 20:56 ` [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm() Lyude Paul
@ 2018-07-18 20:56 ` Lyude Paul
  2018-07-19  8:08   ` Lukas Wunner
  2018-08-06  8:31 ` [PATCH 0/2] Fix connector probing deadlocks from RPM bugs Daniel Vetter
  2 siblings, 1 reply; 10+ messages in thread
From: Lyude Paul @ 2018-07-18 20:56 UTC (permalink / raw)
  To: nouveau
  Cc: Lukas Wunner, stable, Gustavo Padovan, Maarten Lankhorst,
	Sean Paul, David Airlie, Ben Skeggs, Daniel Vetter,
	Ville Syrjälä,
	dri-devel, linux-kernel

For nouveau, while the GPU is guaranteed to be on when a hotplug has
been received, the same assertion does not hold true if a connector
probe has been started by userspace without having had received a sysfs
event.

So ensure that any connector probing keeps the GPU alive for the
duration of the probe by introducing
drm_helper_probe_single_connector_modes_with_rpm(). It's the same as
drm_helper_probe_single_connector_modes, but it handles holding a power
reference to the device for the duration of the connector probe.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Karol Herbst <karolherbst@gmail.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org
---
Changes since v1:
 - Add a generic helper to DRM to handle this

 drivers/gpu/drm/drm_probe_helper.c          | 31 +++++++++++++++++++++
 drivers/gpu/drm/nouveau/dispnv50/disp.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_connector.c |  4 +--
 include/drm/drm_crtc_helper.h               |  7 +++--
 4 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/drm_probe_helper.c b/drivers/gpu/drm/drm_probe_helper.c
index 527743394150..0a9d6748b854 100644
--- a/drivers/gpu/drm/drm_probe_helper.c
+++ b/drivers/gpu/drm/drm_probe_helper.c
@@ -31,6 +31,7 @@
 
 #include <linux/export.h>
 #include <linux/moduleparam.h>
+#include <linux/pm_runtime.h>
 
 #include <drm/drmP.h>
 #include <drm/drm_crtc.h>
@@ -541,6 +542,36 @@ int drm_helper_probe_single_connector_modes(struct drm_connector *connector,
 }
 EXPORT_SYMBOL(drm_helper_probe_single_connector_modes);
 
+/**
+ * drm_helper_probe_single_connector_modes_with_rpm - get complete set of
+ *                                                    display modes
+ * @connector: connector to probe
+ * @maxX: max width for modes
+ * @maxY: max height for modes
+ *
+ * Same as drm_helper_probe_single_connector_modes, except that it makes sure
+ * that the device is active by synchronously grabbing a runtime power
+ * reference while probing.
+ *
+ * Returns:
+ * The number of modes found on @connector.
+ */
+int drm_helper_probe_single_connector_modes_with_rpm(struct drm_connector *connector,
+						     uint32_t maxX, uint32_t maxY)
+{
+	int ret;
+
+	ret = pm_runtime_get_sync(connector->dev->dev);
+	if (ret < 0 && ret != -EACCES)
+		return ret;
+
+	ret = drm_helper_probe_single_connector_modes(connector, maxX, maxY);
+
+	pm_runtime_put(connector->dev->dev);
+	return ret;
+}
+EXPORT_SYMBOL(drm_helper_probe_single_connector_modes_with_rpm);
+
 /**
  * drm_kms_helper_hotplug_event - fire off KMS hotplug events
  * @dev: drm_device whose connector state changed
diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c b/drivers/gpu/drm/nouveau/dispnv50/disp.c
index fa3ab618a0f9..c54767b50fd8 100644
--- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
@@ -858,7 +858,7 @@ static const struct drm_connector_funcs
 nv50_mstc = {
 	.reset = nouveau_conn_reset,
 	.detect = nv50_mstc_detect,
-	.fill_modes = drm_helper_probe_single_connector_modes,
+	.fill_modes = drm_helper_probe_single_connector_modes_with_rpm,
 	.destroy = nv50_mstc_destroy,
 	.atomic_duplicate_state = nouveau_conn_atomic_duplicate_state,
 	.atomic_destroy_state = nouveau_conn_atomic_destroy_state,
diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
index 2a45b4c2ceb0..8d9070779261 100644
--- a/drivers/gpu/drm/nouveau/nouveau_connector.c
+++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
@@ -1088,7 +1088,7 @@ nouveau_connector_funcs = {
 	.reset = nouveau_conn_reset,
 	.detect = nouveau_connector_detect,
 	.force = nouveau_connector_force,
-	.fill_modes = drm_helper_probe_single_connector_modes,
+	.fill_modes = drm_helper_probe_single_connector_modes_with_rpm,
 	.set_property = nouveau_connector_set_property,
 	.destroy = nouveau_connector_destroy,
 	.atomic_duplicate_state = nouveau_conn_atomic_duplicate_state,
@@ -1103,7 +1103,7 @@ nouveau_connector_funcs_lvds = {
 	.reset = nouveau_conn_reset,
 	.detect = nouveau_connector_detect_lvds,
 	.force = nouveau_connector_force,
-	.fill_modes = drm_helper_probe_single_connector_modes,
+	.fill_modes = drm_helper_probe_single_connector_modes_with_rpm,
 	.set_property = nouveau_connector_set_property,
 	.destroy = nouveau_connector_destroy,
 	.atomic_duplicate_state = nouveau_conn_atomic_duplicate_state,
diff --git a/include/drm/drm_crtc_helper.h b/include/drm/drm_crtc_helper.h
index 6914633037a5..8f3f6d6fcc8c 100644
--- a/include/drm/drm_crtc_helper.h
+++ b/include/drm/drm_crtc_helper.h
@@ -64,9 +64,10 @@ int drm_helper_crtc_mode_set_base(struct drm_crtc *crtc, int x, int y,
 				  struct drm_framebuffer *old_fb);
 
 /* drm_probe_helper.c */
-int drm_helper_probe_single_connector_modes(struct drm_connector
-					    *connector, uint32_t maxX,
-					    uint32_t maxY);
+int drm_helper_probe_single_connector_modes(struct drm_connector *connector,
+					    uint32_t maxX, uint32_t maxY);
+int drm_helper_probe_single_connector_modes_with_rpm(struct drm_connector *connector,
+						     uint32_t maxX, uint32_t maxY);
 int drm_helper_probe_detect(struct drm_connector *connector,
 			    struct drm_modeset_acquire_ctx *ctx,
 			    bool force);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm()
  2018-07-18 20:56 ` [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm() Lyude Paul
@ 2018-07-19  7:49   ` Lukas Wunner
  2018-07-20  0:08     ` Lyude Paul
  0 siblings, 1 reply; 10+ messages in thread
From: Lukas Wunner @ 2018-07-19  7:49 UTC (permalink / raw)
  To: Lyude Paul
  Cc: nouveau, Gustavo Padovan, Maarten Lankhorst, Sean Paul,
	David Airlie, Ben Skeggs, Daniel Vetter, Ville Syrjälä,
	dri-devel, linux-kernel

On Wed, Jul 18, 2018 at 04:56:39PM -0400, Lyude Paul wrote:
> When DP MST hubs get confused, they can occasionally stop responding for
> a good bit of time up until the point where the DRM driver manages to
> do the right DPCD accesses to get it to start responding again. In a
> worst case scenario however, this process can take upwards of 10+
> seconds.
> 
> Currently we use the default output_poll_changed handler
> drm_fb_helper_output_poll_changed() to handle output polling, which
> doesn't happen to grab any power references on the device when polling.
> If we're unlucky enough to have a hub (such as Lenovo's infamous laptop
> docks for the P5x/P7x series) that's easily startled and confused, this
> can lead to a pretty nasty deadlock situation that looks like this:
> 
> - Hotplug event from hub happens, we enter
>   drm_fb_helper_output_poll_changed() and start communicating with the
>   hub
> - While we're in drm_fb_helper_output_poll_changed() and attempting to
>   communicate with the hub, we end up confusing it and cause it to stop
>   responding for at least 10 seconds
> - After 5 seconds of being in drm_fb_helper_output_poll_changed(), the
>   pm core attempts to put the GPU into autosuspend, which ends up
>   calling drm_kms_helper_poll_disable()
> - While the runtime PM core is waiting in drm_kms_helper_poll_disable()
>   for the output poll to finish, we end up finally detecting an MST
>   display
> - We notice the new display and tries to enable it, which triggers
>   an atomic commit which triggers a call to pm_runtime_get_sync()
> - the output poll thread deadlocks the pm core waiting for the pm core
>   to finish the autosuspend request while the pm core waits for the
>   output poll thread to finish

This will still deadlock if ->runtime_suspend commences before the
hotplug event and the hotplug event occurs before polling has been
disabled by ->runtime_suspend.

The correct fix is to call pm_runtime_get_sync() *conditionally* in
the atomic commit which enables the display, using the same conditional
as d61a5c106351, i.e. if (!drm_kms_helper_is_poll_worker()).

Now I realize I sent you down the wrong path when I suggested to
introduce a DRM helper here.  My apologies, I didn't fully appreciate
what is going awry here!

Anything that happens in nouveau's poll worker never needs to acquire
a runtime PM ref because polling is only enabled while runtime active,
and ->runtime_suspend waits for an ongoing poll to finish.

Thinking a bit more about this, our mistake is to acquire runtime PM
refs too far down in the call stack.  As a fix that can be backported
to stable, adding if (!drm_kms_helper_is_poll_worker()) conditionals
seems fine to me, but the long term fix is to push acquisition of refs
further up in the call stack.

E.g., if the user forces connector probing via sysfs, a runtime PM ref
should be acquired in status_store() in drm_sysfs.c before invoking
connector->funcs->fill_modes().  That way, if the user forces connector
probing while the GPU is powering down, rpm_resume() will correctly wait
for rpm_suspend() to finish before resuming the card.  So if we architect
it like this, we're actually using the functionality provided by the
PM core in the way that it's supposed to be used.

The problem is that adding pm_runtime_get_sync() to status_store()
conflicts with the desire to have a library of generic DRM functions:
Some GPUs may be able to probe connectors without resuming to runtime
active state, others don't use runtime PM at all.  One solution that
comes to mind is a driver_features flag which tells the DRM core whether
to acquire a runtime PM ref in various places.

In your original patches 4 and 5, what exactly was the call stack which
led to i2c being accessed while runtime suspended?  Was it sysfs access
via /sys/class/i2c-adapter/* ?  If so, acquisition of the runtime PM ref
needs to likewise happen in that sysfs entry point, rather than deep down
in the call stack upon accessing the i2c bus.

Thanks,

Lukas

> 
> Sample:
> [  246.669625] INFO: task kworker/4:0:37 blocked for more than 120 seconds.
> [  246.673398]       Not tainted 4.18.0-rc5Lyude-Test+ #2
> [  246.675271] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  246.676527] kworker/4:0     D    0    37      2 0x80000000
> [  246.677580] Workqueue: events output_poll_execute [drm_kms_helper]
> [  246.678704] Call Trace:
> [  246.679753]  __schedule+0x322/0xaf0
> [  246.680916]  schedule+0x33/0x90
> [  246.681924]  schedule_preempt_disabled+0x15/0x20
> [  246.683023]  __mutex_lock+0x569/0x9a0
> [  246.684035]  ? kobject_uevent_env+0x117/0x7b0
> [  246.685132]  ? drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
> [  246.686179]  mutex_lock_nested+0x1b/0x20
> [  246.687278]  ? mutex_lock_nested+0x1b/0x20
> [  246.688307]  drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
> [  246.689420]  drm_fb_helper_output_poll_changed+0x23/0x30 [drm_kms_helper]
> [  246.690462]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
> [  246.691570]  output_poll_execute+0x198/0x1c0 [drm_kms_helper]
> [  246.692611]  process_one_work+0x231/0x620
> [  246.693725]  worker_thread+0x214/0x3a0
> [  246.694756]  kthread+0x12b/0x150
> [  246.695856]  ? wq_pool_ids_show+0x140/0x140
> [  246.696888]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  246.697998]  ret_from_fork+0x3a/0x50
> [  246.699034] INFO: task kworker/0:1:60 blocked for more than 120 seconds.
> [  246.700153]       Not tainted 4.18.0-rc5Lyude-Test+ #2
> [  246.701182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  246.702278] kworker/0:1     D    0    60      2 0x80000000
> [  246.703293] Workqueue: pm pm_runtime_work
> [  246.704393] Call Trace:
> [  246.705403]  __schedule+0x322/0xaf0
> [  246.706439]  ? wait_for_completion+0x104/0x190
> [  246.707393]  schedule+0x33/0x90
> [  246.708375]  schedule_timeout+0x3a5/0x590
> [  246.709289]  ? mark_held_locks+0x58/0x80
> [  246.710208]  ? _raw_spin_unlock_irq+0x2c/0x40
> [  246.711222]  ? wait_for_completion+0x104/0x190
> [  246.712134]  ? trace_hardirqs_on_caller+0xf4/0x190
> [  246.713094]  ? wait_for_completion+0x104/0x190
> [  246.713964]  wait_for_completion+0x12c/0x190
> [  246.714895]  ? wake_up_q+0x80/0x80
> [  246.715727]  ? get_work_pool+0x90/0x90
> [  246.716649]  flush_work+0x1c9/0x280
> [  246.717483]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> [  246.718442]  __cancel_work_timer+0x146/0x1d0
> [  246.719247]  cancel_delayed_work_sync+0x13/0x20
> [  246.720043]  drm_kms_helper_poll_disable+0x1f/0x30 [drm_kms_helper]
> [  246.721123]  nouveau_pmops_runtime_suspend+0x3d/0xb0 [nouveau]
> [  246.721897]  pci_pm_runtime_suspend+0x6b/0x190
> [  246.722825]  ? pci_has_legacy_pm_support+0x70/0x70
> [  246.723737]  __rpm_callback+0x7a/0x1d0
> [  246.724721]  ? pci_has_legacy_pm_support+0x70/0x70
> [  246.725607]  rpm_callback+0x24/0x80
> [  246.726553]  ? pci_has_legacy_pm_support+0x70/0x70
> [  246.727376]  rpm_suspend+0x142/0x6b0
> [  246.728185]  pm_runtime_work+0x97/0xc0
> [  246.728938]  process_one_work+0x231/0x620
> [  246.729796]  worker_thread+0x44/0x3a0
> [  246.730614]  kthread+0x12b/0x150
> [  246.731395]  ? wq_pool_ids_show+0x140/0x140
> [  246.732202]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  246.732878]  ret_from_fork+0x3a/0x50
> [  246.733768] INFO: task kworker/4:2:422 blocked for more than 120 seconds.
> [  246.734587]       Not tainted 4.18.0-rc5Lyude-Test+ #2
> [  246.735393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  246.736113] kworker/4:2     D    0   422      2 0x80000080
> [  246.736789] Workqueue: events_long drm_dp_mst_link_probe_work [drm_kms_helper]
> [  246.737665] Call Trace:
> [  246.738490]  __schedule+0x322/0xaf0
> [  246.739250]  schedule+0x33/0x90
> [  246.739908]  rpm_resume+0x19c/0x850
> [  246.740750]  ? finish_wait+0x90/0x90
> [  246.741541]  __pm_runtime_resume+0x4e/0x90
> [  246.742370]  nv50_disp_atomic_commit+0x31/0x210 [nouveau]
> [  246.743124]  drm_atomic_commit+0x4a/0x50 [drm]
> [  246.743775]  restore_fbdev_mode_atomic+0x1c8/0x240 [drm_kms_helper]
> [  246.744603]  restore_fbdev_mode+0x31/0x140 [drm_kms_helper]
> [  246.745373]  drm_fb_helper_restore_fbdev_mode_unlocked+0x54/0xb0 [drm_kms_helper]
> [  246.746220]  drm_fb_helper_set_par+0x2d/0x50 [drm_kms_helper]
> [  246.746884]  drm_fb_helper_hotplug_event.part.28+0x96/0xb0 [drm_kms_helper]
> [  246.747675]  drm_fb_helper_output_poll_changed+0x23/0x30 [drm_kms_helper]
> [  246.748544]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
> [  246.749439]  nv50_mstm_hotplug+0x15/0x20 [nouveau]
> [  246.750111]  drm_dp_send_link_address+0x177/0x1c0 [drm_kms_helper]
> [  246.750764]  drm_dp_check_and_send_link_address+0xa8/0xd0 [drm_kms_helper]
> [  246.751602]  drm_dp_mst_link_probe_work+0x51/0x90 [drm_kms_helper]
> [  246.752314]  process_one_work+0x231/0x620
> [  246.752979]  worker_thread+0x44/0x3a0
> [  246.753838]  kthread+0x12b/0x150
> [  246.754619]  ? wq_pool_ids_show+0x140/0x140
> [  246.755386]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  246.756162]  ret_from_fork+0x3a/0x50
> [  246.756847]
>                Showing all locks held in the system:
> [  246.758261] 3 locks held by kworker/4:0/37:
> [  246.759016]  #0: 00000000f8df4d2d ((wq_completion)"events"){+.+.}, at: process_one_work+0x1b3/0x620
> [  246.759856]  #1: 00000000e6065461 ((work_completion)(&(&dev->mode_config.output_poll_work)->work)){+.+.}, at: process_one_work+0x1b3/0x620
> [  246.760670]  #2: 00000000cb66735f (&helper->lock){+.+.}, at: drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
> [  246.761516] 2 locks held by kworker/0:1/60:
> [  246.762274]  #0: 00000000fff6be0f ((wq_completion)"pm"){+.+.}, at: process_one_work+0x1b3/0x620
> [  246.762982]  #1: 000000005ab44fb4 ((work_completion)(&dev->power.work)){+.+.}, at: process_one_work+0x1b3/0x620
> [  246.763890] 1 lock held by khungtaskd/64:
> [  246.764664]  #0: 000000008cb8b5c3 (rcu_read_lock){....}, at: debug_show_all_locks+0x23/0x185
> [  246.765588] 5 locks held by kworker/4:2/422:
> [  246.766440]  #0: 00000000232f0959 ((wq_completion)"events_long"){+.+.}, at: process_one_work+0x1b3/0x620
> [  246.767390]  #1: 00000000bb59b134 ((work_completion)(&mgr->work)){+.+.}, at: process_one_work+0x1b3/0x620
> [  246.768154]  #2: 00000000cb66735f (&helper->lock){+.+.}, at: drm_fb_helper_restore_fbdev_mode_unlocked+0x4c/0xb0 [drm_kms_helper]
> [  246.768966]  #3: 000000004c8f0b6b (crtc_ww_class_acquire){+.+.}, at: restore_fbdev_mode_atomic+0x4b/0x240 [drm_kms_helper]
> [  246.769921]  #4: 000000004c34a296 (crtc_ww_class_mutex){+.+.}, at: drm_modeset_backoff+0x8a/0x1b0 [drm]
> [  246.770839] 1 lock held by dmesg/1038:
> [  246.771739] 2 locks held by zsh/1172:
> [  246.772650]  #0: 00000000836d0438 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> [  246.773680]  #1: 000000001f4f4d48 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0xc1/0x870
> 
> [  246.775522] =============================================
> 
> So, to fix this (and any other possible deadlock issues like this that
> could occur in the output_poll_changed patch) we make sure that
> nouveau's output_poll_changed functions grab a runtime power ref before
> sending any hotplug events, and hold it until we're finished. We
> introduce this through adding a generic DRM helper which other drivers
> may reuse.
> 
> This fixes deadlock issues when in fbcon with nouveau on my P50, and
> should fix it for everyone else's as well!
> 
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Reviewed-by: Karol Herbst <karolherbst@gmail.com>
> Cc: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org
> ---
> Changes since v1:
>  - Add a generic helper for DRM to handle this
> 
>  drivers/gpu/drm/drm_fb_helper.c         | 23 +++++++++++++++++++++++
>  drivers/gpu/drm/nouveau/dispnv50/disp.c |  2 +-
>  include/drm/drm_fb_helper.h             |  5 +++++
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
> index 2ee1eaa66188..1ab2f3646526 100644
> --- a/drivers/gpu/drm/drm_fb_helper.c
> +++ b/drivers/gpu/drm/drm_fb_helper.c
> @@ -34,6 +34,7 @@
>  #include <linux/sysrq.h>
>  #include <linux/slab.h>
>  #include <linux/module.h>
> +#include <linux/pm_runtime.h>
>  #include <drm/drmP.h>
>  #include <drm/drm_crtc.h>
>  #include <drm/drm_fb_helper.h>
> @@ -2928,6 +2929,28 @@ void drm_fb_helper_output_poll_changed(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_fb_helper_output_poll_changed);
>  
> +/**
> + * drm_fb_helper_output_poll_changed_with_rpm - DRM mode
> + *                                              config \.output_poll_changed
> + *                                              helper for fbdev emulation
> + * @dev: DRM device
> + *
> + * Same as drm_fb_helper_output_poll_changed, except that it makes sure that
> + * the device is active by synchronously grabbing a runtime power reference
> + * while probing.
> + */
> +void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device *dev)
> +{
> +	int ret;
> +
> +	ret = pm_runtime_get_sync(dev->dev);
> +	if (WARN_ON(ret < 0 && ret != -EACCES))
> +		return;
> +	drm_fb_helper_hotplug_event(dev->fb_helper);
> +	pm_runtime_put(dev->dev);
> +}
> +EXPORT_SYMBOL(drm_fb_helper_output_poll_changed_with_rpm);
> +
>  /* The Kconfig DRM_KMS_HELPER selects FRAMEBUFFER_CONSOLE (if !EXPERT)
>   * but the module doesn't depend on any fb console symbols.  At least
>   * attempt to load fbcon to avoid leaving the system without a usable console.
> diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> index eb3e41a78806..fa3ab618a0f9 100644
> --- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
> +++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> @@ -2015,7 +2015,7 @@ nv50_disp_atomic_state_alloc(struct drm_device *dev)
>  static const struct drm_mode_config_funcs
>  nv50_disp_func = {
>  	.fb_create = nouveau_user_framebuffer_create,
> -	.output_poll_changed = drm_fb_helper_output_poll_changed,
> +	.output_poll_changed = drm_fb_helper_output_poll_changed_with_rpm,
>  	.atomic_check = nv50_disp_atomic_check,
>  	.atomic_commit = nv50_disp_atomic_commit,
>  	.atomic_state_alloc = nv50_disp_atomic_state_alloc,
> diff --git a/include/drm/drm_fb_helper.h b/include/drm/drm_fb_helper.h
> index b069433e7fc1..ca809bfbaebb 100644
> --- a/include/drm/drm_fb_helper.h
> +++ b/include/drm/drm_fb_helper.h
> @@ -330,6 +330,7 @@ void drm_fb_helper_fbdev_teardown(struct drm_device *dev);
>  
>  void drm_fb_helper_lastclose(struct drm_device *dev);
>  void drm_fb_helper_output_poll_changed(struct drm_device *dev);
> +void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device *dev);
>  #else
>  static inline void drm_fb_helper_prepare(struct drm_device *dev,
>  					struct drm_fb_helper *helper,
> @@ -564,6 +565,10 @@ static inline void drm_fb_helper_output_poll_changed(struct drm_device *dev)
>  {
>  }
>  
> +static inline void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device *dev)
> +{
> +}
> +
>  #endif
>  
>  static inline int
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] drm/probe_helper: Add drm_helper_probe_single_connector_modes_with_rpm()
  2018-07-18 20:56 ` [PATCH 2/2] drm/probe_helper: Add drm_helper_probe_single_connector_modes_with_rpm() Lyude Paul
@ 2018-07-19  8:08   ` Lukas Wunner
  0 siblings, 0 replies; 10+ messages in thread
From: Lukas Wunner @ 2018-07-19  8:08 UTC (permalink / raw)
  To: Lyude Paul
  Cc: nouveau, Gustavo Padovan, Maarten Lankhorst, Sean Paul,
	David Airlie, Ben Skeggs, Daniel Vetter, Ville Syrjälä,
	dri-devel, linux-kernel

On Wed, Jul 18, 2018 at 04:56:40PM -0400, Lyude Paul wrote:
> For nouveau, while the GPU is guaranteed to be on when a hotplug has
> been received, the same assertion does not hold true if a connector
> probe has been started by userspace without having had received a sysfs
> event.
> 
> So ensure that any connector probing keeps the GPU alive for the
> duration of the probe by introducing
> drm_helper_probe_single_connector_modes_with_rpm(). It's the same as
> drm_helper_probe_single_connector_modes, but it handles holding a power
> reference to the device for the duration of the connector probe.

Hm, a runtime PM ref is already acquired in nouveau_connector_detect().
I'm wondering why that's not sufficient?

Thanks,

Lukas

> 
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Reviewed-by: Karol Herbst <karolherbst@gmail.com>
> Cc: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org
> ---
> Changes since v1:
>  - Add a generic helper to DRM to handle this
> 
>  drivers/gpu/drm/drm_probe_helper.c          | 31 +++++++++++++++++++++
>  drivers/gpu/drm/nouveau/dispnv50/disp.c     |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_connector.c |  4 +--
>  include/drm/drm_crtc_helper.h               |  7 +++--
>  4 files changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_probe_helper.c b/drivers/gpu/drm/drm_probe_helper.c
> index 527743394150..0a9d6748b854 100644
> --- a/drivers/gpu/drm/drm_probe_helper.c
> +++ b/drivers/gpu/drm/drm_probe_helper.c
> @@ -31,6 +31,7 @@
>  
>  #include <linux/export.h>
>  #include <linux/moduleparam.h>
> +#include <linux/pm_runtime.h>
>  
>  #include <drm/drmP.h>
>  #include <drm/drm_crtc.h>
> @@ -541,6 +542,36 @@ int drm_helper_probe_single_connector_modes(struct drm_connector *connector,
>  }
>  EXPORT_SYMBOL(drm_helper_probe_single_connector_modes);
>  
> +/**
> + * drm_helper_probe_single_connector_modes_with_rpm - get complete set of
> + *                                                    display modes
> + * @connector: connector to probe
> + * @maxX: max width for modes
> + * @maxY: max height for modes
> + *
> + * Same as drm_helper_probe_single_connector_modes, except that it makes sure
> + * that the device is active by synchronously grabbing a runtime power
> + * reference while probing.
> + *
> + * Returns:
> + * The number of modes found on @connector.
> + */
> +int drm_helper_probe_single_connector_modes_with_rpm(struct drm_connector *connector,
> +						     uint32_t maxX, uint32_t maxY)
> +{
> +	int ret;
> +
> +	ret = pm_runtime_get_sync(connector->dev->dev);
> +	if (ret < 0 && ret != -EACCES)
> +		return ret;
> +
> +	ret = drm_helper_probe_single_connector_modes(connector, maxX, maxY);
> +
> +	pm_runtime_put(connector->dev->dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL(drm_helper_probe_single_connector_modes_with_rpm);
> +
>  /**
>   * drm_kms_helper_hotplug_event - fire off KMS hotplug events
>   * @dev: drm_device whose connector state changed
> diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> index fa3ab618a0f9..c54767b50fd8 100644
> --- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
> +++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> @@ -858,7 +858,7 @@ static const struct drm_connector_funcs
>  nv50_mstc = {
>  	.reset = nouveau_conn_reset,
>  	.detect = nv50_mstc_detect,
> -	.fill_modes = drm_helper_probe_single_connector_modes,
> +	.fill_modes = drm_helper_probe_single_connector_modes_with_rpm,
>  	.destroy = nv50_mstc_destroy,
>  	.atomic_duplicate_state = nouveau_conn_atomic_duplicate_state,
>  	.atomic_destroy_state = nouveau_conn_atomic_destroy_state,
> diff --git a/drivers/gpu/drm/nouveau/nouveau_connector.c b/drivers/gpu/drm/nouveau/nouveau_connector.c
> index 2a45b4c2ceb0..8d9070779261 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_connector.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_connector.c
> @@ -1088,7 +1088,7 @@ nouveau_connector_funcs = {
>  	.reset = nouveau_conn_reset,
>  	.detect = nouveau_connector_detect,
>  	.force = nouveau_connector_force,
> -	.fill_modes = drm_helper_probe_single_connector_modes,
> +	.fill_modes = drm_helper_probe_single_connector_modes_with_rpm,
>  	.set_property = nouveau_connector_set_property,
>  	.destroy = nouveau_connector_destroy,
>  	.atomic_duplicate_state = nouveau_conn_atomic_duplicate_state,
> @@ -1103,7 +1103,7 @@ nouveau_connector_funcs_lvds = {
>  	.reset = nouveau_conn_reset,
>  	.detect = nouveau_connector_detect_lvds,
>  	.force = nouveau_connector_force,
> -	.fill_modes = drm_helper_probe_single_connector_modes,
> +	.fill_modes = drm_helper_probe_single_connector_modes_with_rpm,
>  	.set_property = nouveau_connector_set_property,
>  	.destroy = nouveau_connector_destroy,
>  	.atomic_duplicate_state = nouveau_conn_atomic_duplicate_state,
> diff --git a/include/drm/drm_crtc_helper.h b/include/drm/drm_crtc_helper.h
> index 6914633037a5..8f3f6d6fcc8c 100644
> --- a/include/drm/drm_crtc_helper.h
> +++ b/include/drm/drm_crtc_helper.h
> @@ -64,9 +64,10 @@ int drm_helper_crtc_mode_set_base(struct drm_crtc *crtc, int x, int y,
>  				  struct drm_framebuffer *old_fb);
>  
>  /* drm_probe_helper.c */
> -int drm_helper_probe_single_connector_modes(struct drm_connector
> -					    *connector, uint32_t maxX,
> -					    uint32_t maxY);
> +int drm_helper_probe_single_connector_modes(struct drm_connector *connector,
> +					    uint32_t maxX, uint32_t maxY);
> +int drm_helper_probe_single_connector_modes_with_rpm(struct drm_connector *connector,
> +						     uint32_t maxX, uint32_t maxY);
>  int drm_helper_probe_detect(struct drm_connector *connector,
>  			    struct drm_modeset_acquire_ctx *ctx,
>  			    bool force);
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm()
  2018-07-19  7:49   ` Lukas Wunner
@ 2018-07-20  0:08     ` Lyude Paul
  2018-07-20  0:17       ` Lyude Paul
  2018-07-21  9:39       ` Lukas Wunner
  0 siblings, 2 replies; 10+ messages in thread
From: Lyude Paul @ 2018-07-20  0:08 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: nouveau, Gustavo Padovan, Maarten Lankhorst, Sean Paul,
	David Airlie, Ben Skeggs, Daniel Vetter, Ville Syrjälä,
	dri-devel, linux-kernel

On Thu, 2018-07-19 at 09:49 +0200, Lukas Wunner wrote:
> On Wed, Jul 18, 2018 at 04:56:39PM -0400, Lyude Paul wrote:
> > When DP MST hubs get confused, they can occasionally stop responding for
> > a good bit of time up until the point where the DRM driver manages to
> > do the right DPCD accesses to get it to start responding again. In a
> > worst case scenario however, this process can take upwards of 10+
> > seconds.
> > 
> > Currently we use the default output_poll_changed handler
> > drm_fb_helper_output_poll_changed() to handle output polling, which
> > doesn't happen to grab any power references on the device when polling.
> > If we're unlucky enough to have a hub (such as Lenovo's infamous laptop
> > docks for the P5x/P7x series) that's easily startled and confused, this
> > can lead to a pretty nasty deadlock situation that looks like this:
> > 
> > - Hotplug event from hub happens, we enter
> >   drm_fb_helper_output_poll_changed() and start communicating with the
> >   hub
> > - While we're in drm_fb_helper_output_poll_changed() and attempting to
> >   communicate with the hub, we end up confusing it and cause it to stop
> >   responding for at least 10 seconds
> > - After 5 seconds of being in drm_fb_helper_output_poll_changed(), the
> >   pm core attempts to put the GPU into autosuspend, which ends up
> >   calling drm_kms_helper_poll_disable()
> > - While the runtime PM core is waiting in drm_kms_helper_poll_disable()
> >   for the output poll to finish, we end up finally detecting an MST
> >   display
> > - We notice the new display and tries to enable it, which triggers
> >   an atomic commit which triggers a call to pm_runtime_get_sync()
> > - the output poll thread deadlocks the pm core waiting for the pm core
> >   to finish the autosuspend request while the pm core waits for the
> >   output poll thread to finish
> 
> This will still deadlock if ->runtime_suspend commences before the
> hotplug event and the hotplug event occurs before polling has been
> disabled by ->runtime_suspend.
> 
> The correct fix is to call pm_runtime_get_sync() *conditionally* in
> the atomic commit which enables the display, using the same conditional
> as d61a5c106351, i.e. if (!drm_kms_helper_is_poll_worker()).
> 
> Now I realize I sent you down the wrong path when I suggested to
> introduce a DRM helper here.  My apologies, I didn't fully appreciate
> what is going awry here!
> 
> Anything that happens in nouveau's poll worker never needs to acquire
> a runtime PM ref because polling is only enabled while runtime active,
> and ->runtime_suspend waits for an ongoing poll to finish.

Hopefully I'm not missing anything here, but this doesn't really make any
sense. I'm not sure I explained myself as well as I thought I did. To be
honest, I had to double check this about literally 20 times to make sure I was
actually understanding this issue correctly. Turns out I was missing a couple
of parts, so I'm going to try again at explaining this using a diagram that
shows the various threads running concurrently

START: Driver load
       |
       |
       |           Thread 1
        ----- output_poll_execute()
                       |
           drm_helper_hpd_irq_event()
                       |
                       |  Schedules →            Thread 2
                       ----------------- nouveau_display_hpd_work()
                    Finish                          |
                                           pm_runtime_get_sync() <--- keeps GPU alive
                                                    |
                                                   ...
                                                    |
                       ------------------------------
                <still Thread 2>
             drm_helper_hpd_irq_event()
                       |
                       |         Schedules →
                       ------------------------------
                       |                         Thread 3
Drop last RPM ref -v   |                   output_poll_execute()
             pm_runtime_put_sync()                  |
                       |             drm_dp_mst_topology_mgr_set_mst()
                    Finish                          |
                                                    |
                               ← Schedules          |
                       -----------------------------|
                       |                            |
                    Thread 4                        |
                       |                            |
          drm_dp_mst_link_probe_work()  drm_dp_mst_wait_tx_reply() <-- these wait on eachother
                       |                            |                     this is normal
                       |    Sideband transactions   |
                       |    happen here, this is    |
                       |    where timeouts happen   |
                       |                            |
                       |      5 seconds later...    | autosuspend delay kicks in
                       |            ...             |                        |
                       |                            |                     Thread 5
                       |  Communication + timeouts  |                 pm_runtime_work
                       |  are still happening here  |                        |
                       |                            |           nouveau_pmops_runtime_suspend()
                       |                            |                        |
                       |  Success! We can enable    |                        |
                       |        displays now!       |            drm_kms_helper_poll_disable()
                       |                            |                        |
                       |                 *Atomic commit begins*              |
                       |                            |     <-------------     |
                       | More sideband transactions |       Waiting on       |
                       |          ......            |  output_poll_execute() |
                       |                            |     <-------------     |
                       |                  pm_runtime_get_sync()              |
                       |                            |                        |
                       |   -------------------->    |     ------------->     |
                       |        Waiting on          |       Waiting on       |
                       |     output_poll_exec()     |    suspend requests    |
                       |   -------------------->    |     ------------->     |
                       |                            |                        |
                       ----------------------------------------------------->|
                                                                         DEADLOCK

phew. that took a LONG time to come up with.

Anyway-that's why your explanation doesn't make sense: the deadlock is
happening because we're calling pm_runtime_get_sync(). If we were to make that
call conditional (e.g. drm_kms_helper_is_poll_worker()), all that would mean
is that we wouldn't grab any runtime power reference and the GPU would
immediately suspend once the atomic commit finished, as the suspend request in
Thread 5 would finally get unblocked and thus----suspend.

Hopefully I explained that better this time, I'll definitely make sure to
actually include that diagram in the patch. As for whether or not this patch
is even the right solution, I will need to confirm that tommorrow (if you
don't think it is still, please feel free to say so!) because it's getting
late here.

Cheers!
	Lyude

> 
> Thinking a bit more about this, our mistake is to acquire runtime PM
> refs too far down in the call stack.  As a fix that can be backported
> to stable, adding if (!drm_kms_helper_is_poll_worker()) conditionals
> seems fine to me, but the long term fix is to push acquisition of refs
> further up in the call stack.
> 
> E.g., if the user forces connector probing via sysfs, a runtime PM ref
> should be acquired in status_store() in drm_sysfs.c before invoking
> connector->funcs->fill_modes().  That way, if the user forces connector
> probing while the GPU is powering down, rpm_resume() will correctly wait
> for rpm_suspend() to finish before resuming the card.  So if we architect
> it like this, we're actually using the functionality provided by the
> PM core in the way that it's supposed to be used.
> 
> The problem is that adding pm_runtime_get_sync() to status_store()
> conflicts with the desire to have a library of generic DRM functions:
> Some GPUs may be able to probe connectors without resuming to runtime
> active state, others don't use runtime PM at all.  One solution that
> comes to mind is a driver_features flag which tells the DRM core whether
> to acquire a runtime PM ref in various places.
> 
> In your original patches 4 and 5, what exactly was the call stack which
> led to i2c being accessed while runtime suspended?  Was it sysfs access
> via /sys/class/i2c-adapter/* ?  If so, acquisition of the runtime PM ref
> needs to likewise happen in that sysfs entry point, rather than deep down
> in the call stack upon accessing the i2c bus.
> 
> Thanks,
> 
> Lukas
> 
> > 
> > Sample:
> > [  246.669625] INFO: task kworker/4:0:37 blocked for more than 120
> > seconds.
> > [  246.673398]       Not tainted 4.18.0-rc5Lyude-Test+ #2
> > [  246.675271] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [  246.676527] kworker/4:0     D    0    37      2 0x80000000
> > [  246.677580] Workqueue: events output_poll_execute [drm_kms_helper]
> > [  246.678704] Call Trace:
> > [  246.679753]  __schedule+0x322/0xaf0
> > [  246.680916]  schedule+0x33/0x90
> > [  246.681924]  schedule_preempt_disabled+0x15/0x20
> > [  246.683023]  __mutex_lock+0x569/0x9a0
> > [  246.684035]  ? kobject_uevent_env+0x117/0x7b0
> > [  246.685132]  ? drm_fb_helper_hotplug_event.part.28+0x20/0xb0
> > [drm_kms_helper]
> > [  246.686179]  mutex_lock_nested+0x1b/0x20
> > [  246.687278]  ? mutex_lock_nested+0x1b/0x20
> > [  246.688307]  drm_fb_helper_hotplug_event.part.28+0x20/0xb0
> > [drm_kms_helper]
> > [  246.689420]  drm_fb_helper_output_poll_changed+0x23/0x30
> > [drm_kms_helper]
> > [  246.690462]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
> > [  246.691570]  output_poll_execute+0x198/0x1c0 [drm_kms_helper]
> > [  246.692611]  process_one_work+0x231/0x620
> > [  246.693725]  worker_thread+0x214/0x3a0
> > [  246.694756]  kthread+0x12b/0x150
> > [  246.695856]  ? wq_pool_ids_show+0x140/0x140
> > [  246.696888]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [  246.697998]  ret_from_fork+0x3a/0x50
> > [  246.699034] INFO: task kworker/0:1:60 blocked for more than 120
> > seconds.
> > [  246.700153]       Not tainted 4.18.0-rc5Lyude-Test+ #2
> > [  246.701182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [  246.702278] kworker/0:1     D    0    60      2 0x80000000
> > [  246.703293] Workqueue: pm pm_runtime_work
> > [  246.704393] Call Trace:
> > [  246.705403]  __schedule+0x322/0xaf0
> > [  246.706439]  ? wait_for_completion+0x104/0x190
> > [  246.707393]  schedule+0x33/0x90
> > [  246.708375]  schedule_timeout+0x3a5/0x590
> > [  246.709289]  ? mark_held_locks+0x58/0x80
> > [  246.710208]  ? _raw_spin_unlock_irq+0x2c/0x40
> > [  246.711222]  ? wait_for_completion+0x104/0x190
> > [  246.712134]  ? trace_hardirqs_on_caller+0xf4/0x190
> > [  246.713094]  ? wait_for_completion+0x104/0x190
> > [  246.713964]  wait_for_completion+0x12c/0x190
> > [  246.714895]  ? wake_up_q+0x80/0x80
> > [  246.715727]  ? get_work_pool+0x90/0x90
> > [  246.716649]  flush_work+0x1c9/0x280
> > [  246.717483]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> > [  246.718442]  __cancel_work_timer+0x146/0x1d0
> > [  246.719247]  cancel_delayed_work_sync+0x13/0x20
> > [  246.720043]  drm_kms_helper_poll_disable+0x1f/0x30 [drm_kms_helper]
> > [  246.721123]  nouveau_pmops_runtime_suspend+0x3d/0xb0 [nouveau]
> > [  246.721897]  pci_pm_runtime_suspend+0x6b/0x190
> > [  246.722825]  ? pci_has_legacy_pm_support+0x70/0x70
> > [  246.723737]  __rpm_callback+0x7a/0x1d0
> > [  246.724721]  ? pci_has_legacy_pm_support+0x70/0x70
> > [  246.725607]  rpm_callback+0x24/0x80
> > [  246.726553]  ? pci_has_legacy_pm_support+0x70/0x70
> > [  246.727376]  rpm_suspend+0x142/0x6b0
> > [  246.728185]  pm_runtime_work+0x97/0xc0
> > [  246.728938]  process_one_work+0x231/0x620
> > [  246.729796]  worker_thread+0x44/0x3a0
> > [  246.730614]  kthread+0x12b/0x150
> > [  246.731395]  ? wq_pool_ids_show+0x140/0x140
> > [  246.732202]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [  246.732878]  ret_from_fork+0x3a/0x50
> > [  246.733768] INFO: task kworker/4:2:422 blocked for more than 120
> > seconds.
> > [  246.734587]       Not tainted 4.18.0-rc5Lyude-Test+ #2
> > [  246.735393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [  246.736113] kworker/4:2     D    0   422      2 0x80000080
> > [  246.736789] Workqueue: events_long drm_dp_mst_link_probe_work
> > [drm_kms_helper]
> > [  246.737665] Call Trace:
> > [  246.738490]  __schedule+0x322/0xaf0
> > [  246.739250]  schedule+0x33/0x90
> > [  246.739908]  rpm_resume+0x19c/0x850
> > [  246.740750]  ? finish_wait+0x90/0x90
> > [  246.741541]  __pm_runtime_resume+0x4e/0x90
> > [  246.742370]  nv50_disp_atomic_commit+0x31/0x210 [nouveau]
> > [  246.743124]  drm_atomic_commit+0x4a/0x50 [drm]
> > [  246.743775]  restore_fbdev_mode_atomic+0x1c8/0x240 [drm_kms_helper]
> > [  246.744603]  restore_fbdev_mode+0x31/0x140 [drm_kms_helper]
> > [  246.745373]  drm_fb_helper_restore_fbdev_mode_unlocked+0x54/0xb0
> > [drm_kms_helper]
> > [  246.746220]  drm_fb_helper_set_par+0x2d/0x50 [drm_kms_helper]
> > [  246.746884]  drm_fb_helper_hotplug_event.part.28+0x96/0xb0
> > [drm_kms_helper]
> > [  246.747675]  drm_fb_helper_output_poll_changed+0x23/0x30
> > [drm_kms_helper]
> > [  246.748544]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
> > [  246.749439]  nv50_mstm_hotplug+0x15/0x20 [nouveau]
> > [  246.750111]  drm_dp_send_link_address+0x177/0x1c0 [drm_kms_helper]
> > [  246.750764]  drm_dp_check_and_send_link_address+0xa8/0xd0
> > [drm_kms_helper]
> > [  246.751602]  drm_dp_mst_link_probe_work+0x51/0x90 [drm_kms_helper]
> > [  246.752314]  process_one_work+0x231/0x620
> > [  246.752979]  worker_thread+0x44/0x3a0
> > [  246.753838]  kthread+0x12b/0x150
> > [  246.754619]  ? wq_pool_ids_show+0x140/0x140
> > [  246.755386]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [  246.756162]  ret_from_fork+0x3a/0x50
> > [  246.756847]
> >                Showing all locks held in the system:
> > [  246.758261] 3 locks held by kworker/4:0/37:
> > [  246.759016]  #0: 00000000f8df4d2d ((wq_completion)"events"){+.+.}, at:
> > process_one_work+0x1b3/0x620
> > [  246.759856]  #1: 00000000e6065461 ((work_completion)(&(&dev-
> > >mode_config.output_poll_work)->work)){+.+.}, at:
> > process_one_work+0x1b3/0x620
> > [  246.760670]  #2: 00000000cb66735f (&helper->lock){+.+.}, at:
> > drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
> > [  246.761516] 2 locks held by kworker/0:1/60:
> > [  246.762274]  #0: 00000000fff6be0f ((wq_completion)"pm"){+.+.}, at:
> > process_one_work+0x1b3/0x620
> > [  246.762982]  #1: 000000005ab44fb4 ((work_completion)(&dev-
> > >power.work)){+.+.}, at: process_one_work+0x1b3/0x620
> > [  246.763890] 1 lock held by khungtaskd/64:
> > [  246.764664]  #0: 000000008cb8b5c3 (rcu_read_lock){....}, at:
> > debug_show_all_locks+0x23/0x185
> > [  246.765588] 5 locks held by kworker/4:2/422:
> > [  246.766440]  #0: 00000000232f0959 ((wq_completion)"events_long"){+.+.},
> > at: process_one_work+0x1b3/0x620
> > [  246.767390]  #1: 00000000bb59b134 ((work_completion)(&mgr-
> > >work)){+.+.}, at: process_one_work+0x1b3/0x620
> > [  246.768154]  #2: 00000000cb66735f (&helper->lock){+.+.}, at:
> > drm_fb_helper_restore_fbdev_mode_unlocked+0x4c/0xb0 [drm_kms_helper]
> > [  246.768966]  #3: 000000004c8f0b6b (crtc_ww_class_acquire){+.+.}, at:
> > restore_fbdev_mode_atomic+0x4b/0x240 [drm_kms_helper]
> > [  246.769921]  #4: 000000004c34a296 (crtc_ww_class_mutex){+.+.}, at:
> > drm_modeset_backoff+0x8a/0x1b0 [drm]
> > [  246.770839] 1 lock held by dmesg/1038:
> > [  246.771739] 2 locks held by zsh/1172:
> > [  246.772650]  #0: 00000000836d0438 (&tty->ldisc_sem){++++}, at:
> > ldsem_down_read+0x37/0x40
> > [  246.773680]  #1: 000000001f4f4d48 (&ldata->atomic_read_lock){+.+.}, at:
> > n_tty_read+0xc1/0x870
> > 
> > [  246.775522] =============================================
> > 
> > So, to fix this (and any other possible deadlock issues like this that
> > could occur in the output_poll_changed patch) we make sure that
> > nouveau's output_poll_changed functions grab a runtime power ref before
> > sending any hotplug events, and hold it until we're finished. We
> > introduce this through adding a generic DRM helper which other drivers
> > may reuse.
> > 
> > This fixes deadlock issues when in fbcon with nouveau on my P50, and
> > should fix it for everyone else's as well!
> > 
> > Signed-off-by: Lyude Paul <lyude@redhat.com>
> > Reviewed-by: Karol Herbst <karolherbst@gmail.com>
> > Cc: Lukas Wunner <lukas@wunner.de>
> > Cc: stable@vger.kernel.org
> > ---
> > Changes since v1:
> >  - Add a generic helper for DRM to handle this
> > 
> >  drivers/gpu/drm/drm_fb_helper.c         | 23 +++++++++++++++++++++++
> >  drivers/gpu/drm/nouveau/dispnv50/disp.c |  2 +-
> >  include/drm/drm_fb_helper.h             |  5 +++++
> >  3 files changed, 29 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_fb_helper.c
> > b/drivers/gpu/drm/drm_fb_helper.c
> > index 2ee1eaa66188..1ab2f3646526 100644
> > --- a/drivers/gpu/drm/drm_fb_helper.c
> > +++ b/drivers/gpu/drm/drm_fb_helper.c
> > @@ -34,6 +34,7 @@
> >  #include <linux/sysrq.h>
> >  #include <linux/slab.h>
> >  #include <linux/module.h>
> > +#include <linux/pm_runtime.h>
> >  #include <drm/drmP.h>
> >  #include <drm/drm_crtc.h>
> >  #include <drm/drm_fb_helper.h>
> > @@ -2928,6 +2929,28 @@ void drm_fb_helper_output_poll_changed(struct
> > drm_device *dev)
> >  }
> >  EXPORT_SYMBOL(drm_fb_helper_output_poll_changed);
> >  
> > +/**
> > + * drm_fb_helper_output_poll_changed_with_rpm - DRM mode
> > + *                                              config
> > \.output_poll_changed
> > + *                                              helper for fbdev
> > emulation
> > + * @dev: DRM device
> > + *
> > + * Same as drm_fb_helper_output_poll_changed, except that it makes sure
> > that
> > + * the device is active by synchronously grabbing a runtime power
> > reference
> > + * while probing.
> > + */
> > +void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device *dev)
> > +{
> > +	int ret;
> > +
> > +	ret = pm_runtime_get_sync(dev->dev);
> > +	if (WARN_ON(ret < 0 && ret != -EACCES))
> > +		return;
> > +	drm_fb_helper_hotplug_event(dev->fb_helper);
> > +	pm_runtime_put(dev->dev);
> > +}
> > +EXPORT_SYMBOL(drm_fb_helper_output_poll_changed_with_rpm);
> > +
> >  /* The Kconfig DRM_KMS_HELPER selects FRAMEBUFFER_CONSOLE (if !EXPERT)
> >   * but the module doesn't depend on any fb console symbols.  At least
> >   * attempt to load fbcon to avoid leaving the system without a usable
> > console.
> > diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c
> > b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> > index eb3e41a78806..fa3ab618a0f9 100644
> > --- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
> > +++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> > @@ -2015,7 +2015,7 @@ nv50_disp_atomic_state_alloc(struct drm_device *dev)
> >  static const struct drm_mode_config_funcs
> >  nv50_disp_func = {
> >  	.fb_create = nouveau_user_framebuffer_create,
> > -	.output_poll_changed = drm_fb_helper_output_poll_changed,
> > +	.output_poll_changed =
> > drm_fb_helper_output_poll_changed_with_rpm,
> >  	.atomic_check = nv50_disp_atomic_check,
> >  	.atomic_commit = nv50_disp_atomic_commit,
> >  	.atomic_state_alloc = nv50_disp_atomic_state_alloc,
> > diff --git a/include/drm/drm_fb_helper.h b/include/drm/drm_fb_helper.h
> > index b069433e7fc1..ca809bfbaebb 100644
> > --- a/include/drm/drm_fb_helper.h
> > +++ b/include/drm/drm_fb_helper.h
> > @@ -330,6 +330,7 @@ void drm_fb_helper_fbdev_teardown(struct drm_device
> > *dev);
> >  
> >  void drm_fb_helper_lastclose(struct drm_device *dev);
> >  void drm_fb_helper_output_poll_changed(struct drm_device *dev);
> > +void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device *dev);
> >  #else
> >  static inline void drm_fb_helper_prepare(struct drm_device *dev,
> >  					struct drm_fb_helper *helper,
> > @@ -564,6 +565,10 @@ static inline void
> > drm_fb_helper_output_poll_changed(struct drm_device *dev)
> >  {
> >  }
> >  
> > +static inline void drm_fb_helper_output_poll_changed_with_rpm(struct
> > drm_device *dev)
> > +{
> > +}
> > +
> >  #endif
> >  
> >  static inline int
> > -- 
> > 2.17.1
> > 
-- 
Cheers,
	Lyude Paul

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm()
  2018-07-20  0:08     ` Lyude Paul
@ 2018-07-20  0:17       ` Lyude Paul
  2018-07-21  9:39       ` Lukas Wunner
  1 sibling, 0 replies; 10+ messages in thread
From: Lyude Paul @ 2018-07-20  0:17 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: nouveau, Gustavo Padovan, Maarten Lankhorst, Sean Paul,
	David Airlie, Ben Skeggs, Daniel Vetter, Ville Syrjälä,
	dri-devel, linux-kernel

On Thu, 2018-07-19 at 20:08 -0400, Lyude Paul wrote:
> [snip]
> > 
> > Anything that happens in nouveau's poll worker never needs to acquire
> > a runtime PM ref because polling is only enabled while runtime active,
> > and ->runtime_suspend waits for an ongoing poll to finish.
> 
> Hopefully I'm not missing anything here, but this doesn't really make any
> sense. I'm not sure I explained myself as well as I thought I did. To be
> honest, I had to double check this about literally 20 times to make sure I
> was
> actually understanding this issue correctly. Turns out I was missing a
> couple
> of parts, so I'm going to try again at explaining this using a diagram that
> shows the various threads running concurrently
> 
> START: Driver load
>        |
>        |
>        |           Thread 1
>         ----- output_poll_execute()
>                        |
>            drm_helper_hpd_irq_event()
>                        |
>                        |  Schedules →            Thread 2
>                        ----------------- nouveau_display_hpd_work()
>                     Finish                          |
>                                            pm_runtime_get_sync() <--- keeps
> GPU alive
>                                                     |
>                                                    ...
>                                                     |
>                        ------------------------------
>                 <still Thread 2>
>              drm_helper_hpd_irq_event()
>                        |
>                        |         Schedules →
>                        ------------------------------
>                        |                         Thread 3
> Drop last RPM ref -v   |                   output_poll_execute()
>              pm_runtime_put_sync()                  |

                      ^^^^^^
As an additional note, I realized this might seem wrong but it isn't

pm_runtime_put_sync() calls down to nouveau's runtime idle callback, which
does this:

static int
nouveau_pmops_runtime_idle(struct device *dev)
{
	if (!nouveau_pmops_runtime()) {
		pm_runtime_forbid(dev);
		return -EBUSY;
	}

	pm_runtime_mark_last_busy(dev);
	pm_runtime_autosuspend(dev);
	/* we don't want the main rpm_idle to call suspend - we want to
autosuspend */
	return 1;
}

So, it doesn't actually synchronously suspend the GPU, it just starts up the
autosuspend thread

Just wanted to make sure there wasn't any confusion :)

>                        |             drm_dp_mst_topology_mgr_set_mst()
>                     Finish                          |
>                                                     |
>                                ← Schedules          |
>                        -----------------------------|
>                        |                            |
>                     Thread 4                        |
>                        |                            |
>           drm_dp_mst_link_probe_work()  drm_dp_mst_wait_tx_reply() <-- these
> wait on eachother
>                        |                            |                     th
> is is normal
>                        |    Sideband transactions   |
>                        |    happen here, this is    |
>                        |    where timeouts happen   |
>                        |                            |
>                        |      5 seconds later...    | autosuspend delay
> kicks in
>                        |            ...             |                       
>  |
>                        |                            |                     Th
> read 5
>                        |  Communication +
> timeouts  |                 pm_runtime_work
>                        |  are still happening
> here  |                        |
>                        |                            |           nouveau_pmop
> s_runtime_suspend()
>                        |                            |                       
>  |
>                        |  Success! We can
> enable    |                        |
>                        |        displays
> now!       |            drm_kms_helper_poll_disable()
>                        |                            |                       
>  |
>                        |                 *Atomic commit
> begins*              |
>                        |                            |     <----------
> ---     |
>                        | More sideband transactions |       Waiting
> on       |
>                        |          ......            |  output_poll_execute()
> |
>                        |                            |     <----------
> ---     |
>                        |                  pm_runtime_get_sync()             
>  |
>                        |                            |                       
>  |
>                        |   -------------------->    |     -------------
> >     |
>                        |        Waiting on          |       Waiting
> on       |
>                        |     output_poll_exec()     |    suspend
> requests    |
>                        |   -------------------->    |     -------------
> >     |
>                        |                            |                       
>  |
>                        -----------------------------------------------------
> >|
>                                                                          DEA
> DLOCK
> 
> phew. that took a LONG time to come up with.
> 
> Anyway-that's why your explanation doesn't make sense: the deadlock is
> happening because we're calling pm_runtime_get_sync(). If we were to make
> that
> call conditional (e.g. drm_kms_helper_is_poll_worker()), all that would mean
> is that we wouldn't grab any runtime power reference and the GPU would
> immediately suspend once the atomic commit finished, as the suspend request
> in
> Thread 5 would finally get unblocked and thus----suspend.
> 
> Hopefully I explained that better this time, I'll definitely make sure to
> actually include that diagram in the patch. As for whether or not this patch
> is even the right solution, I will need to confirm that tommorrow (if you
> don't think it is still, please feel free to say so!) because it's getting
> late here.
> 
> Cheers!
> 	Lyude
> 
> > 
> > Thinking a bit more about this, our mistake is to acquire runtime PM
> > refs too far down in the call stack.  As a fix that can be backported
> > to stable, adding if (!drm_kms_helper_is_poll_worker()) conditionals
> > seems fine to me, but the long term fix is to push acquisition of refs
> > further up in the call stack.
> > 
> > E.g., if the user forces connector probing via sysfs, a runtime PM ref
> > should be acquired in status_store() in drm_sysfs.c before invoking
> > connector->funcs->fill_modes().  That way, if the user forces connector
> > probing while the GPU is powering down, rpm_resume() will correctly wait
> > for rpm_suspend() to finish before resuming the card.  So if we architect
> > it like this, we're actually using the functionality provided by the
> > PM core in the way that it's supposed to be used.
> > 
> > The problem is that adding pm_runtime_get_sync() to status_store()
> > conflicts with the desire to have a library of generic DRM functions:
> > Some GPUs may be able to probe connectors without resuming to runtime
> > active state, others don't use runtime PM at all.  One solution that
> > comes to mind is a driver_features flag which tells the DRM core whether
> > to acquire a runtime PM ref in various places.
> > 
> > In your original patches 4 and 5, what exactly was the call stack which
> > led to i2c being accessed while runtime suspended?  Was it sysfs access
> > via /sys/class/i2c-adapter/* ?  If so, acquisition of the runtime PM ref
> > needs to likewise happen in that sysfs entry point, rather than deep down
> > in the call stack upon accessing the i2c bus.
> > 
> > Thanks,
> > 
> > Lukas
> > 
> > > 
> > > Sample:
> > > [  246.669625] INFO: task kworker/4:0:37 blocked for more than 120
> > > seconds.
> > > [  246.673398]       Not tainted 4.18.0-rc5Lyude-Test+ #2
> > > [  246.675271] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables
> > > this message.
> > > [  246.676527] kworker/4:0     D    0    37      2 0x80000000
> > > [  246.677580] Workqueue: events output_poll_execute [drm_kms_helper]
> > > [  246.678704] Call Trace:
> > > [  246.679753]  __schedule+0x322/0xaf0
> > > [  246.680916]  schedule+0x33/0x90
> > > [  246.681924]  schedule_preempt_disabled+0x15/0x20
> > > [  246.683023]  __mutex_lock+0x569/0x9a0
> > > [  246.684035]  ? kobject_uevent_env+0x117/0x7b0
> > > [  246.685132]  ? drm_fb_helper_hotplug_event.part.28+0x20/0xb0
> > > [drm_kms_helper]
> > > [  246.686179]  mutex_lock_nested+0x1b/0x20
> > > [  246.687278]  ? mutex_lock_nested+0x1b/0x20
> > > [  246.688307]  drm_fb_helper_hotplug_event.part.28+0x20/0xb0
> > > [drm_kms_helper]
> > > [  246.689420]  drm_fb_helper_output_poll_changed+0x23/0x30
> > > [drm_kms_helper]
> > > [  246.690462]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
> > > [  246.691570]  output_poll_execute+0x198/0x1c0 [drm_kms_helper]
> > > [  246.692611]  process_one_work+0x231/0x620
> > > [  246.693725]  worker_thread+0x214/0x3a0
> > > [  246.694756]  kthread+0x12b/0x150
> > > [  246.695856]  ? wq_pool_ids_show+0x140/0x140
> > > [  246.696888]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [  246.697998]  ret_from_fork+0x3a/0x50
> > > [  246.699034] INFO: task kworker/0:1:60 blocked for more than 120
> > > seconds.
> > > [  246.700153]       Not tainted 4.18.0-rc5Lyude-Test+ #2
> > > [  246.701182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables
> > > this message.
> > > [  246.702278] kworker/0:1     D    0    60      2 0x80000000
> > > [  246.703293] Workqueue: pm pm_runtime_work
> > > [  246.704393] Call Trace:
> > > [  246.705403]  __schedule+0x322/0xaf0
> > > [  246.706439]  ? wait_for_completion+0x104/0x190
> > > [  246.707393]  schedule+0x33/0x90
> > > [  246.708375]  schedule_timeout+0x3a5/0x590
> > > [  246.709289]  ? mark_held_locks+0x58/0x80
> > > [  246.710208]  ? _raw_spin_unlock_irq+0x2c/0x40
> > > [  246.711222]  ? wait_for_completion+0x104/0x190
> > > [  246.712134]  ? trace_hardirqs_on_caller+0xf4/0x190
> > > [  246.713094]  ? wait_for_completion+0x104/0x190
> > > [  246.713964]  wait_for_completion+0x12c/0x190
> > > [  246.714895]  ? wake_up_q+0x80/0x80
> > > [  246.715727]  ? get_work_pool+0x90/0x90
> > > [  246.716649]  flush_work+0x1c9/0x280
> > > [  246.717483]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> > > [  246.718442]  __cancel_work_timer+0x146/0x1d0
> > > [  246.719247]  cancel_delayed_work_sync+0x13/0x20
> > > [  246.720043]  drm_kms_helper_poll_disable+0x1f/0x30 [drm_kms_helper]
> > > [  246.721123]  nouveau_pmops_runtime_suspend+0x3d/0xb0 [nouveau]
> > > [  246.721897]  pci_pm_runtime_suspend+0x6b/0x190
> > > [  246.722825]  ? pci_has_legacy_pm_support+0x70/0x70
> > > [  246.723737]  __rpm_callback+0x7a/0x1d0
> > > [  246.724721]  ? pci_has_legacy_pm_support+0x70/0x70
> > > [  246.725607]  rpm_callback+0x24/0x80
> > > [  246.726553]  ? pci_has_legacy_pm_support+0x70/0x70
> > > [  246.727376]  rpm_suspend+0x142/0x6b0
> > > [  246.728185]  pm_runtime_work+0x97/0xc0
> > > [  246.728938]  process_one_work+0x231/0x620
> > > [  246.729796]  worker_thread+0x44/0x3a0
> > > [  246.730614]  kthread+0x12b/0x150
> > > [  246.731395]  ? wq_pool_ids_show+0x140/0x140
> > > [  246.732202]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [  246.732878]  ret_from_fork+0x3a/0x50
> > > [  246.733768] INFO: task kworker/4:2:422 blocked for more than 120
> > > seconds.
> > > [  246.734587]       Not tainted 4.18.0-rc5Lyude-Test+ #2
> > > [  246.735393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables
> > > this message.
> > > [  246.736113] kworker/4:2     D    0   422      2 0x80000080
> > > [  246.736789] Workqueue: events_long drm_dp_mst_link_probe_work
> > > [drm_kms_helper]
> > > [  246.737665] Call Trace:
> > > [  246.738490]  __schedule+0x322/0xaf0
> > > [  246.739250]  schedule+0x33/0x90
> > > [  246.739908]  rpm_resume+0x19c/0x850
> > > [  246.740750]  ? finish_wait+0x90/0x90
> > > [  246.741541]  __pm_runtime_resume+0x4e/0x90
> > > [  246.742370]  nv50_disp_atomic_commit+0x31/0x210 [nouveau]
> > > [  246.743124]  drm_atomic_commit+0x4a/0x50 [drm]
> > > [  246.743775]  restore_fbdev_mode_atomic+0x1c8/0x240 [drm_kms_helper]
> > > [  246.744603]  restore_fbdev_mode+0x31/0x140 [drm_kms_helper]
> > > [  246.745373]  drm_fb_helper_restore_fbdev_mode_unlocked+0x54/0xb0
> > > [drm_kms_helper]
> > > [  246.746220]  drm_fb_helper_set_par+0x2d/0x50 [drm_kms_helper]
> > > [  246.746884]  drm_fb_helper_hotplug_event.part.28+0x96/0xb0
> > > [drm_kms_helper]
> > > [  246.747675]  drm_fb_helper_output_poll_changed+0x23/0x30
> > > [drm_kms_helper]
> > > [  246.748544]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
> > > [  246.749439]  nv50_mstm_hotplug+0x15/0x20 [nouveau]
> > > [  246.750111]  drm_dp_send_link_address+0x177/0x1c0 [drm_kms_helper]
> > > [  246.750764]  drm_dp_check_and_send_link_address+0xa8/0xd0
> > > [drm_kms_helper]
> > > [  246.751602]  drm_dp_mst_link_probe_work+0x51/0x90 [drm_kms_helper]
> > > [  246.752314]  process_one_work+0x231/0x620
> > > [  246.752979]  worker_thread+0x44/0x3a0
> > > [  246.753838]  kthread+0x12b/0x150
> > > [  246.754619]  ? wq_pool_ids_show+0x140/0x140
> > > [  246.755386]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [  246.756162]  ret_from_fork+0x3a/0x50
> > > [  246.756847]
> > >                Showing all locks held in the system:
> > > [  246.758261] 3 locks held by kworker/4:0/37:
> > > [  246.759016]  #0: 00000000f8df4d2d ((wq_completion)"events"){+.+.},
> > > at:
> > > process_one_work+0x1b3/0x620
> > > [  246.759856]  #1: 00000000e6065461 ((work_completion)(&(&dev-
> > > > mode_config.output_poll_work)->work)){+.+.}, at:
> > > 
> > > process_one_work+0x1b3/0x620
> > > [  246.760670]  #2: 00000000cb66735f (&helper->lock){+.+.}, at:
> > > drm_fb_helper_hotplug_event.part.28+0x20/0xb0 [drm_kms_helper]
> > > [  246.761516] 2 locks held by kworker/0:1/60:
> > > [  246.762274]  #0: 00000000fff6be0f ((wq_completion)"pm"){+.+.}, at:
> > > process_one_work+0x1b3/0x620
> > > [  246.762982]  #1: 000000005ab44fb4 ((work_completion)(&dev-
> > > > power.work)){+.+.}, at: process_one_work+0x1b3/0x620
> > > 
> > > [  246.763890] 1 lock held by khungtaskd/64:
> > > [  246.764664]  #0: 000000008cb8b5c3 (rcu_read_lock){....}, at:
> > > debug_show_all_locks+0x23/0x185
> > > [  246.765588] 5 locks held by kworker/4:2/422:
> > > [  246.766440]  #0: 00000000232f0959
> > > ((wq_completion)"events_long"){+.+.},
> > > at: process_one_work+0x1b3/0x620
> > > [  246.767390]  #1: 00000000bb59b134 ((work_completion)(&mgr-
> > > > work)){+.+.}, at: process_one_work+0x1b3/0x620
> > > 
> > > [  246.768154]  #2: 00000000cb66735f (&helper->lock){+.+.}, at:
> > > drm_fb_helper_restore_fbdev_mode_unlocked+0x4c/0xb0 [drm_kms_helper]
> > > [  246.768966]  #3: 000000004c8f0b6b (crtc_ww_class_acquire){+.+.}, at:
> > > restore_fbdev_mode_atomic+0x4b/0x240 [drm_kms_helper]
> > > [  246.769921]  #4: 000000004c34a296 (crtc_ww_class_mutex){+.+.}, at:
> > > drm_modeset_backoff+0x8a/0x1b0 [drm]
> > > [  246.770839] 1 lock held by dmesg/1038:
> > > [  246.771739] 2 locks held by zsh/1172:
> > > [  246.772650]  #0: 00000000836d0438 (&tty->ldisc_sem){++++}, at:
> > > ldsem_down_read+0x37/0x40
> > > [  246.773680]  #1: 000000001f4f4d48 (&ldata->atomic_read_lock){+.+.},
> > > at:
> > > n_tty_read+0xc1/0x870
> > > 
> > > [  246.775522] =============================================
> > > 
> > > So, to fix this (and any other possible deadlock issues like this that
> > > could occur in the output_poll_changed patch) we make sure that
> > > nouveau's output_poll_changed functions grab a runtime power ref before
> > > sending any hotplug events, and hold it until we're finished. We
> > > introduce this through adding a generic DRM helper which other drivers
> > > may reuse.
> > > 
> > > This fixes deadlock issues when in fbcon with nouveau on my P50, and
> > > should fix it for everyone else's as well!
> > > 
> > > Signed-off-by: Lyude Paul <lyude@redhat.com>
> > > Reviewed-by: Karol Herbst <karolherbst@gmail.com>
> > > Cc: Lukas Wunner <lukas@wunner.de>
> > > Cc: stable@vger.kernel.org
> > > ---
> > > Changes since v1:
> > >  - Add a generic helper for DRM to handle this
> > > 
> > >  drivers/gpu/drm/drm_fb_helper.c         | 23 +++++++++++++++++++++++
> > >  drivers/gpu/drm/nouveau/dispnv50/disp.c |  2 +-
> > >  include/drm/drm_fb_helper.h             |  5 +++++
> > >  3 files changed, 29 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/drm_fb_helper.c
> > > b/drivers/gpu/drm/drm_fb_helper.c
> > > index 2ee1eaa66188..1ab2f3646526 100644
> > > --- a/drivers/gpu/drm/drm_fb_helper.c
> > > +++ b/drivers/gpu/drm/drm_fb_helper.c
> > > @@ -34,6 +34,7 @@
> > >  #include <linux/sysrq.h>
> > >  #include <linux/slab.h>
> > >  #include <linux/module.h>
> > > +#include <linux/pm_runtime.h>
> > >  #include <drm/drmP.h>
> > >  #include <drm/drm_crtc.h>
> > >  #include <drm/drm_fb_helper.h>
> > > @@ -2928,6 +2929,28 @@ void drm_fb_helper_output_poll_changed(struct
> > > drm_device *dev)
> > >  }
> > >  EXPORT_SYMBOL(drm_fb_helper_output_poll_changed);
> > >  
> > > +/**
> > > + * drm_fb_helper_output_poll_changed_with_rpm - DRM mode
> > > + *                                              config
> > > \.output_poll_changed
> > > + *                                              helper for fbdev
> > > emulation
> > > + * @dev: DRM device
> > > + *
> > > + * Same as drm_fb_helper_output_poll_changed, except that it makes sure
> > > that
> > > + * the device is active by synchronously grabbing a runtime power
> > > reference
> > > + * while probing.
> > > + */
> > > +void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device *dev)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = pm_runtime_get_sync(dev->dev);
> > > +	if (WARN_ON(ret < 0 && ret != -EACCES))
> > > +		return;
> > > +	drm_fb_helper_hotplug_event(dev->fb_helper);
> > > +	pm_runtime_put(dev->dev);
> > > +}
> > > +EXPORT_SYMBOL(drm_fb_helper_output_poll_changed_with_rpm);
> > > +
> > >  /* The Kconfig DRM_KMS_HELPER selects FRAMEBUFFER_CONSOLE (if !EXPERT)
> > >   * but the module doesn't depend on any fb console symbols.  At least
> > >   * attempt to load fbcon to avoid leaving the system without a usable
> > > console.
> > > diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c
> > > b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> > > index eb3e41a78806..fa3ab618a0f9 100644
> > > --- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
> > > +++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
> > > @@ -2015,7 +2015,7 @@ nv50_disp_atomic_state_alloc(struct drm_device
> > > *dev)
> > >  static const struct drm_mode_config_funcs
> > >  nv50_disp_func = {
> > >  	.fb_create = nouveau_user_framebuffer_create,
> > > -	.output_poll_changed = drm_fb_helper_output_poll_changed,
> > > +	.output_poll_changed =
> > > drm_fb_helper_output_poll_changed_with_rpm,
> > >  	.atomic_check = nv50_disp_atomic_check,
> > >  	.atomic_commit = nv50_disp_atomic_commit,
> > >  	.atomic_state_alloc = nv50_disp_atomic_state_alloc,
> > > diff --git a/include/drm/drm_fb_helper.h b/include/drm/drm_fb_helper.h
> > > index b069433e7fc1..ca809bfbaebb 100644
> > > --- a/include/drm/drm_fb_helper.h
> > > +++ b/include/drm/drm_fb_helper.h
> > > @@ -330,6 +330,7 @@ void drm_fb_helper_fbdev_teardown(struct drm_device
> > > *dev);
> > >  
> > >  void drm_fb_helper_lastclose(struct drm_device *dev);
> > >  void drm_fb_helper_output_poll_changed(struct drm_device *dev);
> > > +void drm_fb_helper_output_poll_changed_with_rpm(struct drm_device
> > > *dev);
> > >  #else
> > >  static inline void drm_fb_helper_prepare(struct drm_device *dev,
> > >  					struct drm_fb_helper *helper,
> > > @@ -564,6 +565,10 @@ static inline void
> > > drm_fb_helper_output_poll_changed(struct drm_device *dev)
> > >  {
> > >  }
> > >  
> > > +static inline void drm_fb_helper_output_poll_changed_with_rpm(struct
> > > drm_device *dev)
> > > +{
> > > +}
> > > +
> > >  #endif
> > >  
> > >  static inline int
> > > -- 
> > > 2.17.1
> > > 
-- 
Cheers,
	Lyude Paul

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm()
  2018-07-20  0:08     ` Lyude Paul
  2018-07-20  0:17       ` Lyude Paul
@ 2018-07-21  9:39       ` Lukas Wunner
  2018-07-23 17:50         ` Lyude Paul
  1 sibling, 1 reply; 10+ messages in thread
From: Lukas Wunner @ 2018-07-21  9:39 UTC (permalink / raw)
  To: Lyude Paul
  Cc: nouveau, Gustavo Padovan, Maarten Lankhorst, Sean Paul,
	David Airlie, Ben Skeggs, Daniel Vetter, Ville Syrjälä,
	dri-devel, linux-kernel

On Thu, Jul 19, 2018 at 08:08:15PM -0400, Lyude Paul wrote:
> On Thu, 2018-07-19 at 09:49 +0200, Lukas Wunner wrote:
> > On Wed, Jul 18, 2018 at 04:56:39PM -0400, Lyude Paul wrote:
> > > When DP MST hubs get confused, they can occasionally stop responding for
> > > a good bit of time up until the point where the DRM driver manages to
> > > do the right DPCD accesses to get it to start responding again. In a
> > > worst case scenario however, this process can take upwards of 10+
> > > seconds.
> > > 
> > > Currently we use the default output_poll_changed handler
> > > drm_fb_helper_output_poll_changed() to handle output polling, which
> > > doesn't happen to grab any power references on the device when polling.
> > > If we're unlucky enough to have a hub (such as Lenovo's infamous laptop
> > > docks for the P5x/P7x series) that's easily startled and confused, this
> > > can lead to a pretty nasty deadlock situation that looks like this:
> > > 
> > > - Hotplug event from hub happens, we enter
> > >   drm_fb_helper_output_poll_changed() and start communicating with the
> > >   hub
> > > - While we're in drm_fb_helper_output_poll_changed() and attempting to
> > >   communicate with the hub, we end up confusing it and cause it to stop
> > >   responding for at least 10 seconds
> > > - After 5 seconds of being in drm_fb_helper_output_poll_changed(), the
> > >   pm core attempts to put the GPU into autosuspend, which ends up
> > >   calling drm_kms_helper_poll_disable()
> > > - While the runtime PM core is waiting in drm_kms_helper_poll_disable()
> > >   for the output poll to finish, we end up finally detecting an MST
> > >   display
> > > - We notice the new display and tries to enable it, which triggers
> > >   an atomic commit which triggers a call to pm_runtime_get_sync()
> > > - the output poll thread deadlocks the pm core waiting for the pm core
> > >   to finish the autosuspend request while the pm core waits for the
> > >   output poll thread to finish
> > 
> > The correct fix is to call pm_runtime_get_sync() *conditionally* in
> > the atomic commit which enables the display, using the same conditional
> > as d61a5c106351, i.e. if (!drm_kms_helper_is_poll_worker()).

First of all, I was mistaken when I wrote above that a check for
!drm_kms_helper_is_poll_worker() would solve the problem.  Sorry!
It doesn't because the call to pm_runtime_get_sync() is not happening
in output_poll_execute() but in drm_dp_mst_link_probe_work().

Looking once more at the three stack traces you've provided, we've got:
- output_poll_execute() stuck waiting for fb_helper->lock
  which is held by drm_dp_mst_link_probe_work()
- rpm_suspend() stuck waiting for output_poll_execute() to finish
- drm_dp_mst_link_probe_work() stuck waiting in rpm_resume()

For the moment we can ignore the first task, i.e. output_poll_execute(),
and focus on the latter two.

As said I'm unfamiliar with MST but browsing through drm_dp_mst_topology.c
I notice that drm_dp_mst_link_probe_work() is the ->work element in
drm_dp_mst_topology_mgr() and is queued on HPD.  I further notice that
the work item is flushed on ->runtime_suspend:

nouveau_pmops_runtime_suspend()
  nouveau_do_suspend()
    nouveau_display_suspend()
      nouveau_display_fini()
        disp->fini() == nv50_display_fini()
	  nv50_mstm_fini()
	    drm_dp_mst_topology_mgr_suspend()
	      flush_work(&mgr->work);

And before the work item is flushed, the HPD source is quiesced.

So it looks like drm_dp_mst_link_probe_work() can only ever run
while the GPU is runtime resumed, it never runs while the GPU is
runtime suspended.  This means that you don't have to acquire any
runtime PM references in or below drm_dp_mst_link_probe_work().
Au contraire, you must not acquire any because it will deadlock while
the GPU is runtime suspending.  If there are functions which are
called from drm_dp_mst_link_probe_work() as well as from other contexts,
and those other contexts need a runtime PM ref to be acquired,
you need to acquire the runtime PM ref conditionally on not being
drm_dp_mst_link_probe_work() (using the current_work() technique).

Alternatively, move acquisition of the runtime PM ref further up in
the call chain to those other contexts.


> Anyway-that's why your explanation doesn't make sense: the deadlock is
> happening because we're calling pm_runtime_get_sync(). If we were to
> make that call conditional (e.g. drm_kms_helper_is_poll_worker()),
> all that would mean is that we wouldn't grab any runtime power reference
> and the GPU would immediately suspend once the atomic commit finished,
> as the suspend request in Thread 5 would finally get unblocked and thus
> ----suspend.

Right, that seems to be a bug nouveau_pmops_runtime_suspend():

If a display is plugged in while the GPU is about to runtime suspend,
the display may be lit up by output_poll_execute() but the GPU will
then nevertheless be powered off.

I guess after calling drm_kms_helper_poll_disable() we should re-check
if a crtc has been activated.  This should have bumped the runtime PM
refcount and have_disp_power_ref should be true.  In that case, the
nouveau_pmops_runtime_suspend() should return -EBUSY to abort the
runtime_suspend.

The same check seems necessary after flushing drm_dp_mst_link_probe_work():
If the work item lit up a new display, all previous suspend steps need
to be unwound and -EBUSY needs to be returned to the PM core.

Communication with an MST hub exceeding the autosuspend timeout is
just one scenario where this bug manifests itself.

BTW, drm_kms_helper_poll_disable() seems to be called twice in the
runtime_suspend code path, once in nouveau_pmops_runtime_suspend()
and a second time in nouveau_display_fini().

A stupid question, I notice that nv50_display_fini() calls nv50_mstm_fini()
only if encoder_type != DRM_MODE_ENCODER_DPMST.  Why isn't that == ?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm()
  2018-07-21  9:39       ` Lukas Wunner
@ 2018-07-23 17:50         ` Lyude Paul
  0 siblings, 0 replies; 10+ messages in thread
From: Lyude Paul @ 2018-07-23 17:50 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: nouveau, Gustavo Padovan, Maarten Lankhorst, Sean Paul,
	David Airlie, Ben Skeggs, Daniel Vetter, Ville Syrjälä,
	dri-devel, linux-kernel

On Sat, 2018-07-21 at 11:39 +0200, Lukas Wunner wrote:
> On Thu, Jul 19, 2018 at 08:08:15PM -0400, Lyude Paul wrote:
> > On Thu, 2018-07-19 at 09:49 +0200, Lukas Wunner wrote:
> > > On Wed, Jul 18, 2018 at 04:56:39PM -0400, Lyude Paul wrote:
> > > > When DP MST hubs get confused, they can occasionally stop responding
> > > > for
> > > > a good bit of time up until the point where the DRM driver manages to
> > > > do the right DPCD accesses to get it to start responding again. In a
> > > > worst case scenario however, this process can take upwards of 10+
> > > > seconds.
> > > > 
> > > > Currently we use the default output_poll_changed handler
> > > > drm_fb_helper_output_poll_changed() to handle output polling, which
> > > > doesn't happen to grab any power references on the device when
> > > > polling.
> > > > If we're unlucky enough to have a hub (such as Lenovo's infamous
> > > > laptop
> > > > docks for the P5x/P7x series) that's easily startled and confused,
> > > > this
> > > > can lead to a pretty nasty deadlock situation that looks like this:
> > > > 
> > > > - Hotplug event from hub happens, we enter
> > > >   drm_fb_helper_output_poll_changed() and start communicating with the
> > > >   hub
> > > > - While we're in drm_fb_helper_output_poll_changed() and attempting to
> > > >   communicate with the hub, we end up confusing it and cause it to
> > > > stop
> > > >   responding for at least 10 seconds
> > > > - After 5 seconds of being in drm_fb_helper_output_poll_changed(), the
> > > >   pm core attempts to put the GPU into autosuspend, which ends up
> > > >   calling drm_kms_helper_poll_disable()
> > > > - While the runtime PM core is waiting in
> > > > drm_kms_helper_poll_disable()
> > > >   for the output poll to finish, we end up finally detecting an MST
> > > >   display
> > > > - We notice the new display and tries to enable it, which triggers
> > > >   an atomic commit which triggers a call to pm_runtime_get_sync()
> > > > - the output poll thread deadlocks the pm core waiting for the pm core
> > > >   to finish the autosuspend request while the pm core waits for the
> > > >   output poll thread to finish
> > > 
> > > The correct fix is to call pm_runtime_get_sync() *conditionally* in
> > > the atomic commit which enables the display, using the same conditional
> > > as d61a5c106351, i.e. if (!drm_kms_helper_is_poll_worker()).
> 
> First of all, I was mistaken when I wrote above that a check for
> !drm_kms_helper_is_poll_worker() would solve the problem.  Sorry!
> It doesn't because the call to pm_runtime_get_sync() is not happening
> in output_poll_execute() but in drm_dp_mst_link_probe_work().
> 
> Looking once more at the three stack traces you've provided, we've got:
> - output_poll_execute() stuck waiting for fb_helper->lock
>   which is held by drm_dp_mst_link_probe_work()
> - rpm_suspend() stuck waiting for output_poll_execute() to finish
> - drm_dp_mst_link_probe_work() stuck waiting in rpm_resume()
> 
> For the moment we can ignore the first task, i.e. output_poll_execute(),
> and focus on the latter two.
> 
> As said I'm unfamiliar with MST but browsing through drm_dp_mst_topology.c
> I notice that drm_dp_mst_link_probe_work() is the ->work element in
> drm_dp_mst_topology_mgr() and is queued on HPD.  I further notice that
> the work item is flushed on ->runtime_suspend:
> 
> nouveau_pmops_runtime_suspend()
>   nouveau_do_suspend()
>     nouveau_display_suspend()
>       nouveau_display_fini()
>         disp->fini() == nv50_display_fini()
> 	  nv50_mstm_fini()
> 	    drm_dp_mst_topology_mgr_suspend()
> 	      flush_work(&mgr->work);
> 
> And before the work item is flushed, the HPD source is quiesced.
> 
> So it looks like drm_dp_mst_link_probe_work() can only ever run
> while the GPU is runtime resumed, it never runs while the GPU is
> runtime suspended.  This means that you don't have to acquire any
> runtime PM references in or below drm_dp_mst_link_probe_work().
> Au contraire, you must not acquire any because it will deadlock while
> the GPU is runtime suspending.  If there are functions which are
> called from drm_dp_mst_link_probe_work() as well as from other contexts,
> and those other contexts need a runtime PM ref to be acquired,
> you need to acquire the runtime PM ref conditionally on not being
> drm_dp_mst_link_probe_work() (using the current_work() technique).
> 
> Alternatively, move acquisition of the runtime PM ref further up in
> the call chain to those other contexts.
> 
> 
> > Anyway-that's why your explanation doesn't make sense: the deadlock is
> > happening because we're calling pm_runtime_get_sync(). If we were to
> > make that call conditional (e.g. drm_kms_helper_is_poll_worker()),
> > all that would mean is that we wouldn't grab any runtime power reference
> > and the GPU would immediately suspend once the atomic commit finished,
> > as the suspend request in Thread 5 would finally get unblocked and thus
> > ----suspend.
> 
> Right, that seems to be a bug nouveau_pmops_runtime_suspend():
> 
> If a display is plugged in while the GPU is about to runtime suspend,
> the display may be lit up by output_poll_execute() but the GPU will
> then nevertheless be powered off.
> 
> I guess after calling drm_kms_helper_poll_disable() we should re-check
> if a crtc has been activated.  This should have bumped the runtime PM
> refcount and have_disp_power_ref should be true.  In that case, the
> nouveau_pmops_runtime_suspend() should return -EBUSY to abort the
> runtime_suspend.
> 
> The same check seems necessary after flushing drm_dp_mst_link_probe_work():
> If the work item lit up a new display, all previous suspend steps need
> to be unwound and -EBUSY needs to be returned to the PM core.
> 
> Communication with an MST hub exceeding the autosuspend timeout is
> just one scenario where this bug manifests itself.
> 
> BTW, drm_kms_helper_poll_disable() seems to be called twice in the
> runtime_suspend code path, once in nouveau_pmops_runtime_suspend()
> and a second time in nouveau_display_fini().
> 
> A stupid question, I notice that nv50_display_fini() calls nv50_mstm_fini()
> only if encoder_type != DRM_MODE_ENCODER_DPMST.  Why isn't that == ?
Because there's a difference between DPMST connectors and encoders vs. the
rest of the device's encoders. Every DP MST topology will take up a single
"physical" DP connector on the device that will be marked as disconnected.
This connector also owns the "mstm" (MST manager, referred to as the
drm_dp_mst_topology_mgr in DRM), which through the callbacks nouveau provides
is responsible for creating the fake DP MST ports and encoders. All of these
fake ports will have DPMST encoders as opposed to the physical DP ports, which
will have TMDS encoders. Hence-mstms are only on physical connectors with
TMDS, not fake connectors with DPMST.


> 
> Thanks,
> 
> Lukas
-- 
Cheers,
	Lyude Paul


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] Fix connector probing deadlocks from RPM bugs
  2018-07-18 20:56 [PATCH 0/2] Fix connector probing deadlocks from RPM bugs Lyude Paul
  2018-07-18 20:56 ` [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm() Lyude Paul
  2018-07-18 20:56 ` [PATCH 2/2] drm/probe_helper: Add drm_helper_probe_single_connector_modes_with_rpm() Lyude Paul
@ 2018-08-06  8:31 ` Daniel Vetter
  2 siblings, 0 replies; 10+ messages in thread
From: Daniel Vetter @ 2018-08-06  8:31 UTC (permalink / raw)
  To: Lyude Paul
  Cc: nouveau, David Airlie, Daniel Vetter, Sean Paul,
	Maarten Lankhorst, linux-kernel, dri-devel, Ben Skeggs,
	Gustavo Padovan, Ville Syrjälä

On Wed, Jul 18, 2018 at 04:56:38PM -0400, Lyude Paul wrote:
> This is a trimmed down version of
> 
> https://patchwork.freedesktop.org/series/46637/
> 
> with all of the review comments addressed.
> 
> The last version of this series had fixes for the i2c and DP aux busses
> to ensure that the GPU would be turned on whenever anything tried to
> access the i2c/aux busses. Unfortunately: one of the fixes apparently
> contained some very incorrect usage of Linux's runtime PM interface in
> order to prevent runtime suspend/resume requests coming from within
> nouveau's suspend/resume callbacks from deadlocking the system.
> 
> Turns out: fixing the i2c/dp aux problem is a lot harder then I
> originally anticipated. We either need to come up with helpers for DRM
> that work around the PM core, which is really not ideal, or come up with
> a new feature for the RPM core that allows us to inhibit suspend/resume
> callbacks. The former seems to be somewhat trivial to implement, but
> coming up with a decent interface for that is going to take a bit more
> time and I'd much rather have issues causing deadlocks get fixed first.
> This means that drm_dp_aux_dev is going to be broken on nouveau for
> laptops with hybrid GPUs using RPM, but it was already broken before
> anyhow.
> 
> So: this just contains the seriously important fixes that will stop
> people's machines from crashing very hard. Hopefully I can come up with
> a solution to the i2c/aux problem soon after fixing some other glaring
> MST bugs nouveau currently has.

Why do we need all this? i915 has full rpm support, including i2c and dp
aux correctly working, and everything else too. Why is nouveau special?

Also, there's a metric pile of other drivers using the existing helpers an
infrastructure, with full rpm support, all seemingly being happy too.

Given all that it smells a bit like nouveau has it's rpm design backwards,
which would indicate that these new helpers should be nouveau-specific
wrappers and not generic. If that's not the case, then I'd like to first
understand why nouveau needs them and why no one else seems to need these.
-Daniel

> 
> Lyude Paul (2):
>   drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm()
>   drm/probe_helper: Add
>     drm_helper_probe_single_connector_modes_with_rpm()
> 
>  drivers/gpu/drm/drm_fb_helper.c             | 23 +++++++++++++++
>  drivers/gpu/drm/drm_probe_helper.c          | 31 +++++++++++++++++++++
>  drivers/gpu/drm/nouveau/dispnv50/disp.c     |  4 +--
>  drivers/gpu/drm/nouveau/nouveau_connector.c |  4 +--
>  include/drm/drm_crtc_helper.h               |  7 +++--
>  include/drm/drm_fb_helper.h                 |  5 ++++
>  6 files changed, 67 insertions(+), 7 deletions(-)
> 
> -- 
> 2.17.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-08-06  8:31 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-18 20:56 [PATCH 0/2] Fix connector probing deadlocks from RPM bugs Lyude Paul
2018-07-18 20:56 ` [PATCH 1/2] drm/fb_helper: Add drm_fb_helper_output_poll_changed_with_rpm() Lyude Paul
2018-07-19  7:49   ` Lukas Wunner
2018-07-20  0:08     ` Lyude Paul
2018-07-20  0:17       ` Lyude Paul
2018-07-21  9:39       ` Lukas Wunner
2018-07-23 17:50         ` Lyude Paul
2018-07-18 20:56 ` [PATCH 2/2] drm/probe_helper: Add drm_helper_probe_single_connector_modes_with_rpm() Lyude Paul
2018-07-19  8:08   ` Lukas Wunner
2018-08-06  8:31 ` [PATCH 0/2] Fix connector probing deadlocks from RPM bugs Daniel Vetter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).