Re: [RESEND PATCH v2 1/2] drm/dp/mst: Reprobe EDID for MST ports on resume

From: Lyude Paul <lyude@redhat.com>
To: Juston Li <juston.li@intel.com>,
	intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: clinton.a.taylor@intel.com, nathan.d.ciobanu@intel.com,
	mario.limonciello@dell.com, jared_dominguez@dell.com,
	linux-kernel@vger.kernel.org, Lyude <cpaul@redhat.com>,
	stable@vger.kernel.org
Subject: Re: [RESEND PATCH v2 1/2] drm/dp/mst: Reprobe EDID for MST ports on resume
Date: Wed, 24 Oct 2018 18:09:54 -0400	[thread overview]
Message-ID: <f9b43dde4868d3c079b652b48babde9b8bebc728.camel@redhat.com> (raw)
In-Reply-To: <20181024021925.27026-2-juston.li@intel.com>

Thought this was going to be an easy review until I realized that there's
multiple problems in nouveau this would cause issues with, even if we didn't
pay attention to the -EINVAL that gets returned. The suspend/resume order in
nouveau needs some fixing up to prevent this patch from causing timeouts, and
then also there's a hidden surprise

[  972.294437] ============================================
[  972.295225] WARNING: possible recursive locking detected
[  972.296042] 4.19.0-rc8mst-reprobe+ #2 Not tainted
[  972.296842] --------------------------------------------
[  972.297614] zsh/6840 is trying to acquire lock:
[  972.298441] 0000000072e521f7 (&dev->mode_config.mutex){+.+.}, at:
drm_dp_mst_edids_changed+0xc0/0x180 [drm_kms_helper]
[  972.299299] 
               but task is already holding lock:
[  972.300991] 0000000072e521f7 (&dev->mode_config.mutex){+.+.}, at:
status_store+0x34/0x180 [drm]
[  972.301875] 
               other info that might help us debug this:
[  972.303563]  Possible unsafe locking scenario:

[  972.305315]        CPU0
[  972.306182]        ----
[  972.307061]   lock(&dev->mode_config.mutex);
[  972.307943]   lock(&dev->mode_config.mutex);
[  972.308819] 
                *** DEADLOCK ***

[  972.311446]  May be due to missing lock nesting notation

[  972.313234] 6 locks held by zsh/6840:
[  972.314135]  #0: 0000000092b5ba9d (sb_writers#4){.+.+}, at:
vfs_write+0x13e/0x1a0
[  972.315049]  #1: 00000000e628e6e9 (&of->mutex){+.+.}, at:
kernfs_fop_write+0xbd/0x1a0
[  972.315980]  #2: 000000000fb65e6c (kn->count#240){.+.+}, at:
kernfs_fop_write+0xc5/0x1a0
[  972.316928]  #3: 0000000072e521f7 (&dev->mode_config.mutex){+.+.}, at:
status_store+0x34/0x180 [drm]
[  972.317892]  #4: 00000000344224c2 (crtc_ww_class_acquire){+.+.}, at:
drm_helper_probe_single_connector_modes+0x66/0x6c0 [drm_kms_helper]
[  972.318863]  #5: 00000000d65352e2 (crtc_ww_class_mutex){+.+.}, at:
drm_modeset_lock+0x44/0x110 [drm]
[  972.319860] 
               stack backtrace:
[  972.321800] CPU: 5 PID: 6840 Comm: zsh Kdump: loaded Not tainted 4.19.0-
rc8mst-reprobe+ #2
[  972.322772] Hardware name: LENOVO 20EQZ1J7US/20EQZ1J7US, BIOS N1EET80W
(1.53 ) 09/14/2018
[  972.323792] Call Trace:
[  972.324821]  dump_stack+0x85/0xc0
[  972.325832]  __lock_acquire.cold.59+0x158/0x227
[  972.326887]  ? retint_kernel+0x10/0x10
[  972.327918]  lock_acquire+0x9e/0x170
[  972.328948]  ? drm_dp_mst_edids_changed+0xc0/0x180 [drm_kms_helper]
[  972.329986]  ? drm_dp_mst_edids_changed+0xc0/0x180 [drm_kms_helper]
[  972.331066]  __mutex_lock+0x68/0x900
[  972.332094]  ? drm_dp_mst_edids_changed+0xc0/0x180 [drm_kms_helper]
[  972.333146]  ? __mutex_unlock_slowpath+0x4b/0x2b0
[  972.334221]  ? drm_dp_mst_edids_changed+0xc0/0x180 [drm_kms_helper]
[  972.335255]  drm_dp_mst_edids_changed+0xc0/0x180 [drm_kms_helper]
[  972.336318]  drm_dp_mst_topology_mgr_resume+0x104/0x190 [drm_kms_helper]
[  972.337397]  nv50_display_init+0x73/0xd0 [nouveau]
[  972.338483]  nouveau_display_init+0x8e/0xe0 [nouveau]
[  972.339547]  nouveau_display_resume+0x39/0x250 [nouveau]
[  972.340634]  ? pci_restore_standard_config+0x40/0x40
[  972.341762]  nouveau_do_resume+0x79/0x150 [nouveau]
[  972.342850]  nouveau_pmops_runtime_resume+0x88/0x150 [nouveau]
[  972.343915]  pci_pm_runtime_resume+0x78/0xb0
[  972.345004]  __rpm_callback+0x75/0x1b0
[  972.346084]  ? pci_restore_standard_config+0x40/0x40
[  972.347148]  rpm_callback+0x1f/0x70
[  972.348256]  ? pci_restore_standard_config+0x40/0x40
[  972.349351]  rpm_resume+0x5d4/0x810
[  972.350446]  ? drm_modeset_lock+0x44/0x110 [drm]
[  972.351573]  __pm_runtime_resume+0x47/0x70
[  972.352737]  nouveau_connector_detect+0x373/0x470 [nouveau]
[  972.353841]  ? _cond_resched+0x15/0x30
[  972.354945]  ? ww_mutex_lock+0x30/0x60
[  972.356044]  ? drm_modeset_lock+0x44/0x110 [drm]
[  972.357146]  drm_helper_probe_single_connector_modes+0xd9/0x6c0
[drm_kms_helper]
[  972.358274]  ? printk+0x58/0x6f
[  972.359400]  status_store+0xb2/0x180 [drm]
[  972.360519]  kernfs_fop_write+0xf0/0x1a0
[  972.361711]  __vfs_write+0x36/0x180
[  972.362842]  ? rcu_read_lock_sched_held+0x55/0x60
[  972.363974]  ? rcu_sync_lockdep_assert+0x2e/0x60
[  972.365103]  ? __sb_start_write+0x115/0x170
[  972.366241]  ? vfs_write+0x13e/0x1a0
[  972.367365]  vfs_write+0xa5/0x1a0
[  972.368486]  ksys_write+0x52/0xc0
[  972.369596]  do_syscall_64+0x60/0x1a0
[  972.370782]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  972.371890] RIP: 0033:0x7fc93e169ef4
[  972.373012] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00
00 66 90 48 8d 05 f1 3a 2d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d
00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
[  972.374214] RSP: 002b:00007ffcd7ad1918 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[  972.375421] RAX: ffffffffffffffda RBX: 0000000000000007 RCX:
00007fc93e169ef4
[  972.376628] RDX: 0000000000000007 RSI: 000055aefe5c30c0 RDI:
0000000000000001
[  972.377872] RBP: 000055aefe5c30c0 R08: 00007fc93e43a8c0 R09:
00007fc93f6cdb80
[  972.379082] R10: 000000000000000a R11: 0000000000000246 R12:
00007fc93e439760
[  972.380311] R13: 0000000000000007 R14: 00007fc93e434760 R15:
0000000000000007

I must not have noticed that back when I wrote these patches! Whoops.

Since there's going to be quite a number of changes I need to make to this I'm
just going to make the changes myself! I'll make sure to Cc you with the
respin

On Tue, 2018-10-23 at 19:19 -0700, Juston Li wrote:
> From: Lyude <cpaul@redhat.com>
> 
> As observed with the latest ThinkPad docks, we unfortunately can't rely
> on docks keeping us updated with hotplug events that happened while we
> were suspended. On top of that, even if the number of connectors remains
> the same between suspend and resume it's still not safe to assume that
> there were no hotplugs, since a different monitor might have been
> plugged into a port another monitor previously occupied. As such, we
> need to go through all of the MST ports and check whether or not their
> EDIDs have changed.
> 
> In addition to that, we also now return -EINVAL from
> drm_dp_mst_topology_mgr_resume to indicate to callers that they need to
> reset the MST connection, and that they can't rely on any other method
> of reprobing.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Lyude <cpaul@redhat.com>
> Signed-off-by: Juston Li <juston.li@intel.com>
> ---
>  drivers/gpu/drm/drm_dp_mst_topology.c | 94 ++++++++++++++++++++++++++-
>  1 file changed, 93 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c
> b/drivers/gpu/drm/drm_dp_mst_topology.c
> index 5ff1d79b86c4..88abebe52021 100644
> --- a/drivers/gpu/drm/drm_dp_mst_topology.c
> +++ b/drivers/gpu/drm/drm_dp_mst_topology.c
> @@ -29,6 +29,7 @@
>  #include <linux/i2c.h>
>  #include <drm/drm_dp_mst_helper.h>
>  #include <drm/drmP.h>
> +#include <drm/drm_edid.h>
>  
>  #include <drm/drm_fixed.h>
>  #include <drm/drm_atomic.h>
> @@ -2201,6 +2202,64 @@ void drm_dp_mst_topology_mgr_suspend(struct
> drm_dp_mst_topology_mgr *mgr)
>  }
>  EXPORT_SYMBOL(drm_dp_mst_topology_mgr_suspend);
>  
> +static bool drm_dp_mst_edids_changed(struct drm_dp_mst_topology_mgr *mgr,
> +				     struct drm_dp_mst_port *port)
> +{
> +	struct drm_device *dev;
> +	struct drm_connector *connector;
> +	struct drm_dp_mst_port *dport;
> +	struct drm_dp_mst_branch *mstb;
> +	struct edid *current_edid, *cached_edid;
> +	bool ret = false;
> +
> +	port = drm_dp_get_validated_port_ref(mgr, port);
> +	if (!port)
> +		return false;
> +
> +	mstb = drm_dp_get_validated_mstb_ref(mgr, port->mstb);
> +	if (mstb) {
> +		list_for_each_entry(dport, &port->mstb->ports, next) {
> +			ret = drm_dp_mst_edids_changed(mgr, dport);
> +			if (ret)
> +				break;
> +		}
> +
> +		drm_dp_put_mst_branch_device(mstb);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	connector = port->connector;
> +	if (!connector || !port->aux.ddc.algo)
> +		goto out;
> +
> +	dev = connector->dev;
> +	mutex_lock(&dev->mode_config.mutex);
> +
> +	current_edid = drm_get_edid(connector, &port->aux.ddc);
> +	if (connector->edid_blob_ptr)
> +		cached_edid = (void *)connector->edid_blob_ptr->data;
> +	else
> +		return false;
> +
> +	if ((current_edid && cached_edid && memcmp(current_edid, cached_edid,
> +						   sizeof(struct edid)) != 0)
> ||
> +	    (!current_edid && cached_edid) || (current_edid && !cached_edid))
> {
> +		ret = true;
> +		DRM_DEBUG_KMS("EDID on %s changed, reprobing connectors\n",
> +			      connector->name);
> +	}
> +
> +	mutex_unlock(&dev->mode_config.mutex);
> +
> +	kfree(current_edid);
> +
> +out:
> +	drm_dp_put_port(port);
> +
> +	return ret;
> +}
> +
>  /**
>   * drm_dp_mst_topology_mgr_resume() - resume the MST manager
>   * @mgr: manager to resume
> @@ -2210,9 +2269,15 @@ EXPORT_SYMBOL(drm_dp_mst_topology_mgr_suspend);
>   *
>   * if the device fails this returns -1, and the driver should do
>   * a full MST reprobe, in case we were undocked.
> + *
> + * if the device can no longer be trusted, this returns -EINVAL
> + * and the driver should unconditionally disconnect and reconnect
> + * the dock.
>   */
>  int drm_dp_mst_topology_mgr_resume(struct drm_dp_mst_topology_mgr *mgr)
>  {
> +	struct drm_dp_mst_branch *mstb;
> +	struct drm_dp_mst_port *port;
>  	int ret = 0;
>  
>  	mutex_lock(&mgr->lock);
> @@ -2246,8 +2311,35 @@ int drm_dp_mst_topology_mgr_resume(struct
> drm_dp_mst_topology_mgr *mgr)
>  		drm_dp_check_mstb_guid(mgr->mst_primary, guid);
>  
>  		ret = 0;
> -	} else
> +
> +		/*
> +		 * Some hubs also forget to notify us of hotplugs that
> happened
> +		 * while we were in suspend, so we need to verify that the
> edid
> +		 * hasn't changed for any of the connectors. If it has been,
> +		 * we unfortunately can't rely on the dock updating us with
> +		 * hotplug events, so indicate we need a full reconnect.
> +		 */
> +
> +		/* MST's I2C helpers can't be used while holding this lock */
> +		mutex_unlock(&mgr->lock);
> +
> +		mstb = drm_dp_get_validated_mstb_ref(mgr, mgr->mst_primary);
> +		if (mstb) {
> +			list_for_each_entry(port, &mstb->ports, next) {
> +				if (drm_dp_mst_edids_changed(mgr, port)) {
> +					ret = -EINVAL;
> +					break;
> +				}
> +			}
> +
> +			drm_dp_put_mst_branch_device(mstb);
> +		}
> +	} else {
>  		ret = -1;
> +		mutex_unlock(&mgr->lock);
> +	}
> +
> +	return ret;
>  
>  out_unlock:
>  	mutex_unlock(&mgr->lock);
-- 
Cheers,
	Lyude Paul