All of lore.kernel.org
 help / color / mirror / Atom feed
* Fwd: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
       [not found] <680389204.71.1512980755810.JavaMail.jenkins@ip-172-30-0-246>
@ 2017-12-11 10:30 ` Guillaume Tucker
  2017-12-11 17:05     ` Daniel Vetter
  0 siblings, 1 reply; 35+ messages in thread
From: Guillaume Tucker @ 2017-12-11 10:30 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Mark Brown, Kevin Hilman, Matt Hart, Thierry Escande,
	Tomeu Vizoso, Enric Balletbo i Serra, linux-samsung-soc,
	kernel-build-reports, linux-arm-kernel

Hi Daniel,

Please see below, I've had several bisection results pointing at
that commit over the week-end on mainline but also on linux-next
and net-next.  While the peach-pi is a bit flaky at the moment
and is likely to have more than one issue, it does seem like this
commit is causing some well reproducible kernel hang.

Here's a re-run with v4.15-rc3 showing the issue:

   https://lava.collabora.co.uk/scheduler/job/1018478

and here's another one with the change mentioned below reverted:

   https://lava.collabora.co.uk/scheduler/job/1018479

They both show a warning about "unbalanced disables for lcd_vdd",
I don't know if this is related as I haven't investigated any
further.  It does appear to reliably hang with v4.15-rc3 and
boot most of the time with the commit reverted though.

The automated kernelci.org bisection is still an experimental
tool and it may well be a false positive, so please take this
result with a pinch of salt...

Hope this helps!

Best wishes,
Guillaume


-------- Forwarded Message --------
Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC)
From: kernelci.org bot <bot@kernelci.org>
To: guillaume.tucker@collabora.com

Bisection result for mainline/master (v4.15-rc3) on peach-pi

Good known revision:

     c6b3e96 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Bad known revision:

     50c4c4e Linux 4.15-rc3

Extra parameters:

     Tree:      mainline
     Branch:    master
     Target:    peach-pi
     Lab:       lab-collabora
     Defconfig: exynos_defconfig
     Plan:      boot


Breaking commit found:

-------------------------------------------------------------------------------
commit a703c55004e1c5076d57e43771b3e11117796ea0
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Dec 4 21:48:18 2017 +0100

     drm: safely free connectors from connector_iter
     
     In
     
     commit 613051dac40da1751ab269572766d3348d45a197
     Author: Daniel Vetter <daniel.vetter@ffwll.ch>
     Date:   Wed Dec 14 00:08:06 2016 +0100
     
         drm: locking&new iterators for connector_list
     
     we've went to extreme lengths to make sure connector iterations works
     in any context, without introducing any additional locking context.
     This worked, except for a small fumble in the implementation:
     
     When we actually race with a concurrent connector unplug event, and
     our temporary connector reference turns out to be the final one, then
     everything breaks: We call the connector release function from
     whatever context we happen to be in, which can be an irq/atomic
     context. And connector freeing grabs all kinds of locks and stuff.
     
     Fix this by creating a specially safe put function for connetor_iter,
     which (in this rare case) punts the cleanup to a worker.
     
     Reported-by: Ben Widawsky <ben@bwidawsk.net>
     Cc: Ben Widawsky <ben@bwidawsk.net>
     Fixes: 613051dac40d ("drm: locking&new iterators for connector_list")
     Cc: Dave Airlie <airlied@gmail.com>
     Cc: Chris Wilson <chris@chris-wilson.co.uk>
     Cc: Sean Paul <seanpaul@chromium.org>
     Cc: <stable@vger.kernel.org> # v4.11+
     Reviewed-by: Dave Airlie <airlied@gmail.com>
     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
     Link: https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter@ffwll.ch

diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c
index 25f4b2e..4820141 100644
--- a/drivers/gpu/drm/drm_connector.c
+++ b/drivers/gpu/drm/drm_connector.c
@@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref)
  	connector->funcs->destroy(connector);
  }
  
+static void drm_connector_free_work_fn(struct work_struct *work)
+{
+	struct drm_connector *connector =
+		container_of(work, struct drm_connector, free_work);
+	struct drm_device *dev = connector->dev;
+
+	drm_mode_object_unregister(dev, &connector->base);
+	connector->funcs->destroy(connector);
+}
+
  /**
   * drm_connector_init - Init a preallocated connector
   * @dev: DRM device
@@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev,
  	if (ret)
  		return ret;
  
+	INIT_WORK(&connector->free_work, drm_connector_free_work_fn);
+
  	connector->base.properties = &connector->properties;
  	connector->dev = dev;
  	connector->funcs = funcs;
@@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device *dev,
  }
  EXPORT_SYMBOL(drm_connector_list_iter_begin);
  
+/*
+ * Extra-safe connector put function that works in any context. Should only be
+ * used from the connector_iter functions, where we never really expect to
+ * actually release the connector when dropping our final reference.
+ */
+static void
+drm_connector_put_safe(struct drm_connector *conn)
+{
+	if (refcount_dec_and_test(&conn->base.refcount.refcount))
+		schedule_work(&conn->free_work);
+}
+
  /**
   * drm_connector_list_iter_next - return next connector
   * @iter: connectr_list iterator
@@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct drm_connector_list_iter *iter)
  	spin_unlock_irqrestore(&config->connector_list_lock, flags);
  
  	if (old_conn)
-		drm_connector_put(old_conn);
+		drm_connector_put_safe(old_conn);
  
  	return iter->conn;
  }
@@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct drm_connector_list_iter *iter)
  {
  	iter->dev = NULL;
  	if (iter->conn)
-		drm_connector_put(iter->conn);
+		drm_connector_put_safe(iter->conn);
  	lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
  }
  EXPORT_SYMBOL(drm_connector_list_iter_end);
diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c
index cda8bfa..cc78b3d 100644
--- a/drivers/gpu/drm/drm_mode_config.c
+++ b/drivers/gpu/drm/drm_mode_config.c
@@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev)
  		drm_connector_put(connector);
  	}
  	drm_connector_list_iter_end(&conn_iter);
+	/* connector_iter drops references in a work item. */
+	flush_scheduled_work();
  	if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) {
  		drm_connector_list_iter_begin(dev, &conn_iter);
  		drm_for_each_connector_iter(connector, &conn_iter)
diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h
index df9807a..a4649c5 100644
--- a/include/drm/drm_connector.h
+++ b/include/drm/drm_connector.h
@@ -916,6 +916,14 @@ struct drm_connector {
  	uint8_t num_h_tile, num_v_tile;
  	uint8_t tile_h_loc, tile_v_loc;
  	uint16_t tile_h_size, tile_v_size;
+
+	/**
+	 * @free_work:
+	 *
+	 * Work used only by &drm_connector_iter to be able to clean up a
+	 * connector from any context.
+	 */
+	struct work_struct free_work;
  };
  
  #define obj_to_connector(x) container_of(x, struct drm_connector, base)
-------------------------------------------------------------------------------


Git bisection log:

-------------------------------------------------------------------------------
git bisect start
# good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94
# bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3
git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36
# bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d
# bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d
# bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux
git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12
# bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992
# bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop NONCONTIG flag for buffers allocated without IOMMU
git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80
# good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the shrink request to prevent skip huge pool
git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6
# good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
git bisect good db8f884ca7fe6af64d443d1510464efe23826131
# bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828
# bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter
git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0
# first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free connectors from connector_iter
-------------------------------------------------------------------------------

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-11 10:30 ` Fwd: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging Guillaume Tucker
@ 2017-12-11 17:05     ` Daniel Vetter
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Vetter @ 2017-12-11 17:05 UTC (permalink / raw)
  To: Guillaume Tucker
  Cc: Mark Brown, Kevin Hilman, Matt Hart, Thierry Escande,
	Tomeu Vizoso, Enric Balletbo i Serra, linux-samsung-soc,
	kernel-build-reports, linux-arm-kernel

On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:
> Hi Daniel,
>
> Please see below, I've had several bisection results pointing at
> that commit over the week-end on mainline but also on linux-next
> and net-next.  While the peach-pi is a bit flaky at the moment
> and is likely to have more than one issue, it does seem like this
> commit is causing some well reproducible kernel hang.
>
> Here's a re-run with v4.15-rc3 showing the issue:
>
>   https://lava.collabora.co.uk/scheduler/job/1018478
>
> and here's another one with the change mentioned below reverted:
>
>   https://lava.collabora.co.uk/scheduler/job/1018479
>
> They both show a warning about "unbalanced disables for lcd_vdd",
> I don't know if this is related as I haven't investigated any
> further.  It does appear to reliably hang with v4.15-rc3 and
> boot most of the time with the commit reverted though.
>
> The automated kernelci.org bisection is still an experimental
> tool and it may well be a false positive, so please take this
> result with a pinch of salt...

The patch just very minimal moves the connector cleanup around (so
timing change), but except when you unload a driver (or maybe that
funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
more info than "seems to hang a bit more" I have no idea what's wrong.
The patch itself should work, at least it survived quite some serious
testing we do on everything.
-Daniel

> Hope this helps!
>
> Best wishes,
> Guillaume
>
>
> -------- Forwarded Message --------
> Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
> Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC)
> From: kernelci.org bot <bot@kernelci.org>
> To: guillaume.tucker@collabora.com
>
> Bisection result for mainline/master (v4.15-rc3) on peach-pi
>
> Good known revision:
>
>     c6b3e96 Merge branch 'for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
>
> Bad known revision:
>
>     50c4c4e Linux 4.15-rc3
>
> Extra parameters:
>
>     Tree:      mainline
>     Branch:    master
>     Target:    peach-pi
>     Lab:       lab-collabora
>     Defconfig: exynos_defconfig
>     Plan:      boot
>
>
> Breaking commit found:
>
> -------------------------------------------------------------------------------
> commit a703c55004e1c5076d57e43771b3e11117796ea0
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Mon Dec 4 21:48:18 2017 +0100
>
>     drm: safely free connectors from connector_iter
>         In
>         commit 613051dac40da1751ab269572766d3348d45a197
>     Author: Daniel Vetter <daniel.vetter@ffwll.ch>
>     Date:   Wed Dec 14 00:08:06 2016 +0100
>             drm: locking&new iterators for connector_list
>         we've went to extreme lengths to make sure connector iterations
> works
>     in any context, without introducing any additional locking context.
>     This worked, except for a small fumble in the implementation:
>         When we actually race with a concurrent connector unplug event, and
>     our temporary connector reference turns out to be the final one, then
>     everything breaks: We call the connector release function from
>     whatever context we happen to be in, which can be an irq/atomic
>     context. And connector freeing grabs all kinds of locks and stuff.
>         Fix this by creating a specially safe put function for
> connetor_iter,
>     which (in this rare case) punts the cleanup to a worker.
>         Reported-by: Ben Widawsky <ben@bwidawsk.net>
>     Cc: Ben Widawsky <ben@bwidawsk.net>
>     Fixes: 613051dac40d ("drm: locking&new iterators for connector_list")
>     Cc: Dave Airlie <airlied@gmail.com>
>     Cc: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Sean Paul <seanpaul@chromium.org>
>     Cc: <stable@vger.kernel.org> # v4.11+
>     Reviewed-by: Dave Airlie <airlied@gmail.com>
>     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>     Link:
> https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter@ffwll.ch
>
> diff --git a/drivers/gpu/drm/drm_connector.c
> b/drivers/gpu/drm/drm_connector.c
> index 25f4b2e..4820141 100644
> --- a/drivers/gpu/drm/drm_connector.c
> +++ b/drivers/gpu/drm/drm_connector.c
> @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref)
>         connector->funcs->destroy(connector);
>  }
>  +static void drm_connector_free_work_fn(struct work_struct *work)
> +{
> +       struct drm_connector *connector =
> +               container_of(work, struct drm_connector, free_work);
> +       struct drm_device *dev = connector->dev;
> +
> +       drm_mode_object_unregister(dev, &connector->base);
> +       connector->funcs->destroy(connector);
> +}
> +
>  /**
>   * drm_connector_init - Init a preallocated connector
>   * @dev: DRM device
> @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev,
>         if (ret)
>                 return ret;
>  +      INIT_WORK(&connector->free_work, drm_connector_free_work_fn);
> +
>         connector->base.properties = &connector->properties;
>         connector->dev = dev;
>         connector->funcs = funcs;
> @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device
> *dev,
>  }
>  EXPORT_SYMBOL(drm_connector_list_iter_begin);
>  +/*
> + * Extra-safe connector put function that works in any context. Should only
> be
> + * used from the connector_iter functions, where we never really expect to
> + * actually release the connector when dropping our final reference.
> + */
> +static void
> +drm_connector_put_safe(struct drm_connector *conn)
> +{
> +       if (refcount_dec_and_test(&conn->base.refcount.refcount))
> +               schedule_work(&conn->free_work);
> +}
> +
>  /**
>   * drm_connector_list_iter_next - return next connector
>   * @iter: connectr_list iterator
> @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct
> drm_connector_list_iter *iter)
>         spin_unlock_irqrestore(&config->connector_list_lock, flags);
>         if (old_conn)
> -               drm_connector_put(old_conn);
> +               drm_connector_put_safe(old_conn);
>         return iter->conn;
>  }
> @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct
> drm_connector_list_iter *iter)
>  {
>         iter->dev = NULL;
>         if (iter->conn)
> -               drm_connector_put(iter->conn);
> +               drm_connector_put_safe(iter->conn);
>         lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
>  }
>  EXPORT_SYMBOL(drm_connector_list_iter_end);
> diff --git a/drivers/gpu/drm/drm_mode_config.c
> b/drivers/gpu/drm/drm_mode_config.c
> index cda8bfa..cc78b3d 100644
> --- a/drivers/gpu/drm/drm_mode_config.c
> +++ b/drivers/gpu/drm/drm_mode_config.c
> @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev)
>                 drm_connector_put(connector);
>         }
>         drm_connector_list_iter_end(&conn_iter);
> +       /* connector_iter drops references in a work item. */
> +       flush_scheduled_work();
>         if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) {
>                 drm_connector_list_iter_begin(dev, &conn_iter);
>                 drm_for_each_connector_iter(connector, &conn_iter)
> diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h
> index df9807a..a4649c5 100644
> --- a/include/drm/drm_connector.h
> +++ b/include/drm/drm_connector.h
> @@ -916,6 +916,14 @@ struct drm_connector {
>         uint8_t num_h_tile, num_v_tile;
>         uint8_t tile_h_loc, tile_v_loc;
>         uint16_t tile_h_size, tile_v_size;
> +
> +       /**
> +        * @free_work:
> +        *
> +        * Work used only by &drm_connector_iter to be able to clean up a
> +        * connector from any context.
> +        */
> +       struct work_struct free_work;
>  };
>   #define obj_to_connector(x) container_of(x, struct drm_connector, base)
> -------------------------------------------------------------------------------
>
>
> Git bisection log:
>
> -------------------------------------------------------------------------------
> git bisect start
> # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus'
> of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
> git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94
> # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3
> git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36
> # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d
> # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2'
> of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
> git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d
> # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag
> 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux
> git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12
> # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag
> 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel
> into drm-fixes
> git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992
> # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop
> NONCONTIG flag for buffers allocated without IOMMU
> git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80
> # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the
> shrink request to prevent skip huge pool
> git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6
> # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch
> 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
> git bisect good db8f884ca7fe6af64d443d1510464efe23826131
> # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag
> 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc
> into drm-fixes
> git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828
> # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free
> connectors from connector_iter
> git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0
> # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely
> free connectors from connector_iter
> -------------------------------------------------------------------------------



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-11 17:05     ` Daniel Vetter
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Vetter @ 2017-12-11 17:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
<guillaume.tucker@collabora.com> wrote:
> Hi Daniel,
>
> Please see below, I've had several bisection results pointing at
> that commit over the week-end on mainline but also on linux-next
> and net-next.  While the peach-pi is a bit flaky at the moment
> and is likely to have more than one issue, it does seem like this
> commit is causing some well reproducible kernel hang.
>
> Here's a re-run with v4.15-rc3 showing the issue:
>
>   https://lava.collabora.co.uk/scheduler/job/1018478
>
> and here's another one with the change mentioned below reverted:
>
>   https://lava.collabora.co.uk/scheduler/job/1018479
>
> They both show a warning about "unbalanced disables for lcd_vdd",
> I don't know if this is related as I haven't investigated any
> further.  It does appear to reliably hang with v4.15-rc3 and
> boot most of the time with the commit reverted though.
>
> The automated kernelci.org bisection is still an experimental
> tool and it may well be a false positive, so please take this
> result with a pinch of salt...

The patch just very minimal moves the connector cleanup around (so
timing change), but except when you unload a driver (or maybe that
funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
more info than "seems to hang a bit more" I have no idea what's wrong.
The patch itself should work, at least it survived quite some serious
testing we do on everything.
-Daniel

> Hope this helps!
>
> Best wishes,
> Guillaume
>
>
> -------- Forwarded Message --------
> Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
> Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC)
> From: kernelci.org bot <bot@kernelci.org>
> To: guillaume.tucker at collabora.com
>
> Bisection result for mainline/master (v4.15-rc3) on peach-pi
>
> Good known revision:
>
>     c6b3e96 Merge branch 'for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
>
> Bad known revision:
>
>     50c4c4e Linux 4.15-rc3
>
> Extra parameters:
>
>     Tree:      mainline
>     Branch:    master
>     Target:    peach-pi
>     Lab:       lab-collabora
>     Defconfig: exynos_defconfig
>     Plan:      boot
>
>
> Breaking commit found:
>
> -------------------------------------------------------------------------------
> commit a703c55004e1c5076d57e43771b3e11117796ea0
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Mon Dec 4 21:48:18 2017 +0100
>
>     drm: safely free connectors from connector_iter
>         In
>         commit 613051dac40da1751ab269572766d3348d45a197
>     Author: Daniel Vetter <daniel.vetter@ffwll.ch>
>     Date:   Wed Dec 14 00:08:06 2016 +0100
>             drm: locking&new iterators for connector_list
>         we've went to extreme lengths to make sure connector iterations
> works
>     in any context, without introducing any additional locking context.
>     This worked, except for a small fumble in the implementation:
>         When we actually race with a concurrent connector unplug event, and
>     our temporary connector reference turns out to be the final one, then
>     everything breaks: We call the connector release function from
>     whatever context we happen to be in, which can be an irq/atomic
>     context. And connector freeing grabs all kinds of locks and stuff.
>         Fix this by creating a specially safe put function for
> connetor_iter,
>     which (in this rare case) punts the cleanup to a worker.
>         Reported-by: Ben Widawsky <ben@bwidawsk.net>
>     Cc: Ben Widawsky <ben@bwidawsk.net>
>     Fixes: 613051dac40d ("drm: locking&new iterators for connector_list")
>     Cc: Dave Airlie <airlied@gmail.com>
>     Cc: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Sean Paul <seanpaul@chromium.org>
>     Cc: <stable@vger.kernel.org> # v4.11+
>     Reviewed-by: Dave Airlie <airlied@gmail.com>
>     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>     Link:
> https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter at ffwll.ch
>
> diff --git a/drivers/gpu/drm/drm_connector.c
> b/drivers/gpu/drm/drm_connector.c
> index 25f4b2e..4820141 100644
> --- a/drivers/gpu/drm/drm_connector.c
> +++ b/drivers/gpu/drm/drm_connector.c
> @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref)
>         connector->funcs->destroy(connector);
>  }
>  +static void drm_connector_free_work_fn(struct work_struct *work)
> +{
> +       struct drm_connector *connector =
> +               container_of(work, struct drm_connector, free_work);
> +       struct drm_device *dev = connector->dev;
> +
> +       drm_mode_object_unregister(dev, &connector->base);
> +       connector->funcs->destroy(connector);
> +}
> +
>  /**
>   * drm_connector_init - Init a preallocated connector
>   * @dev: DRM device
> @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev,
>         if (ret)
>                 return ret;
>  +      INIT_WORK(&connector->free_work, drm_connector_free_work_fn);
> +
>         connector->base.properties = &connector->properties;
>         connector->dev = dev;
>         connector->funcs = funcs;
> @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device
> *dev,
>  }
>  EXPORT_SYMBOL(drm_connector_list_iter_begin);
>  +/*
> + * Extra-safe connector put function that works in any context. Should only
> be
> + * used from the connector_iter functions, where we never really expect to
> + * actually release the connector when dropping our final reference.
> + */
> +static void
> +drm_connector_put_safe(struct drm_connector *conn)
> +{
> +       if (refcount_dec_and_test(&conn->base.refcount.refcount))
> +               schedule_work(&conn->free_work);
> +}
> +
>  /**
>   * drm_connector_list_iter_next - return next connector
>   * @iter: connectr_list iterator
> @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct
> drm_connector_list_iter *iter)
>         spin_unlock_irqrestore(&config->connector_list_lock, flags);
>         if (old_conn)
> -               drm_connector_put(old_conn);
> +               drm_connector_put_safe(old_conn);
>         return iter->conn;
>  }
> @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct
> drm_connector_list_iter *iter)
>  {
>         iter->dev = NULL;
>         if (iter->conn)
> -               drm_connector_put(iter->conn);
> +               drm_connector_put_safe(iter->conn);
>         lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
>  }
>  EXPORT_SYMBOL(drm_connector_list_iter_end);
> diff --git a/drivers/gpu/drm/drm_mode_config.c
> b/drivers/gpu/drm/drm_mode_config.c
> index cda8bfa..cc78b3d 100644
> --- a/drivers/gpu/drm/drm_mode_config.c
> +++ b/drivers/gpu/drm/drm_mode_config.c
> @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev)
>                 drm_connector_put(connector);
>         }
>         drm_connector_list_iter_end(&conn_iter);
> +       /* connector_iter drops references in a work item. */
> +       flush_scheduled_work();
>         if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) {
>                 drm_connector_list_iter_begin(dev, &conn_iter);
>                 drm_for_each_connector_iter(connector, &conn_iter)
> diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h
> index df9807a..a4649c5 100644
> --- a/include/drm/drm_connector.h
> +++ b/include/drm/drm_connector.h
> @@ -916,6 +916,14 @@ struct drm_connector {
>         uint8_t num_h_tile, num_v_tile;
>         uint8_t tile_h_loc, tile_v_loc;
>         uint16_t tile_h_size, tile_v_size;
> +
> +       /**
> +        * @free_work:
> +        *
> +        * Work used only by &drm_connector_iter to be able to clean up a
> +        * connector from any context.
> +        */
> +       struct work_struct free_work;
>  };
>   #define obj_to_connector(x) container_of(x, struct drm_connector, base)
> -------------------------------------------------------------------------------
>
>
> Git bisection log:
>
> -------------------------------------------------------------------------------
> git bisect start
> # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus'
> of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
> git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94
> # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3
> git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36
> # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d
> # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2'
> of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
> git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d
> # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag
> 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux
> git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12
> # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag
> 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel
> into drm-fixes
> git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992
> # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop
> NONCONTIG flag for buffers allocated without IOMMU
> git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80
> # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the
> shrink request to prevent skip huge pool
> git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6
> # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch
> 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
> git bisect good db8f884ca7fe6af64d443d1510464efe23826131
> # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag
> 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc
> into drm-fixes
> git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828
> # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free
> connectors from connector_iter
> git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0
> # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely
> free connectors from connector_iter
> -------------------------------------------------------------------------------



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-11 17:05     ` Daniel Vetter
@ 2017-12-11 22:28       ` Javier Martinez Canillas
  -1 siblings, 0 replies; 35+ messages in thread
From: Javier Martinez Canillas @ 2017-12-11 22:28 UTC (permalink / raw)
  To: Daniel Vetter, Marek Szyprowski, Shuah Khan
  Cc: Guillaume Tucker, Mark Brown, Kevin Hilman, Matt Hart,
	Thierry Escande, Tomeu Vizoso, Enric Balletbo i Serra,
	linux-samsung-soc, kernel-build-reports, linux-arm-kernel

[adding Marek and Shuah to cc list]

On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
> <guillaume.tucker@collabora.com> wrote:
>> Hi Daniel,
>>
>> Please see below, I've had several bisection results pointing at
>> that commit over the week-end on mainline but also on linux-next
>> and net-next.  While the peach-pi is a bit flaky at the moment
>> and is likely to have more than one issue, it does seem like this
>> commit is causing some well reproducible kernel hang.
>>
>> Here's a re-run with v4.15-rc3 showing the issue:
>>
>>   https://lava.collabora.co.uk/scheduler/job/1018478
>>
>> and here's another one with the change mentioned below reverted:
>>
>>   https://lava.collabora.co.uk/scheduler/job/1018479
>>
>> They both show a warning about "unbalanced disables for lcd_vdd",
>> I don't know if this is related as I haven't investigated any
>> further.  It does appear to reliably hang with v4.15-rc3 and
>> boot most of the time with the commit reverted though.
>>
>> The automated kernelci.org bisection is still an experimental
>> tool and it may well be a false positive, so please take this
>> result with a pinch of salt...
>
> The patch just very minimal moves the connector cleanup around (so
> timing change), but except when you unload a driver (or maybe that
> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
> more info than "seems to hang a bit more" I have no idea what's wrong.
> The patch itself should work, at least it survived quite some serious
> testing we do on everything.
> -Daniel
>

Marek was pointing to a different culprit [0] in this [1] thread. I
see that both commits made it to v4.15-rc3, which is the first version
where boot fails. So maybe is a combination of both? Or rather
reverting one patch masks the error in the other.

I've access to the machine but unfortunately not a lot of time to dig
on this, I could try to do it in the weekend though.

[0]: https://patchwork.kernel.org/patch/10067711/
[1]: https://www.spinics.net/lists/arm-kernel/msg622152.html

Best regards,
Javier

>> Hope this helps!
>>
>> Best wishes,
>> Guillaume
>>
>>
>> -------- Forwarded Message --------
>> Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
>> Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC)
>> From: kernelci.org bot <bot@kernelci.org>
>> To: guillaume.tucker@collabora.com
>>
>> Bisection result for mainline/master (v4.15-rc3) on peach-pi
>>
>> Good known revision:
>>
>>     c6b3e96 Merge branch 'for-linus' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
>>
>> Bad known revision:
>>
>>     50c4c4e Linux 4.15-rc3
>>
>> Extra parameters:
>>
>>     Tree:      mainline
>>     Branch:    master
>>     Target:    peach-pi
>>     Lab:       lab-collabora
>>     Defconfig: exynos_defconfig
>>     Plan:      boot
>>
>>
>> Breaking commit found:
>>
>> -------------------------------------------------------------------------------
>> commit a703c55004e1c5076d57e43771b3e11117796ea0
>> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Date:   Mon Dec 4 21:48:18 2017 +0100
>>
>>     drm: safely free connectors from connector_iter
>>         In
>>         commit 613051dac40da1751ab269572766d3348d45a197
>>     Author: Daniel Vetter <daniel.vetter@ffwll.ch>
>>     Date:   Wed Dec 14 00:08:06 2016 +0100
>>             drm: locking&new iterators for connector_list
>>         we've went to extreme lengths to make sure connector iterations
>> works
>>     in any context, without introducing any additional locking context.
>>     This worked, except for a small fumble in the implementation:
>>         When we actually race with a concurrent connector unplug event, and
>>     our temporary connector reference turns out to be the final one, then
>>     everything breaks: We call the connector release function from
>>     whatever context we happen to be in, which can be an irq/atomic
>>     context. And connector freeing grabs all kinds of locks and stuff.
>>         Fix this by creating a specially safe put function for
>> connetor_iter,
>>     which (in this rare case) punts the cleanup to a worker.
>>         Reported-by: Ben Widawsky <ben@bwidawsk.net>
>>     Cc: Ben Widawsky <ben@bwidawsk.net>
>>     Fixes: 613051dac40d ("drm: locking&new iterators for connector_list")
>>     Cc: Dave Airlie <airlied@gmail.com>
>>     Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>     Cc: Sean Paul <seanpaul@chromium.org>
>>     Cc: <stable@vger.kernel.org> # v4.11+
>>     Reviewed-by: Dave Airlie <airlied@gmail.com>
>>     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>     Link:
>> https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter@ffwll.ch
>>
>> diff --git a/drivers/gpu/drm/drm_connector.c
>> b/drivers/gpu/drm/drm_connector.c
>> index 25f4b2e..4820141 100644
>> --- a/drivers/gpu/drm/drm_connector.c
>> +++ b/drivers/gpu/drm/drm_connector.c
>> @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref)
>>         connector->funcs->destroy(connector);
>>  }
>>  +static void drm_connector_free_work_fn(struct work_struct *work)
>> +{
>> +       struct drm_connector *connector =
>> +               container_of(work, struct drm_connector, free_work);
>> +       struct drm_device *dev = connector->dev;
>> +
>> +       drm_mode_object_unregister(dev, &connector->base);
>> +       connector->funcs->destroy(connector);
>> +}
>> +
>>  /**
>>   * drm_connector_init - Init a preallocated connector
>>   * @dev: DRM device
>> @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev,
>>         if (ret)
>>                 return ret;
>>  +      INIT_WORK(&connector->free_work, drm_connector_free_work_fn);
>> +
>>         connector->base.properties = &connector->properties;
>>         connector->dev = dev;
>>         connector->funcs = funcs;
>> @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device
>> *dev,
>>  }
>>  EXPORT_SYMBOL(drm_connector_list_iter_begin);
>>  +/*
>> + * Extra-safe connector put function that works in any context. Should only
>> be
>> + * used from the connector_iter functions, where we never really expect to
>> + * actually release the connector when dropping our final reference.
>> + */
>> +static void
>> +drm_connector_put_safe(struct drm_connector *conn)
>> +{
>> +       if (refcount_dec_and_test(&conn->base.refcount.refcount))
>> +               schedule_work(&conn->free_work);
>> +}
>> +
>>  /**
>>   * drm_connector_list_iter_next - return next connector
>>   * @iter: connectr_list iterator
>> @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct
>> drm_connector_list_iter *iter)
>>         spin_unlock_irqrestore(&config->connector_list_lock, flags);
>>         if (old_conn)
>> -               drm_connector_put(old_conn);
>> +               drm_connector_put_safe(old_conn);
>>         return iter->conn;
>>  }
>> @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct
>> drm_connector_list_iter *iter)
>>  {
>>         iter->dev = NULL;
>>         if (iter->conn)
>> -               drm_connector_put(iter->conn);
>> +               drm_connector_put_safe(iter->conn);
>>         lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
>>  }
>>  EXPORT_SYMBOL(drm_connector_list_iter_end);
>> diff --git a/drivers/gpu/drm/drm_mode_config.c
>> b/drivers/gpu/drm/drm_mode_config.c
>> index cda8bfa..cc78b3d 100644
>> --- a/drivers/gpu/drm/drm_mode_config.c
>> +++ b/drivers/gpu/drm/drm_mode_config.c
>> @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev)
>>                 drm_connector_put(connector);
>>         }
>>         drm_connector_list_iter_end(&conn_iter);
>> +       /* connector_iter drops references in a work item. */
>> +       flush_scheduled_work();
>>         if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) {
>>                 drm_connector_list_iter_begin(dev, &conn_iter);
>>                 drm_for_each_connector_iter(connector, &conn_iter)
>> diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h
>> index df9807a..a4649c5 100644
>> --- a/include/drm/drm_connector.h
>> +++ b/include/drm/drm_connector.h
>> @@ -916,6 +916,14 @@ struct drm_connector {
>>         uint8_t num_h_tile, num_v_tile;
>>         uint8_t tile_h_loc, tile_v_loc;
>>         uint16_t tile_h_size, tile_v_size;
>> +
>> +       /**
>> +        * @free_work:
>> +        *
>> +        * Work used only by &drm_connector_iter to be able to clean up a
>> +        * connector from any context.
>> +        */
>> +       struct work_struct free_work;
>>  };
>>   #define obj_to_connector(x) container_of(x, struct drm_connector, base)
>> -------------------------------------------------------------------------------
>>
>>
>> Git bisection log:
>>
>> -------------------------------------------------------------------------------
>> git bisect start
>> # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus'
>> of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
>> git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94
>> # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3
>> git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36
>> # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
>> git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d
>> # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2'
>> of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
>> git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d
>> # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag
>> 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux
>> git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12
>> # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag
>> 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel
>> into drm-fixes
>> git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992
>> # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop
>> NONCONTIG flag for buffers allocated without IOMMU
>> git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80
>> # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the
>> shrink request to prevent skip huge pool
>> git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6
>> # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch
>> 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
>> git bisect good db8f884ca7fe6af64d443d1510464efe23826131
>> # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag
>> 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc
>> into drm-fixes
>> git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828
>> # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free
>> connectors from connector_iter
>> git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0
>> # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely
>> free connectors from connector_iter
>> -------------------------------------------------------------------------------
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> --
> To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-11 22:28       ` Javier Martinez Canillas
  0 siblings, 0 replies; 35+ messages in thread
From: Javier Martinez Canillas @ 2017-12-11 22:28 UTC (permalink / raw)
  To: linux-arm-kernel

[adding Marek and Shuah to cc list]

On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
> <guillaume.tucker@collabora.com> wrote:
>> Hi Daniel,
>>
>> Please see below, I've had several bisection results pointing at
>> that commit over the week-end on mainline but also on linux-next
>> and net-next.  While the peach-pi is a bit flaky at the moment
>> and is likely to have more than one issue, it does seem like this
>> commit is causing some well reproducible kernel hang.
>>
>> Here's a re-run with v4.15-rc3 showing the issue:
>>
>>   https://lava.collabora.co.uk/scheduler/job/1018478
>>
>> and here's another one with the change mentioned below reverted:
>>
>>   https://lava.collabora.co.uk/scheduler/job/1018479
>>
>> They both show a warning about "unbalanced disables for lcd_vdd",
>> I don't know if this is related as I haven't investigated any
>> further.  It does appear to reliably hang with v4.15-rc3 and
>> boot most of the time with the commit reverted though.
>>
>> The automated kernelci.org bisection is still an experimental
>> tool and it may well be a false positive, so please take this
>> result with a pinch of salt...
>
> The patch just very minimal moves the connector cleanup around (so
> timing change), but except when you unload a driver (or maybe that
> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
> more info than "seems to hang a bit more" I have no idea what's wrong.
> The patch itself should work, at least it survived quite some serious
> testing we do on everything.
> -Daniel
>

Marek was pointing to a different culprit [0] in this [1] thread. I
see that both commits made it to v4.15-rc3, which is the first version
where boot fails. So maybe is a combination of both? Or rather
reverting one patch masks the error in the other.

I've access to the machine but unfortunately not a lot of time to dig
on this, I could try to do it in the weekend though.

[0]: https://patchwork.kernel.org/patch/10067711/
[1]: https://www.spinics.net/lists/arm-kernel/msg622152.html

Best regards,
Javier

>> Hope this helps!
>>
>> Best wishes,
>> Guillaume
>>
>>
>> -------- Forwarded Message --------
>> Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
>> Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC)
>> From: kernelci.org bot <bot@kernelci.org>
>> To: guillaume.tucker at collabora.com
>>
>> Bisection result for mainline/master (v4.15-rc3) on peach-pi
>>
>> Good known revision:
>>
>>     c6b3e96 Merge branch 'for-linus' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
>>
>> Bad known revision:
>>
>>     50c4c4e Linux 4.15-rc3
>>
>> Extra parameters:
>>
>>     Tree:      mainline
>>     Branch:    master
>>     Target:    peach-pi
>>     Lab:       lab-collabora
>>     Defconfig: exynos_defconfig
>>     Plan:      boot
>>
>>
>> Breaking commit found:
>>
>> -------------------------------------------------------------------------------
>> commit a703c55004e1c5076d57e43771b3e11117796ea0
>> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
>> Date:   Mon Dec 4 21:48:18 2017 +0100
>>
>>     drm: safely free connectors from connector_iter
>>         In
>>         commit 613051dac40da1751ab269572766d3348d45a197
>>     Author: Daniel Vetter <daniel.vetter@ffwll.ch>
>>     Date:   Wed Dec 14 00:08:06 2016 +0100
>>             drm: locking&new iterators for connector_list
>>         we've went to extreme lengths to make sure connector iterations
>> works
>>     in any context, without introducing any additional locking context.
>>     This worked, except for a small fumble in the implementation:
>>         When we actually race with a concurrent connector unplug event, and
>>     our temporary connector reference turns out to be the final one, then
>>     everything breaks: We call the connector release function from
>>     whatever context we happen to be in, which can be an irq/atomic
>>     context. And connector freeing grabs all kinds of locks and stuff.
>>         Fix this by creating a specially safe put function for
>> connetor_iter,
>>     which (in this rare case) punts the cleanup to a worker.
>>         Reported-by: Ben Widawsky <ben@bwidawsk.net>
>>     Cc: Ben Widawsky <ben@bwidawsk.net>
>>     Fixes: 613051dac40d ("drm: locking&new iterators for connector_list")
>>     Cc: Dave Airlie <airlied@gmail.com>
>>     Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>     Cc: Sean Paul <seanpaul@chromium.org>
>>     Cc: <stable@vger.kernel.org> # v4.11+
>>     Reviewed-by: Dave Airlie <airlied@gmail.com>
>>     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>     Link:
>> https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter at ffwll.ch
>>
>> diff --git a/drivers/gpu/drm/drm_connector.c
>> b/drivers/gpu/drm/drm_connector.c
>> index 25f4b2e..4820141 100644
>> --- a/drivers/gpu/drm/drm_connector.c
>> +++ b/drivers/gpu/drm/drm_connector.c
>> @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref)
>>         connector->funcs->destroy(connector);
>>  }
>>  +static void drm_connector_free_work_fn(struct work_struct *work)
>> +{
>> +       struct drm_connector *connector =
>> +               container_of(work, struct drm_connector, free_work);
>> +       struct drm_device *dev = connector->dev;
>> +
>> +       drm_mode_object_unregister(dev, &connector->base);
>> +       connector->funcs->destroy(connector);
>> +}
>> +
>>  /**
>>   * drm_connector_init - Init a preallocated connector
>>   * @dev: DRM device
>> @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev,
>>         if (ret)
>>                 return ret;
>>  +      INIT_WORK(&connector->free_work, drm_connector_free_work_fn);
>> +
>>         connector->base.properties = &connector->properties;
>>         connector->dev = dev;
>>         connector->funcs = funcs;
>> @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device
>> *dev,
>>  }
>>  EXPORT_SYMBOL(drm_connector_list_iter_begin);
>>  +/*
>> + * Extra-safe connector put function that works in any context. Should only
>> be
>> + * used from the connector_iter functions, where we never really expect to
>> + * actually release the connector when dropping our final reference.
>> + */
>> +static void
>> +drm_connector_put_safe(struct drm_connector *conn)
>> +{
>> +       if (refcount_dec_and_test(&conn->base.refcount.refcount))
>> +               schedule_work(&conn->free_work);
>> +}
>> +
>>  /**
>>   * drm_connector_list_iter_next - return next connector
>>   * @iter: connectr_list iterator
>> @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct
>> drm_connector_list_iter *iter)
>>         spin_unlock_irqrestore(&config->connector_list_lock, flags);
>>         if (old_conn)
>> -               drm_connector_put(old_conn);
>> +               drm_connector_put_safe(old_conn);
>>         return iter->conn;
>>  }
>> @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct
>> drm_connector_list_iter *iter)
>>  {
>>         iter->dev = NULL;
>>         if (iter->conn)
>> -               drm_connector_put(iter->conn);
>> +               drm_connector_put_safe(iter->conn);
>>         lock_release(&connector_list_iter_dep_map, 0, _RET_IP_);
>>  }
>>  EXPORT_SYMBOL(drm_connector_list_iter_end);
>> diff --git a/drivers/gpu/drm/drm_mode_config.c
>> b/drivers/gpu/drm/drm_mode_config.c
>> index cda8bfa..cc78b3d 100644
>> --- a/drivers/gpu/drm/drm_mode_config.c
>> +++ b/drivers/gpu/drm/drm_mode_config.c
>> @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev)
>>                 drm_connector_put(connector);
>>         }
>>         drm_connector_list_iter_end(&conn_iter);
>> +       /* connector_iter drops references in a work item. */
>> +       flush_scheduled_work();
>>         if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) {
>>                 drm_connector_list_iter_begin(dev, &conn_iter);
>>                 drm_for_each_connector_iter(connector, &conn_iter)
>> diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h
>> index df9807a..a4649c5 100644
>> --- a/include/drm/drm_connector.h
>> +++ b/include/drm/drm_connector.h
>> @@ -916,6 +916,14 @@ struct drm_connector {
>>         uint8_t num_h_tile, num_v_tile;
>>         uint8_t tile_h_loc, tile_v_loc;
>>         uint16_t tile_h_size, tile_v_size;
>> +
>> +       /**
>> +        * @free_work:
>> +        *
>> +        * Work used only by &drm_connector_iter to be able to clean up a
>> +        * connector from any context.
>> +        */
>> +       struct work_struct free_work;
>>  };
>>   #define obj_to_connector(x) container_of(x, struct drm_connector, base)
>> -------------------------------------------------------------------------------
>>
>>
>> Git bisection log:
>>
>> -------------------------------------------------------------------------------
>> git bisect start
>> # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus'
>> of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
>> git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94
>> # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3
>> git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36
>> # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
>> git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d
>> # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2'
>> of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
>> git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d
>> # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag
>> 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux
>> git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12
>> # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag
>> 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel
>> into drm-fixes
>> git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992
>> # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop
>> NONCONTIG flag for buffers allocated without IOMMU
>> git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80
>> # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the
>> shrink request to prevent skip huge pool
>> git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6
>> # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch
>> 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
>> git bisect good db8f884ca7fe6af64d443d1510464efe23826131
>> # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag
>> 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc
>> into drm-fixes
>> git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828
>> # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free
>> connectors from connector_iter
>> git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0
>> # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely
>> free connectors from connector_iter
>> -------------------------------------------------------------------------------
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> --
> To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-11 22:28       ` Javier Martinez Canillas
@ 2017-12-11 22:54         ` Javier Martinez Canillas
  -1 siblings, 0 replies; 35+ messages in thread
From: Javier Martinez Canillas @ 2017-12-11 22:54 UTC (permalink / raw)
  To: Daniel Vetter, Marek Szyprowski, Shuah Khan
  Cc: Guillaume Tucker, Mark Brown, Kevin Hilman, Matt Hart,
	Thierry Escande, Tomeu Vizoso, Enric Balletbo i Serra,
	linux-samsung-soc, kernel-build-reports, linux-arm-kernel

On Mon, Dec 11, 2017 at 11:28 PM, Javier Martinez Canillas
<javier@dowhile0.org> wrote:
> [adding Marek and Shuah to cc list]

[snip]

>>>
>>> Please see below, I've had several bisection results pointing at
>>> that commit over the week-end on mainline but also on linux-next
>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>> and is likely to have more than one issue, it does seem like this
>>> commit is causing some well reproducible kernel hang.
>>>
>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>
>>>   https://lava.collabora.co.uk/scheduler/job/1018478
>>>
>>> and here's another one with the change mentioned below reverted:
>>>
>>>   https://lava.collabora.co.uk/scheduler/job/1018479
>>>
>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>> I don't know if this is related as I haven't investigated any
>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>> boot most of the time with the commit reverted though.
>>>
>>> The automated kernelci.org bisection is still an experimental
>>> tool and it may well be a false positive, so please take this
>>> result with a pinch of salt...
>>
>> The patch just very minimal moves the connector cleanup around (so
>> timing change), but except when you unload a driver (or maybe that
>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>> more info than "seems to hang a bit more" I have no idea what's wrong.
>> The patch itself should work, at least it survived quite some serious
>> testing we do on everything.
>> -Daniel
>>
>
> Marek was pointing to a different culprit [0] in this [1] thread. I
> see that both commits made it to v4.15-rc3, which is the first version
> where boot fails. So maybe is a combination of both? Or rather
> reverting one patch masks the error in the other.
>
> I've access to the machine but unfortunately not a lot of time to dig
> on this, I could try to do it in the weekend though.
>
> [0]: https://patchwork.kernel.org/patch/10067711/
> [1]: https://www.spinics.net/lists/arm-kernel/msg622152.html
>

So I gave a quick look to this, and at the very least there's a bug in
the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
exynos: Add status property to Exynos 542x Mixer nodes").

I've posted a fix for that:

https://patchwork.kernel.org/patch/10105921/

I believe this could be also be the cause for the boot failure, since
I see in the boot log that things start to go wrong after exynos-drm
fails to bind the HDMI component:

[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
0xc1398690): -1

Anyway, I don't have access to the machine now, but it would be nice
if someone test. Or I would do in a few days.

Best regards,
Javier

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-11 22:54         ` Javier Martinez Canillas
  0 siblings, 0 replies; 35+ messages in thread
From: Javier Martinez Canillas @ 2017-12-11 22:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 11, 2017 at 11:28 PM, Javier Martinez Canillas
<javier@dowhile0.org> wrote:
> [adding Marek and Shuah to cc list]

[snip]

>>>
>>> Please see below, I've had several bisection results pointing at
>>> that commit over the week-end on mainline but also on linux-next
>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>> and is likely to have more than one issue, it does seem like this
>>> commit is causing some well reproducible kernel hang.
>>>
>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>
>>>   https://lava.collabora.co.uk/scheduler/job/1018478
>>>
>>> and here's another one with the change mentioned below reverted:
>>>
>>>   https://lava.collabora.co.uk/scheduler/job/1018479
>>>
>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>> I don't know if this is related as I haven't investigated any
>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>> boot most of the time with the commit reverted though.
>>>
>>> The automated kernelci.org bisection is still an experimental
>>> tool and it may well be a false positive, so please take this
>>> result with a pinch of salt...
>>
>> The patch just very minimal moves the connector cleanup around (so
>> timing change), but except when you unload a driver (or maybe that
>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>> more info than "seems to hang a bit more" I have no idea what's wrong.
>> The patch itself should work, at least it survived quite some serious
>> testing we do on everything.
>> -Daniel
>>
>
> Marek was pointing to a different culprit [0] in this [1] thread. I
> see that both commits made it to v4.15-rc3, which is the first version
> where boot fails. So maybe is a combination of both? Or rather
> reverting one patch masks the error in the other.
>
> I've access to the machine but unfortunately not a lot of time to dig
> on this, I could try to do it in the weekend though.
>
> [0]: https://patchwork.kernel.org/patch/10067711/
> [1]: https://www.spinics.net/lists/arm-kernel/msg622152.html
>

So I gave a quick look to this, and at the very least there's a bug in
the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
exynos: Add status property to Exynos 542x Mixer nodes").

I've posted a fix for that:

https://patchwork.kernel.org/patch/10105921/

I believe this could be also be the cause for the boot failure, since
I see in the boot log that things start to go wrong after exynos-drm
fails to bind the HDMI component:

[ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
0xc1398690): -1

Anyway, I don't have access to the machine now, but it would be nice
if someone test. Or I would do in a few days.

Best regards,
Javier

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-11 22:54         ` Javier Martinez Canillas
@ 2017-12-11 22:58           ` Russell King - ARM Linux
  -1 siblings, 0 replies; 35+ messages in thread
From: Russell King - ARM Linux @ 2017-12-11 22:58 UTC (permalink / raw)
  To: Javier Martinez Canillas
  Cc: Daniel Vetter, Marek Szyprowski, Shuah Khan, linux-samsung-soc,
	Tomeu Vizoso, kernel-build-reports, Thierry Escande,
	Kevin Hilman, Enric Balletbo i Serra, linux-arm-kernel

On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
> So I gave a quick look to this, and at the very least there's a bug in
> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
> exynos: Add status property to Exynos 542x Mixer nodes").
> 
> I've posted a fix for that:
> 
> https://patchwork.kernel.org/patch/10105921/
> 
> I believe this could be also be the cause for the boot failure, since
> I see in the boot log that things start to go wrong after exynos-drm
> fails to bind the HDMI component:
> 
> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
> 0xc1398690): -1

Umm, -1 ?  Looking that error code up in
include/uapi/asm-generic/errno-base.h says it's -EPERM.

I suspect that's someone just returning -1 because they're lazy...
which is real bad form and needs fixing.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-11 22:58           ` Russell King - ARM Linux
  0 siblings, 0 replies; 35+ messages in thread
From: Russell King - ARM Linux @ 2017-12-11 22:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
> So I gave a quick look to this, and at the very least there's a bug in
> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
> exynos: Add status property to Exynos 542x Mixer nodes").
> 
> I've posted a fix for that:
> 
> https://patchwork.kernel.org/patch/10105921/
> 
> I believe this could be also be the cause for the boot failure, since
> I see in the boot log that things start to go wrong after exynos-drm
> fails to bind the HDMI component:
> 
> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
> 0xc1398690): -1

Umm, -1 ?  Looking that error code up in
include/uapi/asm-generic/errno-base.h says it's -EPERM.

I suspect that's someone just returning -1 because they're lazy...
which is real bad form and needs fixing.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-11 22:58           ` Russell King - ARM Linux
@ 2017-12-11 23:02             ` Russell King - ARM Linux
  -1 siblings, 0 replies; 35+ messages in thread
From: Russell King - ARM Linux @ 2017-12-11 23:02 UTC (permalink / raw)
  To: Javier Martinez Canillas
  Cc: linux-samsung-soc, Tomeu Vizoso, kernel-build-reports,
	Daniel Vetter, Shuah Khan, Thierry Escande, Kevin Hilman,
	Enric Balletbo i Serra, linux-arm-kernel, Marek Szyprowski

On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
> > So I gave a quick look to this, and at the very least there's a bug in
> > the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
> > exynos: Add status property to Exynos 542x Mixer nodes").
> > 
> > I've posted a fix for that:
> > 
> > https://patchwork.kernel.org/patch/10105921/
> > 
> > I believe this could be also be the cause for the boot failure, since
> > I see in the boot log that things start to go wrong after exynos-drm
> > fails to bind the HDMI component:
> > 
> > [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
> > 0xc1398690): -1
> 
> Umm, -1 ?  Looking that error code up in
> include/uapi/asm-generic/errno-base.h says it's -EPERM.
> 
> I suspect that's someone just returning -1 because they're lazy...
> which is real bad form and needs fixing.

Oh, it really is -EPERM:

struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev,
                                       enum exynos_drm_output_type out_type)
{
        struct drm_crtc *crtc;

        drm_for_each_crtc(crtc, drm_dev)
                if (to_exynos_crtc(crtc)->type == out_type)
                        return to_exynos_crtc(crtc);

        return ERR_PTR(-EPERM);
}

Does "Operation not permitted" really convey the error here?  It doesn't
look like a permission error to me.

Can we please avoid abusing errno codes?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-11 23:02             ` Russell King - ARM Linux
  0 siblings, 0 replies; 35+ messages in thread
From: Russell King - ARM Linux @ 2017-12-11 23:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
> > So I gave a quick look to this, and at the very least there's a bug in
> > the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
> > exynos: Add status property to Exynos 542x Mixer nodes").
> > 
> > I've posted a fix for that:
> > 
> > https://patchwork.kernel.org/patch/10105921/
> > 
> > I believe this could be also be the cause for the boot failure, since
> > I see in the boot log that things start to go wrong after exynos-drm
> > fails to bind the HDMI component:
> > 
> > [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
> > 0xc1398690): -1
> 
> Umm, -1 ?  Looking that error code up in
> include/uapi/asm-generic/errno-base.h says it's -EPERM.
> 
> I suspect that's someone just returning -1 because they're lazy...
> which is real bad form and needs fixing.

Oh, it really is -EPERM:

struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev,
                                       enum exynos_drm_output_type out_type)
{
        struct drm_crtc *crtc;

        drm_for_each_crtc(crtc, drm_dev)
                if (to_exynos_crtc(crtc)->type == out_type)
                        return to_exynos_crtc(crtc);

        return ERR_PTR(-EPERM);
}

Does "Operation not permitted" really convey the error here?  It doesn't
look like a permission error to me.

Can we please avoid abusing errno codes?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-11 23:02             ` Russell King - ARM Linux
@ 2017-12-11 23:25               ` Shuah Khan
  -1 siblings, 0 replies; 35+ messages in thread
From: Shuah Khan @ 2017-12-11 23:25 UTC (permalink / raw)
  To: Russell King - ARM Linux, Javier Martinez Canillas,
	Marek Szyprowski, shuah Khan
  Cc: linux-samsung-soc, Tomeu Vizoso, kernel-build-reports,
	Daniel Vetter, Thierry Escande, Kevin Hilman,
	Enric Balletbo i Serra, linux-arm-kernel

On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
>>> So I gave a quick look to this, and at the very least there's a bug in
>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>
>>> I've posted a fix for that:
>>>
>>> https://patchwork.kernel.org/patch/10105921/
>>>
>>> I believe this could be also be the cause for the boot failure, since
>>> I see in the boot log that things start to go wrong after exynos-drm
>>> fails to bind the HDMI component:
>>>
>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>> 0xc1398690): -1
>>
>> Umm, -1 ?  Looking that error code up in
>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>
>> I suspect that's someone just returning -1 because they're lazy...
>> which is real bad form and needs fixing.
> 
> Oh, it really is -EPERM:
> 
> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev,
>                                        enum exynos_drm_output_type out_type)
> {
>         struct drm_crtc *crtc;
> 
>         drm_for_each_crtc(crtc, drm_dev)
>                 if (to_exynos_crtc(crtc)->type == out_type)
>                         return to_exynos_crtc(crtc);
> 
>         return ERR_PTR(-EPERM);
> }
> 
> Does "Operation not permitted" really convey the error here?  It doesn't
> look like a permission error to me.
> 
> Can we please avoid abusing errno codes?

I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
with top commit g968edbd worked just fine for me last Friday. I ran several
tests and everything checked out except the exynos-gsc lockdep issue I sent
a 4.14 patch for.

However, with 4.15-rc3, dmesg is gets filled with 

[  342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.

I will start bisect and try to isolate the problem. I suspect this is related to dts
changes perhaps? I used to this problem a while back and it has been fixed.

thanks,
-- Shuah 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-11 23:25               ` Shuah Khan
  0 siblings, 0 replies; 35+ messages in thread
From: Shuah Khan @ 2017-12-11 23:25 UTC (permalink / raw)
  To: linux-arm-kernel

On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
>>> So I gave a quick look to this, and at the very least there's a bug in
>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>
>>> I've posted a fix for that:
>>>
>>> https://patchwork.kernel.org/patch/10105921/
>>>
>>> I believe this could be also be the cause for the boot failure, since
>>> I see in the boot log that things start to go wrong after exynos-drm
>>> fails to bind the HDMI component:
>>>
>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>> 0xc1398690): -1
>>
>> Umm, -1 ?  Looking that error code up in
>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>
>> I suspect that's someone just returning -1 because they're lazy...
>> which is real bad form and needs fixing.
> 
> Oh, it really is -EPERM:
> 
> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev,
>                                        enum exynos_drm_output_type out_type)
> {
>         struct drm_crtc *crtc;
> 
>         drm_for_each_crtc(crtc, drm_dev)
>                 if (to_exynos_crtc(crtc)->type == out_type)
>                         return to_exynos_crtc(crtc);
> 
>         return ERR_PTR(-EPERM);
> }
> 
> Does "Operation not permitted" really convey the error here?  It doesn't
> look like a permission error to me.
> 
> Can we please avoid abusing errno codes?

I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
with top commit g968edbd worked just fine for me last Friday. I ran several
tests and everything checked out except the exynos-gsc lockdep issue I sent
a 4.14 patch for.

However, with 4.15-rc3, dmesg is gets filled with 

[  342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
[  402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer

Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.

I will start bisect and try to isolate the problem. I suspect this is related to dts
changes perhaps? I used to this problem a while back and it has been fixed.

thanks,
-- Shuah 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-11 23:25               ` Shuah Khan
@ 2017-12-12  7:54                 ` Marek Szyprowski
  -1 siblings, 0 replies; 35+ messages in thread
From: Marek Szyprowski @ 2017-12-12  7:54 UTC (permalink / raw)
  To: Shuah Khan, Russell King - ARM Linux, Javier Martinez Canillas
  Cc: linux-samsung-soc, Tomeu Vizoso, kernel-build-reports,
	Daniel Vetter, Thierry Escande, Kevin Hilman,
	Enric Balletbo i Serra, linux-arm-kernel

Hi Shuah,

On 2017-12-12 00:25, Shuah Khan wrote:
> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>
>>>> I've posted a fix for that:
>>>>
>>>> https://patchwork.kernel.org/patch/10105921/
>>>>
>>>> I believe this could be also be the cause for the boot failure, since
>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>> fails to bind the HDMI component:
>>>>
>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>> 0xc1398690): -1
>>> Umm, -1 ?  Looking that error code up in
>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>
>>> I suspect that's someone just returning -1 because they're lazy...
>>> which is real bad form and needs fixing.
>> Oh, it really is -EPERM:
>>
>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev,
>>                                         enum exynos_drm_output_type out_type)
>> {
>>          struct drm_crtc *crtc;
>>
>>          drm_for_each_crtc(crtc, drm_dev)
>>                  if (to_exynos_crtc(crtc)->type == out_type)
>>                          return to_exynos_crtc(crtc);
>>
>>          return ERR_PTR(-EPERM);
>> }
>>
>> Does "Operation not permitted" really convey the error here?  It doesn't
>> look like a permission error to me.
>>
>> Can we please avoid abusing errno codes?
> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
> with top commit g968edbd worked just fine for me last Friday. I ran several
> tests and everything checked out except the exynos-gsc lockdep issue I sent
> a 4.14 patch for.
>
> However, with 4.15-rc3, dmesg is gets filled with
>
> [  342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
>
> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>
> I will start bisect and try to isolate the problem. I suspect this is related to dts
> changes perhaps? I used to this problem a while back and it has been fixed.

This warning has been added intentionally, see following discussions:
https://patchwork.kernel.org/patch/10034919/
https://patchwork.kernel.org/patch/10070475/

This means that your test apps should be updated or you should enable Exynos
IOMMU support in your config. Maybe it is a good time to finally enable it
in exynos_defconfig.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12  7:54                 ` Marek Szyprowski
  0 siblings, 0 replies; 35+ messages in thread
From: Marek Szyprowski @ 2017-12-12  7:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Shuah,

On 2017-12-12 00:25, Shuah Khan wrote:
> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote:
>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>
>>>> I've posted a fix for that:
>>>>
>>>> https://patchwork.kernel.org/patch/10105921/
>>>>
>>>> I believe this could be also be the cause for the boot failure, since
>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>> fails to bind the HDMI component:
>>>>
>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>> 0xc1398690): -1
>>> Umm, -1 ?  Looking that error code up in
>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>
>>> I suspect that's someone just returning -1 because they're lazy...
>>> which is real bad form and needs fixing.
>> Oh, it really is -EPERM:
>>
>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev,
>>                                         enum exynos_drm_output_type out_type)
>> {
>>          struct drm_crtc *crtc;
>>
>>          drm_for_each_crtc(crtc, drm_dev)
>>                  if (to_exynos_crtc(crtc)->type == out_type)
>>                          return to_exynos_crtc(crtc);
>>
>>          return ERR_PTR(-EPERM);
>> }
>>
>> Does "Operation not permitted" really convey the error here?  It doesn't
>> look like a permission error to me.
>>
>> Can we please avoid abusing errno codes?
> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
> with top commit g968edbd worked just fine for me last Friday. I ran several
> tests and everything checked out except the exynos-gsc lockdep issue I sent
> a 4.14 patch for.
>
> However, with 4.15-rc3, dmesg is gets filled with
>
> [  342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
> [  402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer
>
> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>
> I will start bisect and try to isolate the problem. I suspect this is related to dts
> changes perhaps? I used to this problem a while back and it has been fixed.

This warning has been added intentionally, see following discussions:
https://patchwork.kernel.org/patch/10034919/
https://patchwork.kernel.org/patch/10070475/

This means that your test apps should be updated or you should enable Exynos
IOMMU support in your config. Maybe it is a good time to finally enable it
in exynos_defconfig.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-12  7:54                 ` Marek Szyprowski
@ 2017-12-12  8:00                   ` Javier Martinez Canillas
  -1 siblings, 0 replies; 35+ messages in thread
From: Javier Martinez Canillas @ 2017-12-12  8:00 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Shuah Khan, Russell King - ARM Linux, linux-samsung-soc,
	Tomeu Vizoso, kernel-build-reports, Daniel Vetter,
	Thierry Escande, Kevin Hilman, Enric Balletbo i Serra,
	linux-arm-kernel

Hello Marek,

On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hi Shuah,
>
>
> On 2017-12-12 00:25, Shuah Khan wrote:
>>
>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>>>
>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>>>
>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas
>>>> wrote:
>>>>>
>>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>>
>>>>> I've posted a fix for that:
>>>>>
>>>>> https://patchwork.kernel.org/patch/10105921/
>>>>>
>>>>> I believe this could be also be the cause for the boot failure, since
>>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>>> fails to bind the HDMI component:
>>>>>
>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>>> 0xc1398690): -1
>>>>
>>>> Umm, -1 ?  Looking that error code up in
>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>>
>>>> I suspect that's someone just returning -1 because they're lazy...
>>>> which is real bad form and needs fixing.
>>>
>>> Oh, it really is -EPERM:
>>>
>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device
>>> *drm_dev,
>>>                                         enum exynos_drm_output_type
>>> out_type)
>>> {
>>>          struct drm_crtc *crtc;
>>>
>>>          drm_for_each_crtc(crtc, drm_dev)
>>>                  if (to_exynos_crtc(crtc)->type == out_type)
>>>                          return to_exynos_crtc(crtc);
>>>
>>>          return ERR_PTR(-EPERM);
>>> }
>>>
>>> Does "Operation not permitted" really convey the error here?  It doesn't
>>> look like a permission error to me.
>>>
>>> Can we please avoid abusing errno codes?
>>
>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
>> with top commit g968edbd worked just fine for me last Friday. I ran
>> several
>> tests and everything checked out except the exynos-gsc lockdep issue I
>> sent
>> a 4.14 patch for.
>>
>> However, with 4.15-rc3, dmesg is gets filled with
>>
>> [  342.337181] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  342.337470] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  342.337851] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.382346] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.396682] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.399244] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.399496] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.399848] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.400163] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.400495] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.401294] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.401595] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>>
>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>>
>> I will start bisect and try to isolate the problem. I suspect this is
>> related to dts
>> changes perhaps? I used to this problem a while back and it has been
>> fixed.
>
>
> This warning has been added intentionally, see following discussions:
> https://patchwork.kernel.org/patch/10034919/
> https://patchwork.kernel.org/patch/10070475/
>
> This means that your test apps should be updated or you should enable Exynos
> IOMMU support in your config. Maybe it is a good time to finally enable it
> in exynos_defconfig.
>

Has the issue that the boot-loader keeps the display controller
enabled and scanning pages on the Exynos Chromebooks resolved?

I think that's that preventing to enable it by default in
exynos_defconfig since it caused boot failures when enabled on these
machines. I don't follow exynos development too closely nowadays so
maybe there's a fix in place now.

> Best regards
> --
> Marek Szyprowski, PhD
> Samsung R&D Institute Poland

Best regards,
Javier

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12  8:00                   ` Javier Martinez Canillas
  0 siblings, 0 replies; 35+ messages in thread
From: Javier Martinez Canillas @ 2017-12-12  8:00 UTC (permalink / raw)
  To: linux-arm-kernel

Hello Marek,

On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hi Shuah,
>
>
> On 2017-12-12 00:25, Shuah Khan wrote:
>>
>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>>>
>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>>>
>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas
>>>> wrote:
>>>>>
>>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>>
>>>>> I've posted a fix for that:
>>>>>
>>>>> https://patchwork.kernel.org/patch/10105921/
>>>>>
>>>>> I believe this could be also be the cause for the boot failure, since
>>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>>> fails to bind the HDMI component:
>>>>>
>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>>> 0xc1398690): -1
>>>>
>>>> Umm, -1 ?  Looking that error code up in
>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>>
>>>> I suspect that's someone just returning -1 because they're lazy...
>>>> which is real bad form and needs fixing.
>>>
>>> Oh, it really is -EPERM:
>>>
>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device
>>> *drm_dev,
>>>                                         enum exynos_drm_output_type
>>> out_type)
>>> {
>>>          struct drm_crtc *crtc;
>>>
>>>          drm_for_each_crtc(crtc, drm_dev)
>>>                  if (to_exynos_crtc(crtc)->type == out_type)
>>>                          return to_exynos_crtc(crtc);
>>>
>>>          return ERR_PTR(-EPERM);
>>> }
>>>
>>> Does "Operation not permitted" really convey the error here?  It doesn't
>>> look like a permission error to me.
>>>
>>> Can we please avoid abusing errno codes?
>>
>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
>> with top commit g968edbd worked just fine for me last Friday. I ran
>> several
>> tests and everything checked out except the exynos-gsc lockdep issue I
>> sent
>> a 4.14 patch for.
>>
>> However, with 4.15-rc3, dmesg is gets filled with
>>
>> [  342.337181] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  342.337470] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  342.337851] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.382346] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.396682] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.399244] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.399496] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.399848] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.400163] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.400495] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.401294] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>> [  402.401595] [drm] Non-contiguous allocation is not supported without
>> IOMMU, falling back to contiguous buffer
>>
>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>>
>> I will start bisect and try to isolate the problem. I suspect this is
>> related to dts
>> changes perhaps? I used to this problem a while back and it has been
>> fixed.
>
>
> This warning has been added intentionally, see following discussions:
> https://patchwork.kernel.org/patch/10034919/
> https://patchwork.kernel.org/patch/10070475/
>
> This means that your test apps should be updated or you should enable Exynos
> IOMMU support in your config. Maybe it is a good time to finally enable it
> in exynos_defconfig.
>

Has the issue that the boot-loader keeps the display controller
enabled and scanning pages on the Exynos Chromebooks resolved?

I think that's that preventing to enable it by default in
exynos_defconfig since it caused boot failures when enabled on these
machines. I don't follow exynos development too closely nowadays so
maybe there's a fix in place now.

> Best regards
> --
> Marek Szyprowski, PhD
> Samsung R&D Institute Poland

Best regards,
Javier

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-12  8:00                   ` Javier Martinez Canillas
@ 2017-12-12  8:47                     ` Marek Szyprowski
  -1 siblings, 0 replies; 35+ messages in thread
From: Marek Szyprowski @ 2017-12-12  8:47 UTC (permalink / raw)
  To: Javier Martinez Canillas
  Cc: Shuah Khan, Russell King - ARM Linux, linux-samsung-soc,
	Tomeu Vizoso, kernel-build-reports, Daniel Vetter,
	Thierry Escande, Kevin Hilman, Enric Balletbo i Serra,
	linux-arm-kernel

Hi Javier,

On 2017-12-12 09:00, Javier Martinez Canillas wrote:
> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> On 2017-12-12 00:25, Shuah Khan wrote:
>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas
>>>>> wrote:
>>>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>>>
>>>>>> I've posted a fix for that:
>>>>>>
>>>>>> https://patchwork.kernel.org/patch/10105921/
>>>>>>
>>>>>> I believe this could be also be the cause for the boot failure, since
>>>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>>>> fails to bind the HDMI component:
>>>>>>
>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>>>> 0xc1398690): -1
>>>>> Umm, -1 ?  Looking that error code up in
>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>>>
>>>>> I suspect that's someone just returning -1 because they're lazy...
>>>>> which is real bad form and needs fixing.
>>>> Oh, it really is -EPERM:
>>>>
>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device
>>>> *drm_dev,
>>>>                                          enum exynos_drm_output_type
>>>> out_type)
>>>> {
>>>>           struct drm_crtc *crtc;
>>>>
>>>>           drm_for_each_crtc(crtc, drm_dev)
>>>>                   if (to_exynos_crtc(crtc)->type == out_type)
>>>>                           return to_exynos_crtc(crtc);
>>>>
>>>>           return ERR_PTR(-EPERM);
>>>> }
>>>>
>>>> Does "Operation not permitted" really convey the error here?  It doesn't
>>>> look like a permission error to me.
>>>>
>>>> Can we please avoid abusing errno codes?
>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
>>> with top commit g968edbd worked just fine for me last Friday. I ran
>>> several
>>> tests and everything checked out except the exynos-gsc lockdep issue I
>>> sent
>>> a 4.14 patch for.
>>>
>>> However, with 4.15-rc3, dmesg is gets filled with
>>>
>>> [  342.337181] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  342.337470] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  342.337851] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.382346] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.396682] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.399244] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.399496] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.399848] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.400163] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.400495] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.401294] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.401595] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>>
>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>>>
>>> I will start bisect and try to isolate the problem. I suspect this is
>>> related to dts
>>> changes perhaps? I used to this problem a while back and it has been
>>> fixed.
>>
>> This warning has been added intentionally, see following discussions:
>> https://patchwork.kernel.org/patch/10034919/
>> https://patchwork.kernel.org/patch/10070475/
>>
>> This means that your test apps should be updated or you should enable Exynos
>> IOMMU support in your config. Maybe it is a good time to finally enable it
>> in exynos_defconfig.
>>
> Has the issue that the boot-loader keeps the display controller
> enabled and scanning pages on the Exynos Chromebooks resolved?
>
> I think that's that preventing to enable it by default in
> exynos_defconfig since it caused boot failures when enabled on these
> machines. I don't follow exynos development too closely nowadays so
> maybe there's a fix in place now.

Not directly. I still didn't find time to properly add support for
devices, which were left in-working state (with active DMA
transactions) by bootloader, but due to some other changes in the
order of operations during boot process, power domains are
initialized very early and due to temporary lack of devices (which
are not yet added to the system), are turned off. This practically
stops FIMD for scanning framebuffer and "solves" this issue.

I've checked now and Exynos Snow Chromebook boots fine with IOMMU
support enabled, both with v4.15-rc3 and linux-next.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12  8:47                     ` Marek Szyprowski
  0 siblings, 0 replies; 35+ messages in thread
From: Marek Szyprowski @ 2017-12-12  8:47 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Javier,

On 2017-12-12 09:00, Javier Martinez Canillas wrote:
> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> On 2017-12-12 00:25, Shuah Khan wrote:
>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas
>>>>> wrote:
>>>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>>>
>>>>>> I've posted a fix for that:
>>>>>>
>>>>>> https://patchwork.kernel.org/patch/10105921/
>>>>>>
>>>>>> I believe this could be also be the cause for the boot failure, since
>>>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>>>> fails to bind the HDMI component:
>>>>>>
>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>>>> 0xc1398690): -1
>>>>> Umm, -1 ?  Looking that error code up in
>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>>>
>>>>> I suspect that's someone just returning -1 because they're lazy...
>>>>> which is real bad form and needs fixing.
>>>> Oh, it really is -EPERM:
>>>>
>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device
>>>> *drm_dev,
>>>>                                          enum exynos_drm_output_type
>>>> out_type)
>>>> {
>>>>           struct drm_crtc *crtc;
>>>>
>>>>           drm_for_each_crtc(crtc, drm_dev)
>>>>                   if (to_exynos_crtc(crtc)->type == out_type)
>>>>                           return to_exynos_crtc(crtc);
>>>>
>>>>           return ERR_PTR(-EPERM);
>>>> }
>>>>
>>>> Does "Operation not permitted" really convey the error here?  It doesn't
>>>> look like a permission error to me.
>>>>
>>>> Can we please avoid abusing errno codes?
>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
>>> with top commit g968edbd worked just fine for me last Friday. I ran
>>> several
>>> tests and everything checked out except the exynos-gsc lockdep issue I
>>> sent
>>> a 4.14 patch for.
>>>
>>> However, with 4.15-rc3, dmesg is gets filled with
>>>
>>> [  342.337181] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  342.337470] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  342.337851] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.382346] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.396682] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.399244] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.399496] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.399848] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.400163] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.400495] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.401294] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>> [  402.401595] [drm] Non-contiguous allocation is not supported without
>>> IOMMU, falling back to contiguous buffer
>>>
>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>>>
>>> I will start bisect and try to isolate the problem. I suspect this is
>>> related to dts
>>> changes perhaps? I used to this problem a while back and it has been
>>> fixed.
>>
>> This warning has been added intentionally, see following discussions:
>> https://patchwork.kernel.org/patch/10034919/
>> https://patchwork.kernel.org/patch/10070475/
>>
>> This means that your test apps should be updated or you should enable Exynos
>> IOMMU support in your config. Maybe it is a good time to finally enable it
>> in exynos_defconfig.
>>
> Has the issue that the boot-loader keeps the display controller
> enabled and scanning pages on the Exynos Chromebooks resolved?
>
> I think that's that preventing to enable it by default in
> exynos_defconfig since it caused boot failures when enabled on these
> machines. I don't follow exynos development too closely nowadays so
> maybe there's a fix in place now.

Not directly. I still didn't find time to properly add support for
devices, which were left in-working state (with active DMA
transactions) by bootloader, but due to some other changes in the
order of operations during boot process, power domains are
initialized very early and due to temporary lack of devices (which
are not yet added to the system), are turned off. This practically
stops FIMD for scanning framebuffer and "solves" this issue.

I've checked now and Exynos Snow Chromebook boots fine with IOMMU
support enabled, both with v4.15-rc3 and linux-next.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-12  8:47                     ` Marek Szyprowski
@ 2017-12-12 10:10                       ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 35+ messages in thread
From: Krzysztof Kozlowski @ 2017-12-12 10:10 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Javier Martinez Canillas, Shuah Khan, Russell King - ARM Linux,
	linux-samsung-soc, Tomeu Vizoso, kernel-build-reports,
	Daniel Vetter, Thierry Escande, Kevin Hilman,
	Enric Balletbo i Serra, linux-arm-kernel

On Tue, Dec 12, 2017 at 9:47 AM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hi Javier,
>
> On 2017-12-12 09:00, Javier Martinez Canillas wrote:
>>
>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
>> <m.szyprowski@samsung.com> wrote:

(...)

>>> This warning has been added intentionally, see following discussions:
>>> https://patchwork.kernel.org/patch/10034919/
>>> https://patchwork.kernel.org/patch/10070475/
>>>
>>> This means that your test apps should be updated or you should enable
>>> Exynos
>>> IOMMU support in your config. Maybe it is a good time to finally enable
>>> it
>>> in exynos_defconfig.
>>>
>> Has the issue that the boot-loader keeps the display controller
>> enabled and scanning pages on the Exynos Chromebooks resolved?
>>
>> I think that's that preventing to enable it by default in
>> exynos_defconfig since it caused boot failures when enabled on these
>> machines. I don't follow exynos development too closely nowadays so
>> maybe there's a fix in place now.
>
>
> Not directly. I still didn't find time to properly add support for
> devices, which were left in-working state (with active DMA
> transactions) by bootloader, but due to some other changes in the
> order of operations during boot process, power domains are
> initialized very early and due to temporary lack of devices (which
> are not yet added to the system), are turned off. This practically
> stops FIMD for scanning framebuffer and "solves" this issue.
>
> I've checked now and Exynos Snow Chromebook boots fine with IOMMU
> support enabled, both with v4.15-rc3 and linux-next.

Then it looks like we could give EXYNOS_IOMMU a try. At least only on
exynos_defconfig which would leave multi_v7 as a platform to compare.

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12 10:10                       ` Krzysztof Kozlowski
  0 siblings, 0 replies; 35+ messages in thread
From: Krzysztof Kozlowski @ 2017-12-12 10:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 12, 2017 at 9:47 AM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hi Javier,
>
> On 2017-12-12 09:00, Javier Martinez Canillas wrote:
>>
>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
>> <m.szyprowski@samsung.com> wrote:

(...)

>>> This warning has been added intentionally, see following discussions:
>>> https://patchwork.kernel.org/patch/10034919/
>>> https://patchwork.kernel.org/patch/10070475/
>>>
>>> This means that your test apps should be updated or you should enable
>>> Exynos
>>> IOMMU support in your config. Maybe it is a good time to finally enable
>>> it
>>> in exynos_defconfig.
>>>
>> Has the issue that the boot-loader keeps the display controller
>> enabled and scanning pages on the Exynos Chromebooks resolved?
>>
>> I think that's that preventing to enable it by default in
>> exynos_defconfig since it caused boot failures when enabled on these
>> machines. I don't follow exynos development too closely nowadays so
>> maybe there's a fix in place now.
>
>
> Not directly. I still didn't find time to properly add support for
> devices, which were left in-working state (with active DMA
> transactions) by bootloader, but due to some other changes in the
> order of operations during boot process, power domains are
> initialized very early and due to temporary lack of devices (which
> are not yet added to the system), are turned off. This practically
> stops FIMD for scanning framebuffer and "solves" this issue.
>
> I've checked now and Exynos Snow Chromebook boots fine with IOMMU
> support enabled, both with v4.15-rc3 and linux-next.

Then it looks like we could give EXYNOS_IOMMU a try. At least only on
exynos_defconfig which would leave multi_v7 as a platform to compare.

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-11 22:28       ` Javier Martinez Canillas
@ 2017-12-12 11:38         ` Marek Szyprowski
  -1 siblings, 0 replies; 35+ messages in thread
From: Marek Szyprowski @ 2017-12-12 11:38 UTC (permalink / raw)
  To: Javier Martinez Canillas, Daniel Vetter, Shuah Khan
  Cc: Guillaume Tucker, Mark Brown, Kevin Hilman, Matt Hart,
	Thierry Escande, Tomeu Vizoso, Enric Balletbo i Serra,
	linux-samsung-soc, kernel-build-reports, linux-arm-kernel

Hi All,

On 2017-12-11 23:28, Javier Martinez Canillas wrote:
> [adding Marek and Shuah to cc list]
>
> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
>> <guillaume.tucker@collabora.com> wrote:
>>> Hi Daniel,
>>>
>>> Please see below, I've had several bisection results pointing at
>>> that commit over the week-end on mainline but also on linux-next
>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>> and is likely to have more than one issue, it does seem like this
>>> commit is causing some well reproducible kernel hang.
>>>
>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>
>>>    https://lava.collabora.co.uk/scheduler/job/1018478
>>>
>>> and here's another one with the change mentioned below reverted:
>>>
>>>    https://lava.collabora.co.uk/scheduler/job/1018479
>>>
>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>> I don't know if this is related as I haven't investigated any
>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>> boot most of the time with the commit reverted though.
>>>
>>> The automated kernelci.org bisection is still an experimental
>>> tool and it may well be a false positive, so please take this
>>> result with a pinch of salt...
>> The patch just very minimal moves the connector cleanup around (so
>> timing change), but except when you unload a driver (or maybe that
>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>> more info than "seems to hang a bit more" I have no idea what's wrong.
>> The patch itself should work, at least it survived quite some serious
>> testing we do on everything.
>> -Daniel
>>
> Marek was pointing to a different culprit [0] in this [1] thread. I
> see that both commits made it to v4.15-rc3, which is the first version
> where boot fails. So maybe is a combination of both? Or rather
> reverting one patch masks the error in the other.
>
> I've access to the machine but unfortunately not a lot of time to dig
> on this, I could try to do it in the weekend though.

After a recent discussion on the Javier's patch:
https://patchwork.kernel.org/patch/10106417/
I've managed to reproduce this issue also on Exynos5250 based Samsung
Snow Chromebook and investigate a bit.

It is caused by a deadlock in the main kernel workqueue. Here are details:

1. Exynos DRM fails to initialize due to missing regulators and gets moved
to deferred probe device list

2. Deferred probe is triggered and kernel "events" workqueue calls
deferred_probe_work_func()

3. exynos_drm_bind() is called, component_bind_all() fails due to missing
Exynos Mixer device

4. error handling path is executed in exynos_drm_bind(), which calls
drm_mode_config_cleanup()

5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
deadlock.

Do You have idea how to fix this issue properly?

Taking a look at git blame, this indeed shows that the issue has been
introduced by the commit a703c55004e1 ("drm: safely free connectors from
connector_ite"), which added a call to flush_scheduled_work() in
drm_mode_config_cleanup().

drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
called from the workqueue, but I don't have idea how to check that. The
other way of fixing it would be to resurrect separate workqueue for DRM
related events.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12 11:38         ` Marek Szyprowski
  0 siblings, 0 replies; 35+ messages in thread
From: Marek Szyprowski @ 2017-12-12 11:38 UTC (permalink / raw)
  To: linux-arm-kernel

Hi All,

On 2017-12-11 23:28, Javier Martinez Canillas wrote:
> [adding Marek and Shuah to cc list]
>
> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
>> <guillaume.tucker@collabora.com> wrote:
>>> Hi Daniel,
>>>
>>> Please see below, I've had several bisection results pointing at
>>> that commit over the week-end on mainline but also on linux-next
>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>> and is likely to have more than one issue, it does seem like this
>>> commit is causing some well reproducible kernel hang.
>>>
>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>
>>>    https://lava.collabora.co.uk/scheduler/job/1018478
>>>
>>> and here's another one with the change mentioned below reverted:
>>>
>>>    https://lava.collabora.co.uk/scheduler/job/1018479
>>>
>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>> I don't know if this is related as I haven't investigated any
>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>> boot most of the time with the commit reverted though.
>>>
>>> The automated kernelci.org bisection is still an experimental
>>> tool and it may well be a false positive, so please take this
>>> result with a pinch of salt...
>> The patch just very minimal moves the connector cleanup around (so
>> timing change), but except when you unload a driver (or maybe that
>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>> more info than "seems to hang a bit more" I have no idea what's wrong.
>> The patch itself should work, at least it survived quite some serious
>> testing we do on everything.
>> -Daniel
>>
> Marek was pointing to a different culprit [0] in this [1] thread. I
> see that both commits made it to v4.15-rc3, which is the first version
> where boot fails. So maybe is a combination of both? Or rather
> reverting one patch masks the error in the other.
>
> I've access to the machine but unfortunately not a lot of time to dig
> on this, I could try to do it in the weekend though.

After a recent discussion on the Javier's patch:
https://patchwork.kernel.org/patch/10106417/
I've managed to reproduce this issue also on Exynos5250 based Samsung
Snow Chromebook and investigate a bit.

It is caused by a deadlock in the main kernel workqueue. Here are details:

1. Exynos DRM fails to initialize due to missing regulators and gets moved
to deferred probe device list

2. Deferred probe is triggered and kernel "events" workqueue calls
deferred_probe_work_func()

3. exynos_drm_bind() is called, component_bind_all() fails due to missing
Exynos Mixer device

4. error handling path is executed in exynos_drm_bind(), which calls
drm_mode_config_cleanup()

5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
deadlock.

Do You have idea how to fix this issue properly?

Taking a look at git blame, this indeed shows that the issue has been
introduced by the commit a703c55004e1 ("drm: safely free connectors from
connector_ite"), which added a call to flush_scheduled_work() in
drm_mode_config_cleanup().

drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
called from the workqueue, but I don't have idea how to check that. The
other way of fixing it would be to resurrect separate workqueue for DRM
related events.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-12 11:38         ` Marek Szyprowski
@ 2017-12-12 14:39           ` Shuah Khan
  -1 siblings, 0 replies; 35+ messages in thread
From: Shuah Khan @ 2017-12-12 14:39 UTC (permalink / raw)
  To: Marek Szyprowski, Javier Martinez Canillas, Daniel Vetter
  Cc: Guillaume Tucker, Mark Brown, Kevin Hilman, Matt Hart,
	Thierry Escande, Tomeu Vizoso, Enric Balletbo i Serra,
	linux-samsung-soc, kernel-build-reports, linux-arm-kernel,
	Shuah Khan

Hi Marek,

On 12/12/2017 04:38 AM, Marek Szyprowski wrote:
> Hi All,
> 
> On 2017-12-11 23:28, Javier Martinez Canillas wrote:
>> [adding Marek and Shuah to cc list]
>>
>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
>>> <guillaume.tucker@collabora.com> wrote:
>>>> Hi Daniel,
>>>>
>>>> Please see below, I've had several bisection results pointing at
>>>> that commit over the week-end on mainline but also on linux-next
>>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>>> and is likely to have more than one issue, it does seem like this
>>>> commit is causing some well reproducible kernel hang.
>>>>
>>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>>
>>>>    https://lava.collabora.co.uk/scheduler/job/1018478
>>>>
>>>> and here's another one with the change mentioned below reverted:
>>>>
>>>>    https://lava.collabora.co.uk/scheduler/job/1018479
>>>>
>>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>>> I don't know if this is related as I haven't investigated any
>>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>>> boot most of the time with the commit reverted though.
>>>>
>>>> The automated kernelci.org bisection is still an experimental
>>>> tool and it may well be a false positive, so please take this
>>>> result with a pinch of salt...
>>> The patch just very minimal moves the connector cleanup around (so
>>> timing change), but except when you unload a driver (or maybe that
>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>>> more info than "seems to hang a bit more" I have no idea what's wrong.
>>> The patch itself should work, at least it survived quite some serious
>>> testing we do on everything.
>>> -Daniel
>>>
>> Marek was pointing to a different culprit [0] in this [1] thread. I
>> see that both commits made it to v4.15-rc3, which is the first version
>> where boot fails. So maybe is a combination of both? Or rather
>> reverting one patch masks the error in the other.
>>
>> I've access to the machine but unfortunately not a lot of time to dig
>> on this, I could try to do it in the weekend though.
> 
> After a recent discussion on the Javier's patch:
> https://patchwork.kernel.org/patch/10106417/
> I've managed to reproduce this issue also on Exynos5250 based Samsung
> Snow Chromebook and investigate a bit.
> 
> It is caused by a deadlock in the main kernel workqueue. Here are details:
> 
> 1. Exynos DRM fails to initialize due to missing regulators and gets moved
> to deferred probe device list
> 
> 2. Deferred probe is triggered and kernel "events" workqueue calls
> deferred_probe_work_func()
> 
> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing
> Exynos Mixer device
> 
> 4. error handling path is executed in exynos_drm_bind(), which calls
> drm_mode_config_cleanup()
> 
> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
> deadlock.
> 
> Do You have idea how to fix this issue properly?
> 
> Taking a look at git blame, this indeed shows that the issue has been
> introduced by the commit a703c55004e1 ("drm: safely free connectors from
> connector_ite"), which added a call to flush_scheduled_work() in
> drm_mode_config_cleanup().

This commit is making its way into stable releases. It has been added to
4.14-6 stable. If this patch poses problems, maybe somebody should comment
on the stable release thread.

> 
> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
> called from the workqueue, but I don't have idea how to check that. The
> other way of fixing it would be to resurrect separate workqueue for DRM
> related events.
> 

Especially since there is no solution :)

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12 14:39           ` Shuah Khan
  0 siblings, 0 replies; 35+ messages in thread
From: Shuah Khan @ 2017-12-12 14:39 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Marek,

On 12/12/2017 04:38 AM, Marek Szyprowski wrote:
> Hi All,
> 
> On 2017-12-11 23:28, Javier Martinez Canillas wrote:
>> [adding Marek and Shuah to cc list]
>>
>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
>>> <guillaume.tucker@collabora.com> wrote:
>>>> Hi Daniel,
>>>>
>>>> Please see below, I've had several bisection results pointing at
>>>> that commit over the week-end on mainline but also on linux-next
>>>> and net-next.? While the peach-pi is a bit flaky at the moment
>>>> and is likely to have more than one issue, it does seem like this
>>>> commit is causing some well reproducible kernel hang.
>>>>
>>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>>
>>>> ?? https://lava.collabora.co.uk/scheduler/job/1018478
>>>>
>>>> and here's another one with the change mentioned below reverted:
>>>>
>>>> ?? https://lava.collabora.co.uk/scheduler/job/1018479
>>>>
>>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>>> I don't know if this is related as I haven't investigated any
>>>> further.? It does appear to reliably hang with v4.15-rc3 and
>>>> boot most of the time with the commit reverted though.
>>>>
>>>> The automated kernelci.org bisection is still an experimental
>>>> tool and it may well be a false positive, so please take this
>>>> result with a pinch of salt...
>>> The patch just very minimal moves the connector cleanup around (so
>>> timing change), but except when you unload a driver (or maybe that
>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>>> more info than "seems to hang a bit more" I have no idea what's wrong.
>>> The patch itself should work, at least it survived quite some serious
>>> testing we do on everything.
>>> -Daniel
>>>
>> Marek was pointing to a different culprit [0] in this [1] thread. I
>> see that both commits made it to v4.15-rc3, which is the first version
>> where boot fails. So maybe is a combination of both? Or rather
>> reverting one patch masks the error in the other.
>>
>> I've access to the machine but unfortunately not a lot of time to dig
>> on this, I could try to do it in the weekend though.
> 
> After a recent discussion on the Javier's patch:
> https://patchwork.kernel.org/patch/10106417/
> I've managed to reproduce this issue also on Exynos5250 based Samsung
> Snow Chromebook and investigate a bit.
> 
> It is caused by a deadlock in the main kernel workqueue. Here are details:
> 
> 1. Exynos DRM fails to initialize due to missing regulators and gets moved
> to deferred probe device list
> 
> 2. Deferred probe is triggered and kernel "events" workqueue calls
> deferred_probe_work_func()
> 
> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing
> Exynos Mixer device
> 
> 4. error handling path is executed in exynos_drm_bind(), which calls
> drm_mode_config_cleanup()
> 
> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
> deadlock.
> 
> Do You have idea how to fix this issue properly?
> 
> Taking a look at git blame, this indeed shows that the issue has been
> introduced by the commit a703c55004e1 ("drm: safely free connectors from
> connector_ite"), which added a call to flush_scheduled_work() in
> drm_mode_config_cleanup().

This commit is making its way into stable releases. It has been added to
4.14-6 stable. If this patch poses problems, maybe somebody should comment
on the stable release thread.

> 
> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
> called from the workqueue, but I don't have idea how to check that. The
> other way of fixing it would be to resurrect separate workqueue for DRM
> related events.
> 

Especially since there is no solution :)

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-12  8:47                     ` Marek Szyprowski
@ 2017-12-12 14:47                       ` Shuah Khan
  -1 siblings, 0 replies; 35+ messages in thread
From: Shuah Khan @ 2017-12-12 14:47 UTC (permalink / raw)
  To: Marek Szyprowski, Javier Martinez Canillas
  Cc: Russell King - ARM Linux, linux-samsung-soc, Tomeu Vizoso,
	kernel-build-reports, Daniel Vetter, Thierry Escande,
	Kevin Hilman, Enric Balletbo i Serra, linux-arm-kernel,
	Shuah Khan

On 12/12/2017 01:47 AM, Marek Szyprowski wrote:
> Hi Javier,
> 
> On 2017-12-12 09:00, Javier Martinez Canillas wrote:
>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
>> <m.szyprowski@samsung.com> wrote:
>>> On 2017-12-12 00:25, Shuah Khan wrote:
>>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas
>>>>>> wrote:
>>>>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>>>>
>>>>>>> I've posted a fix for that:
>>>>>>>
>>>>>>> https://patchwork.kernel.org/patch/10105921/
>>>>>>>
>>>>>>> I believe this could be also be the cause for the boot failure, since
>>>>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>>>>> fails to bind the HDMI component:
>>>>>>>
>>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>>>>> 0xc1398690): -1
>>>>>> Umm, -1 ?  Looking that error code up in
>>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>>>>
>>>>>> I suspect that's someone just returning -1 because they're lazy...
>>>>>> which is real bad form and needs fixing.
>>>>> Oh, it really is -EPERM:
>>>>>
>>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device
>>>>> *drm_dev,
>>>>>                                          enum exynos_drm_output_type
>>>>> out_type)
>>>>> {
>>>>>           struct drm_crtc *crtc;
>>>>>
>>>>>           drm_for_each_crtc(crtc, drm_dev)
>>>>>                   if (to_exynos_crtc(crtc)->type == out_type)
>>>>>                           return to_exynos_crtc(crtc);
>>>>>
>>>>>           return ERR_PTR(-EPERM);
>>>>> }
>>>>>
>>>>> Does "Operation not permitted" really convey the error here?  It doesn't
>>>>> look like a permission error to me.
>>>>>
>>>>> Can we please avoid abusing errno codes?
>>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
>>>> with top commit g968edbd worked just fine for me last Friday. I ran
>>>> several
>>>> tests and everything checked out except the exynos-gsc lockdep issue I
>>>> sent
>>>> a 4.14 patch for.
>>>>
>>>> However, with 4.15-rc3, dmesg is gets filled with
>>>>
>>>> [  342.337181] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  342.337470] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  342.337851] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  402.382346] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  402.396682] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  402.399244] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  402.399496] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  402.399848] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  402.400163] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  402.400495] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  402.401294] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [  402.401595] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>>
>>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>>>>
>>>> I will start bisect and try to isolate the problem. I suspect this is
>>>> related to dts
>>>> changes perhaps? I used to this problem a while back and it has been
>>>> fixed.
>>>
>>> This warning has been added intentionally, see following discussions:
>>> https://patchwork.kernel.org/patch/10034919/
>>> https://patchwork.kernel.org/patch/10070475/
>>>
>>> This means that your test apps should be updated or you should enable Exynos
>>> IOMMU support in your config. Maybe it is a good time to finally enable it
>>> in exynos_defconfig.
>>>
>> Has the issue that the boot-loader keeps the display controller
>> enabled and scanning pages on the Exynos Chromebooks resolved?
>>
>> I think that's that preventing to enable it by default in
>> exynos_defconfig since it caused boot failures when enabled on these
>> machines. I don't follow exynos development too closely nowadays so
>> maybe there's a fix in place now.
> 
> Not directly. I still didn't find time to properly add support for
> devices, which were left in-working state (with active DMA
> transactions) by bootloader, but due to some other changes in the
> order of operations during boot process, power domains are
> initialized very early and due to temporary lack of devices (which
> are not yet added to the system), are turned off. This practically
> stops FIMD for scanning framebuffer and "solves" this issue.
> 
> I've checked now and Exynos Snow Chromebook boots fine with IOMMU
> support enabled, both with v4.15-rc3 and linux-next.
> 

Good to know it doesn't break Exynos Snow. This is why I test without
IOMMU and then enable IOMMU on odroid-xu4 for test with IOMMU

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12 14:47                       ` Shuah Khan
  0 siblings, 0 replies; 35+ messages in thread
From: Shuah Khan @ 2017-12-12 14:47 UTC (permalink / raw)
  To: linux-arm-kernel

On 12/12/2017 01:47 AM, Marek Szyprowski wrote:
> Hi Javier,
> 
> On 2017-12-12 09:00, Javier Martinez Canillas wrote:
>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
>> <m.szyprowski@samsung.com> wrote:
>>> On 2017-12-12 00:25, Shuah Khan wrote:
>>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas
>>>>>> wrote:
>>>>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>>>>
>>>>>>> I've posted a fix for that:
>>>>>>>
>>>>>>> https://patchwork.kernel.org/patch/10105921/
>>>>>>>
>>>>>>> I believe this could be also be the cause for the boot failure, since
>>>>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>>>>> fails to bind the HDMI component:
>>>>>>>
>>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>>>>> 0xc1398690): -1
>>>>>> Umm, -1 ?? Looking that error code up in
>>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>>>>
>>>>>> I suspect that's someone just returning -1 because they're lazy...
>>>>>> which is real bad form and needs fixing.
>>>>> Oh, it really is -EPERM:
>>>>>
>>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device
>>>>> *drm_dev,
>>>>> ???????????????????????????????????????? enum exynos_drm_output_type
>>>>> out_type)
>>>>> {
>>>>> ????????? struct drm_crtc *crtc;
>>>>>
>>>>> ????????? drm_for_each_crtc(crtc, drm_dev)
>>>>> ????????????????? if (to_exynos_crtc(crtc)->type == out_type)
>>>>> ????????????????????????? return to_exynos_crtc(crtc);
>>>>>
>>>>> ????????? return ERR_PTR(-EPERM);
>>>>> }
>>>>>
>>>>> Does "Operation not permitted" really convey the error here?? It doesn't
>>>>> look like a permission error to me.
>>>>>
>>>>> Can we please avoid abusing errno codes?
>>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
>>>> with top commit g968edbd worked just fine for me last Friday. I ran
>>>> several
>>>> tests and everything checked out except the exynos-gsc lockdep issue I
>>>> sent
>>>> a 4.14 patch for.
>>>>
>>>> However, with 4.15-rc3, dmesg is gets filled with
>>>>
>>>> [? 342.337181] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 342.337470] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 342.337851] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 402.382346] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 402.396682] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 402.399244] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 402.399496] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 402.399848] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 402.400163] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 402.400495] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 402.401294] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>> [? 402.401595] [drm] Non-contiguous allocation is not supported without
>>>> IOMMU, falling back to contiguous buffer
>>>>
>>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>>>>
>>>> I will start bisect and try to isolate the problem. I suspect this is
>>>> related to dts
>>>> changes perhaps? I used to this problem a while back and it has been
>>>> fixed.
>>>
>>> This warning has been added intentionally, see following discussions:
>>> https://patchwork.kernel.org/patch/10034919/
>>> https://patchwork.kernel.org/patch/10070475/
>>>
>>> This means that your test apps should be updated or you should enable Exynos
>>> IOMMU support in your config. Maybe it is a good time to finally enable it
>>> in exynos_defconfig.
>>>
>> Has the issue that the boot-loader keeps the display controller
>> enabled and scanning pages on the Exynos Chromebooks resolved?
>>
>> I think that's that preventing to enable it by default in
>> exynos_defconfig since it caused boot failures when enabled on these
>> machines. I don't follow exynos development too closely nowadays so
>> maybe there's a fix in place now.
> 
> Not directly. I still didn't find time to properly add support for
> devices, which were left in-working state (with active DMA
> transactions) by bootloader, but due to some other changes in the
> order of operations during boot process, power domains are
> initialized very early and due to temporary lack of devices (which
> are not yet added to the system), are turned off. This practically
> stops FIMD for scanning framebuffer and "solves" this issue.
> 
> I've checked now and Exynos Snow Chromebook boots fine with IOMMU
> support enabled, both with v4.15-rc3 and linux-next.
> 

Good to know it doesn't break Exynos Snow. This is why I test without
IOMMU and then enable IOMMU on odroid-xu4 for test with IOMMU

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-12 11:38         ` Marek Szyprowski
@ 2017-12-12 18:14           ` Daniel Vetter
  -1 siblings, 0 replies; 35+ messages in thread
From: Daniel Vetter @ 2017-12-12 18:14 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Javier Martinez Canillas, Shuah Khan, Guillaume Tucker,
	Mark Brown, Kevin Hilman, Matt Hart, Thierry Escande,
	Tomeu Vizoso, Enric Balletbo i Serra, linux-samsung-soc,
	kernel-build-reports, linux-arm-kernel

On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hi All,
>
>
> On 2017-12-11 23:28, Javier Martinez Canillas wrote:
>>
>> [adding Marek and Shuah to cc list]
>>
>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch>
>> wrote:
>>>
>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
>>> <guillaume.tucker@collabora.com> wrote:
>>>>
>>>> Hi Daniel,
>>>>
>>>> Please see below, I've had several bisection results pointing at
>>>> that commit over the week-end on mainline but also on linux-next
>>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>>> and is likely to have more than one issue, it does seem like this
>>>> commit is causing some well reproducible kernel hang.
>>>>
>>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>>
>>>>    https://lava.collabora.co.uk/scheduler/job/1018478
>>>>
>>>> and here's another one with the change mentioned below reverted:
>>>>
>>>>    https://lava.collabora.co.uk/scheduler/job/1018479
>>>>
>>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>>> I don't know if this is related as I haven't investigated any
>>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>>> boot most of the time with the commit reverted though.
>>>>
>>>> The automated kernelci.org bisection is still an experimental
>>>> tool and it may well be a false positive, so please take this
>>>> result with a pinch of salt...
>>>
>>> The patch just very minimal moves the connector cleanup around (so
>>> timing change), but except when you unload a driver (or maybe that
>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>>> more info than "seems to hang a bit more" I have no idea what's wrong.
>>> The patch itself should work, at least it survived quite some serious
>>> testing we do on everything.
>>> -Daniel
>>>
>> Marek was pointing to a different culprit [0] in this [1] thread. I
>> see that both commits made it to v4.15-rc3, which is the first version
>> where boot fails. So maybe is a combination of both? Or rather
>> reverting one patch masks the error in the other.
>>
>> I've access to the machine but unfortunately not a lot of time to dig
>> on this, I could try to do it in the weekend though.
>
>
> After a recent discussion on the Javier's patch:
> https://patchwork.kernel.org/patch/10106417/
> I've managed to reproduce this issue also on Exynos5250 based Samsung
> Snow Chromebook and investigate a bit.
>
> It is caused by a deadlock in the main kernel workqueue. Here are details:
>
> 1. Exynos DRM fails to initialize due to missing regulators and gets moved
> to deferred probe device list
>
> 2. Deferred probe is triggered and kernel "events" workqueue calls
> deferred_probe_work_func()
>
> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing
> Exynos Mixer device
>
> 4. error handling path is executed in exynos_drm_bind(), which calls
> drm_mode_config_cleanup()
>
> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
> deadlock.
>
> Do You have idea how to fix this issue properly?
>
> Taking a look at git blame, this indeed shows that the issue has been
> introduced by the commit a703c55004e1 ("drm: safely free connectors from
> connector_ite"), which added a call to flush_scheduled_work() in
> drm_mode_config_cleanup().
>
> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
> called from the workqueue, but I don't have idea how to check that. The
> other way of fixing it would be to resurrect separate workqueue for DRM
> related events.

We need to flush the work there, or things will go wrong on unload. I
think the real fix is to make sure that the connector cleanup work
isn't run on the same work stuff as any driver bind stuff, which yes
means new separate workqueue just for this.

I guess the simple fix is to do that in the drm, but tbh I'm surprised
that driver bind/deferred probe hasn't run into this problem anywhere
else yet.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12 18:14           ` Daniel Vetter
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Vetter @ 2017-12-12 18:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> Hi All,
>
>
> On 2017-12-11 23:28, Javier Martinez Canillas wrote:
>>
>> [adding Marek and Shuah to cc list]
>>
>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch>
>> wrote:
>>>
>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
>>> <guillaume.tucker@collabora.com> wrote:
>>>>
>>>> Hi Daniel,
>>>>
>>>> Please see below, I've had several bisection results pointing at
>>>> that commit over the week-end on mainline but also on linux-next
>>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>>> and is likely to have more than one issue, it does seem like this
>>>> commit is causing some well reproducible kernel hang.
>>>>
>>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>>
>>>>    https://lava.collabora.co.uk/scheduler/job/1018478
>>>>
>>>> and here's another one with the change mentioned below reverted:
>>>>
>>>>    https://lava.collabora.co.uk/scheduler/job/1018479
>>>>
>>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>>> I don't know if this is related as I haven't investigated any
>>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>>> boot most of the time with the commit reverted though.
>>>>
>>>> The automated kernelci.org bisection is still an experimental
>>>> tool and it may well be a false positive, so please take this
>>>> result with a pinch of salt...
>>>
>>> The patch just very minimal moves the connector cleanup around (so
>>> timing change), but except when you unload a driver (or maybe that
>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>>> more info than "seems to hang a bit more" I have no idea what's wrong.
>>> The patch itself should work, at least it survived quite some serious
>>> testing we do on everything.
>>> -Daniel
>>>
>> Marek was pointing to a different culprit [0] in this [1] thread. I
>> see that both commits made it to v4.15-rc3, which is the first version
>> where boot fails. So maybe is a combination of both? Or rather
>> reverting one patch masks the error in the other.
>>
>> I've access to the machine but unfortunately not a lot of time to dig
>> on this, I could try to do it in the weekend though.
>
>
> After a recent discussion on the Javier's patch:
> https://patchwork.kernel.org/patch/10106417/
> I've managed to reproduce this issue also on Exynos5250 based Samsung
> Snow Chromebook and investigate a bit.
>
> It is caused by a deadlock in the main kernel workqueue. Here are details:
>
> 1. Exynos DRM fails to initialize due to missing regulators and gets moved
> to deferred probe device list
>
> 2. Deferred probe is triggered and kernel "events" workqueue calls
> deferred_probe_work_func()
>
> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing
> Exynos Mixer device
>
> 4. error handling path is executed in exynos_drm_bind(), which calls
> drm_mode_config_cleanup()
>
> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
> deadlock.
>
> Do You have idea how to fix this issue properly?
>
> Taking a look at git blame, this indeed shows that the issue has been
> introduced by the commit a703c55004e1 ("drm: safely free connectors from
> connector_ite"), which added a call to flush_scheduled_work() in
> drm_mode_config_cleanup().
>
> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
> called from the workqueue, but I don't have idea how to check that. The
> other way of fixing it would be to resurrect separate workqueue for DRM
> related events.

We need to flush the work there, or things will go wrong on unload. I
think the real fix is to make sure that the connector cleanup work
isn't run on the same work stuff as any driver bind stuff, which yes
means new separate workqueue just for this.

I guess the simple fix is to do that in the drm, but tbh I'm surprised
that driver bind/deferred probe hasn't run into this problem anywhere
else yet.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-12 14:47                       ` Shuah Khan
@ 2017-12-12 18:26                         ` Shuah Khan
  -1 siblings, 0 replies; 35+ messages in thread
From: Shuah Khan @ 2017-12-12 18:26 UTC (permalink / raw)
  To: Marek Szyprowski, Javier Martinez Canillas
  Cc: Russell King - ARM Linux, linux-samsung-soc, Tomeu Vizoso,
	kernel-build-reports, Daniel Vetter, Thierry Escande,
	Kevin Hilman, Enric Balletbo i Serra, linux-arm-kernel,
	Shuah Khan

On 12/12/2017 07:47 AM, Shuah Khan wrote:
> On 12/12/2017 01:47 AM, Marek Szyprowski wrote:
>> Hi Javier,
>>
>> On 2017-12-12 09:00, Javier Martinez Canillas wrote:
>>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
>>> <m.szyprowski@samsung.com> wrote:
>>>> On 2017-12-12 00:25, Shuah Khan wrote:
>>>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>>>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas
>>>>>>> wrote:
>>>>>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>>>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>>>>>
>>>>>>>> I've posted a fix for that:
>>>>>>>>
>>>>>>>> https://patchwork.kernel.org/patch/10105921/
>>>>>>>>
>>>>>>>> I believe this could be also be the cause for the boot failure, since
>>>>>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>>>>>> fails to bind the HDMI component:
>>>>>>>>
>>>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>>>>>> 0xc1398690): -1
>>>>>>> Umm, -1 ?  Looking that error code up in
>>>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>>>>>
>>>>>>> I suspect that's someone just returning -1 because they're lazy...
>>>>>>> which is real bad form and needs fixing.
>>>>>> Oh, it really is -EPERM:
>>>>>>
>>>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device
>>>>>> *drm_dev,
>>>>>>                                          enum exynos_drm_output_type
>>>>>> out_type)
>>>>>> {
>>>>>>           struct drm_crtc *crtc;
>>>>>>
>>>>>>           drm_for_each_crtc(crtc, drm_dev)
>>>>>>                   if (to_exynos_crtc(crtc)->type == out_type)
>>>>>>                           return to_exynos_crtc(crtc);
>>>>>>
>>>>>>           return ERR_PTR(-EPERM);
>>>>>> }
>>>>>>
>>>>>> Does "Operation not permitted" really convey the error here?  It doesn't
>>>>>> look like a permission error to me.
>>>>>>
>>>>>> Can we please avoid abusing errno codes?
>>>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
>>>>> with top commit g968edbd worked just fine for me last Friday. I ran
>>>>> several
>>>>> tests and everything checked out except the exynos-gsc lockdep issue I
>>>>> sent
>>>>> a 4.14 patch for.
>>>>>
>>>>> However, with 4.15-rc3, dmesg is gets filled with
>>>>>
>>>>> [  342.337181] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  342.337470] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  342.337851] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  402.382346] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  402.396682] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  402.399244] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  402.399496] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  402.399848] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  402.400163] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  402.400495] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  402.401294] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [  402.401595] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>>
>>>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>>>>>
>>>>> I will start bisect and try to isolate the problem. I suspect this is
>>>>> related to dts
>>>>> changes perhaps? I used to this problem a while back and it has been
>>>>> fixed.
>>>>
>>>> This warning has been added intentionally, see following discussions:
>>>> https://patchwork.kernel.org/patch/10034919/
>>>> https://patchwork.kernel.org/patch/10070475/
>>>>
>>>> This means that your test apps should be updated or you should enable Exynos
>>>> IOMMU support in your config. Maybe it is a good time to finally enable it
>>>> in exynos_defconfig.
>>>>
>>> Has the issue that the boot-loader keeps the display controller
>>> enabled and scanning pages on the Exynos Chromebooks resolved?
>>>
>>> I think that's that preventing to enable it by default in
>>> exynos_defconfig since it caused boot failures when enabled on these
>>> machines. I don't follow exynos development too closely nowadays so
>>> maybe there's a fix in place now.
>>
>> Not directly. I still didn't find time to properly add support for
>> devices, which were left in-working state (with active DMA
>> transactions) by bootloader, but due to some other changes in the
>> order of operations during boot process, power domains are
>> initialized very early and due to temporary lack of devices (which
>> are not yet added to the system), are turned off. This practically
>> stops FIMD for scanning framebuffer and "solves" this issue.
>>
>> I've checked now and Exynos Snow Chromebook boots fine with IOMMU
>> support enabled, both with v4.15-rc3 and linux-next.

Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent
a patch to do that a while back. The decision at the time to not pull
that patch is was based on systems not booting with it enabled.

Is it time to revisit that or the recommendation is for IOMMU to be
enabled in configs manually on systems it is safe to do so?

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12 18:26                         ` Shuah Khan
  0 siblings, 0 replies; 35+ messages in thread
From: Shuah Khan @ 2017-12-12 18:26 UTC (permalink / raw)
  To: linux-arm-kernel

On 12/12/2017 07:47 AM, Shuah Khan wrote:
> On 12/12/2017 01:47 AM, Marek Szyprowski wrote:
>> Hi Javier,
>>
>> On 2017-12-12 09:00, Javier Martinez Canillas wrote:
>>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski
>>> <m.szyprowski@samsung.com> wrote:
>>>> On 2017-12-12 00:25, Shuah Khan wrote:
>>>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote:
>>>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote:
>>>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas
>>>>>>> wrote:
>>>>>>>> So I gave a quick look to this, and at the very least there's a bug in
>>>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts:
>>>>>>>> exynos: Add status property to Exynos 542x Mixer nodes").
>>>>>>>>
>>>>>>>> I've posted a fix for that:
>>>>>>>>
>>>>>>>> https://patchwork.kernel.org/patch/10105921/
>>>>>>>>
>>>>>>>> I believe this could be also be the cause for the boot failure, since
>>>>>>>> I see in the boot log that things start to go wrong after exynos-drm
>>>>>>>> fails to bind the HDMI component:
>>>>>>>>
>>>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops
>>>>>>>> 0xc1398690): -1
>>>>>>> Umm, -1 ?? Looking that error code up in
>>>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM.
>>>>>>>
>>>>>>> I suspect that's someone just returning -1 because they're lazy...
>>>>>>> which is real bad form and needs fixing.
>>>>>> Oh, it really is -EPERM:
>>>>>>
>>>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device
>>>>>> *drm_dev,
>>>>>> ???????????????????????????????????????? enum exynos_drm_output_type
>>>>>> out_type)
>>>>>> {
>>>>>> ????????? struct drm_crtc *crtc;
>>>>>>
>>>>>> ????????? drm_for_each_crtc(crtc, drm_dev)
>>>>>> ????????????????? if (to_exynos_crtc(crtc)->type == out_type)
>>>>>> ????????????????????????? return to_exynos_crtc(crtc);
>>>>>>
>>>>>> ????????? return ERR_PTR(-EPERM);
>>>>>> }
>>>>>>
>>>>>> Does "Operation not permitted" really convey the error here?? It doesn't
>>>>>> look like a permission error to me.
>>>>>>
>>>>>> Can we please avoid abusing errno codes?
>>>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+
>>>>> with top commit g968edbd worked just fine for me last Friday. I ran
>>>>> several
>>>>> tests and everything checked out except the exynos-gsc lockdep issue I
>>>>> sent
>>>>> a 4.14 patch for.
>>>>>
>>>>> However, with 4.15-rc3, dmesg is gets filled with
>>>>>
>>>>> [? 342.337181] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 342.337470] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 342.337851] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 402.382346] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 402.396682] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 402.399244] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 402.399496] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 402.399848] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 402.400163] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 402.400495] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 402.401294] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>> [? 402.401595] [drm] Non-contiguous allocation is not supported without
>>>>> IOMMU, falling back to contiguous buffer
>>>>>
>>>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig.
>>>>>
>>>>> I will start bisect and try to isolate the problem. I suspect this is
>>>>> related to dts
>>>>> changes perhaps? I used to this problem a while back and it has been
>>>>> fixed.
>>>>
>>>> This warning has been added intentionally, see following discussions:
>>>> https://patchwork.kernel.org/patch/10034919/
>>>> https://patchwork.kernel.org/patch/10070475/
>>>>
>>>> This means that your test apps should be updated or you should enable Exynos
>>>> IOMMU support in your config. Maybe it is a good time to finally enable it
>>>> in exynos_defconfig.
>>>>
>>> Has the issue that the boot-loader keeps the display controller
>>> enabled and scanning pages on the Exynos Chromebooks resolved?
>>>
>>> I think that's that preventing to enable it by default in
>>> exynos_defconfig since it caused boot failures when enabled on these
>>> machines. I don't follow exynos development too closely nowadays so
>>> maybe there's a fix in place now.
>>
>> Not directly. I still didn't find time to properly add support for
>> devices, which were left in-working state (with active DMA
>> transactions) by bootloader, but due to some other changes in the
>> order of operations during boot process, power domains are
>> initialized very early and due to temporary lack of devices (which
>> are not yet added to the system), are turned off. This practically
>> stops FIMD for scanning framebuffer and "solves" this issue.
>>
>> I've checked now and Exynos Snow Chromebook boots fine with IOMMU
>> support enabled, both with v4.15-rc3 and linux-next.

Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent
a patch to do that a while back. The decision at the time to not pull
that patch is was based on systems not booting with it enabled.

Is it time to revisit that or the recommendation is for IOMMU to be
enabled in configs manually on systems it is safe to do so?

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-12 18:26                         ` Shuah Khan
@ 2017-12-12 18:58                           ` Javier Martinez Canillas
  -1 siblings, 0 replies; 35+ messages in thread
From: Javier Martinez Canillas @ 2017-12-12 18:58 UTC (permalink / raw)
  To: Shuah Khan
  Cc: Marek Szyprowski, Russell King - ARM Linux, linux-samsung-soc,
	Tomeu Vizoso, kernel-build-reports, Daniel Vetter,
	Thierry Escande, Kevin Hilman, Enric Balletbo i Serra,
	linux-arm-kernel

Hello Shuah,

On Tue, Dec 12, 2017 at 7:26 PM, Shuah Khan <shuahkh@osg.samsung.com> wrote:

[snip]

>>>
>>> Not directly. I still didn't find time to properly add support for
>>> devices, which were left in-working state (with active DMA
>>> transactions) by bootloader, but due to some other changes in the
>>> order of operations during boot process, power domains are
>>> initialized very early and due to temporary lack of devices (which
>>> are not yet added to the system), are turned off. This practically
>>> stops FIMD for scanning framebuffer and "solves" this issue.
>>>
>>> I've checked now and Exynos Snow Chromebook boots fine with IOMMU
>>> support enabled, both with v4.15-rc3 and linux-next.
>
> Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent
> a patch to do that a while back. The decision at the time to not pull
> that patch is was based on systems not booting with it enabled.
>
> Is it time to revisit that or the recommendation is for IOMMU to be
> enabled in configs manually on systems it is safe to do so?
>

Yes, I think it would be good to have it enabled by default if that
doesn't cause boot issues anymore.

Could you please resend your patch and cc Marek and me so we can test
on Snow and Peach Chromebooks?

> thanks,
> -- Shuah

Best regards,
Javier

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-12 18:58                           ` Javier Martinez Canillas
  0 siblings, 0 replies; 35+ messages in thread
From: Javier Martinez Canillas @ 2017-12-12 18:58 UTC (permalink / raw)
  To: linux-arm-kernel

Hello Shuah,

On Tue, Dec 12, 2017 at 7:26 PM, Shuah Khan <shuahkh@osg.samsung.com> wrote:

[snip]

>>>
>>> Not directly. I still didn't find time to properly add support for
>>> devices, which were left in-working state (with active DMA
>>> transactions) by bootloader, but due to some other changes in the
>>> order of operations during boot process, power domains are
>>> initialized very early and due to temporary lack of devices (which
>>> are not yet added to the system), are turned off. This practically
>>> stops FIMD for scanning framebuffer and "solves" this issue.
>>>
>>> I've checked now and Exynos Snow Chromebook boots fine with IOMMU
>>> support enabled, both with v4.15-rc3 and linux-next.
>
> Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent
> a patch to do that a while back. The decision at the time to not pull
> that patch is was based on systems not booting with it enabled.
>
> Is it time to revisit that or the recommendation is for IOMMU to be
> enabled in configs manually on systems it is safe to do so?
>

Yes, I think it would be good to have it enabled by default if that
doesn't cause boot issues anymore.

Could you please resend your patch and cc Marek and me so we can test
on Snow and Peach Chromebooks?

> thanks,
> -- Shuah

Best regards,
Javier

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
  2017-12-12 18:14           ` Daniel Vetter
@ 2017-12-13  9:02             ` Marek Szyprowski
  -1 siblings, 0 replies; 35+ messages in thread
From: Marek Szyprowski @ 2017-12-13  9:02 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Javier Martinez Canillas, Shuah Khan, Guillaume Tucker,
	Mark Brown, Kevin Hilman, Matt Hart, Thierry Escande,
	Tomeu Vizoso, Enric Balletbo i Serra, linux-samsung-soc,
	kernel-build-reports, linux-arm-kernel

Hi Daniel,

On 2017-12-12 19:14, Daniel Vetter wrote:
> On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> Hi All,
>>
>>
>> On 2017-12-11 23:28, Javier Martinez Canillas wrote:
>>> [adding Marek and Shuah to cc list]
>>>
>>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch>
>>> wrote:
>>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
>>>> <guillaume.tucker@collabora.com> wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> Please see below, I've had several bisection results pointing at
>>>>> that commit over the week-end on mainline but also on linux-next
>>>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>>>> and is likely to have more than one issue, it does seem like this
>>>>> commit is causing some well reproducible kernel hang.
>>>>>
>>>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>>>
>>>>>     https://lava.collabora.co.uk/scheduler/job/1018478
>>>>>
>>>>> and here's another one with the change mentioned below reverted:
>>>>>
>>>>>     https://lava.collabora.co.uk/scheduler/job/1018479
>>>>>
>>>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>>>> I don't know if this is related as I haven't investigated any
>>>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>>>> boot most of the time with the commit reverted though.
>>>>>
>>>>> The automated kernelci.org bisection is still an experimental
>>>>> tool and it may well be a false positive, so please take this
>>>>> result with a pinch of salt...
>>>> The patch just very minimal moves the connector cleanup around (so
>>>> timing change), but except when you unload a driver (or maybe that
>>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>>>> more info than "seems to hang a bit more" I have no idea what's wrong.
>>>> The patch itself should work, at least it survived quite some serious
>>>> testing we do on everything.
>>>> -Daniel
>>>>
>>> Marek was pointing to a different culprit [0] in this [1] thread. I
>>> see that both commits made it to v4.15-rc3, which is the first version
>>> where boot fails. So maybe is a combination of both? Or rather
>>> reverting one patch masks the error in the other.
>>>
>>> I've access to the machine but unfortunately not a lot of time to dig
>>> on this, I could try to do it in the weekend though.
>>
>> After a recent discussion on the Javier's patch:
>> https://patchwork.kernel.org/patch/10106417/
>> I've managed to reproduce this issue also on Exynos5250 based Samsung
>> Snow Chromebook and investigate a bit.
>>
>> It is caused by a deadlock in the main kernel workqueue. Here are details:
>>
>> 1. Exynos DRM fails to initialize due to missing regulators and gets moved
>> to deferred probe device list
>>
>> 2. Deferred probe is triggered and kernel "events" workqueue calls
>> deferred_probe_work_func()
>>
>> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing
>> Exynos Mixer device
>>
>> 4. error handling path is executed in exynos_drm_bind(), which calls
>> drm_mode_config_cleanup()
>>
>> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
>> deadlock.
>>
>> Do You have idea how to fix this issue properly?
>>
>> Taking a look at git blame, this indeed shows that the issue has been
>> introduced by the commit a703c55004e1 ("drm: safely free connectors from
>> connector_ite"), which added a call to flush_scheduled_work() in
>> drm_mode_config_cleanup().
>>
>> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
>> called from the workqueue, but I don't have idea how to check that. The
>> other way of fixing it would be to resurrect separate workqueue for DRM
>> related events.
> We need to flush the work there, or things will go wrong on unload. I
> think the real fix is to make sure that the connector cleanup work
> isn't run on the same work stuff as any driver bind stuff, which yes
> means new separate workqueue just for this.
>
> I guess the simple fix is to do that in the drm, but tbh I'm surprised
> that driver bind/deferred probe hasn't run into this problem anywhere
> else yet.

Well, this means that no-one tested the error paths in deferred probe
case. It's not that surprising, if we assume that typically platform
devices are deferred only once. Second probe() call (which is done from
workqueue) is successful in that case.

I've managed to fix this deadlock without additional workqueue:
https://patchwork.freedesktop.org/patch/193069/

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging
@ 2017-12-13  9:02             ` Marek Szyprowski
  0 siblings, 0 replies; 35+ messages in thread
From: Marek Szyprowski @ 2017-12-13  9:02 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Daniel,

On 2017-12-12 19:14, Daniel Vetter wrote:
> On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> Hi All,
>>
>>
>> On 2017-12-11 23:28, Javier Martinez Canillas wrote:
>>> [adding Marek and Shuah to cc list]
>>>
>>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch>
>>> wrote:
>>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker
>>>> <guillaume.tucker@collabora.com> wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> Please see below, I've had several bisection results pointing at
>>>>> that commit over the week-end on mainline but also on linux-next
>>>>> and net-next.  While the peach-pi is a bit flaky at the moment
>>>>> and is likely to have more than one issue, it does seem like this
>>>>> commit is causing some well reproducible kernel hang.
>>>>>
>>>>> Here's a re-run with v4.15-rc3 showing the issue:
>>>>>
>>>>>     https://lava.collabora.co.uk/scheduler/job/1018478
>>>>>
>>>>> and here's another one with the change mentioned below reverted:
>>>>>
>>>>>     https://lava.collabora.co.uk/scheduler/job/1018479
>>>>>
>>>>> They both show a warning about "unbalanced disables for lcd_vdd",
>>>>> I don't know if this is related as I haven't investigated any
>>>>> further.  It does appear to reliably hang with v4.15-rc3 and
>>>>> boot most of the time with the commit reverted though.
>>>>>
>>>>> The automated kernelci.org bisection is still an experimental
>>>>> tool and it may well be a false positive, so please take this
>>>>> result with a pinch of salt...
>>>> The patch just very minimal moves the connector cleanup around (so
>>>> timing change), but except when you unload a driver (or maybe that
>>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have
>>>> more info than "seems to hang a bit more" I have no idea what's wrong.
>>>> The patch itself should work, at least it survived quite some serious
>>>> testing we do on everything.
>>>> -Daniel
>>>>
>>> Marek was pointing to a different culprit [0] in this [1] thread. I
>>> see that both commits made it to v4.15-rc3, which is the first version
>>> where boot fails. So maybe is a combination of both? Or rather
>>> reverting one patch masks the error in the other.
>>>
>>> I've access to the machine but unfortunately not a lot of time to dig
>>> on this, I could try to do it in the weekend though.
>>
>> After a recent discussion on the Javier's patch:
>> https://patchwork.kernel.org/patch/10106417/
>> I've managed to reproduce this issue also on Exynos5250 based Samsung
>> Snow Chromebook and investigate a bit.
>>
>> It is caused by a deadlock in the main kernel workqueue. Here are details:
>>
>> 1. Exynos DRM fails to initialize due to missing regulators and gets moved
>> to deferred probe device list
>>
>> 2. Deferred probe is triggered and kernel "events" workqueue calls
>> deferred_probe_work_func()
>>
>> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing
>> Exynos Mixer device
>>
>> 4. error handling path is executed in exynos_drm_bind(), which calls
>> drm_mode_config_cleanup()
>>
>> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes
>> deadlock.
>>
>> Do You have idea how to fix this issue properly?
>>
>> Taking a look at git blame, this indeed shows that the issue has been
>> introduced by the commit a703c55004e1 ("drm: safely free connectors from
>> connector_ite"), which added a call to flush_scheduled_work() in
>> drm_mode_config_cleanup().
>>
>> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if
>> called from the workqueue, but I don't have idea how to check that. The
>> other way of fixing it would be to resurrect separate workqueue for DRM
>> related events.
> We need to flush the work there, or things will go wrong on unload. I
> think the real fix is to make sure that the connector cleanup work
> isn't run on the same work stuff as any driver bind stuff, which yes
> means new separate workqueue just for this.
>
> I guess the simple fix is to do that in the drm, but tbh I'm surprised
> that driver bind/deferred probe hasn't run into this problem anywhere
> else yet.

Well, this means that no-one tested the error paths in deferred probe
case. It's not that surprising, if we assume that typically platform
devices are deferred only once. Second probe() call (which is done from
workqueue) is successful in that case.

I've managed to fix this deadlock without additional workqueue:
https://patchwork.freedesktop.org/patch/193069/

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2017-12-13  9:02 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <680389204.71.1512980755810.JavaMail.jenkins@ip-172-30-0-246>
2017-12-11 10:30 ` Fwd: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging Guillaume Tucker
2017-12-11 17:05   ` Daniel Vetter
2017-12-11 17:05     ` Daniel Vetter
2017-12-11 22:28     ` Javier Martinez Canillas
2017-12-11 22:28       ` Javier Martinez Canillas
2017-12-11 22:54       ` Javier Martinez Canillas
2017-12-11 22:54         ` Javier Martinez Canillas
2017-12-11 22:58         ` Russell King - ARM Linux
2017-12-11 22:58           ` Russell King - ARM Linux
2017-12-11 23:02           ` Russell King - ARM Linux
2017-12-11 23:02             ` Russell King - ARM Linux
2017-12-11 23:25             ` Shuah Khan
2017-12-11 23:25               ` Shuah Khan
2017-12-12  7:54               ` Marek Szyprowski
2017-12-12  7:54                 ` Marek Szyprowski
2017-12-12  8:00                 ` Javier Martinez Canillas
2017-12-12  8:00                   ` Javier Martinez Canillas
2017-12-12  8:47                   ` Marek Szyprowski
2017-12-12  8:47                     ` Marek Szyprowski
2017-12-12 10:10                     ` Krzysztof Kozlowski
2017-12-12 10:10                       ` Krzysztof Kozlowski
2017-12-12 14:47                     ` Shuah Khan
2017-12-12 14:47                       ` Shuah Khan
2017-12-12 18:26                       ` Shuah Khan
2017-12-12 18:26                         ` Shuah Khan
2017-12-12 18:58                         ` Javier Martinez Canillas
2017-12-12 18:58                           ` Javier Martinez Canillas
2017-12-12 11:38       ` Marek Szyprowski
2017-12-12 11:38         ` Marek Szyprowski
2017-12-12 14:39         ` Shuah Khan
2017-12-12 14:39           ` Shuah Khan
2017-12-12 18:14         ` Daniel Vetter
2017-12-12 18:14           ` Daniel Vetter
2017-12-13  9:02           ` Marek Szyprowski
2017-12-13  9:02             ` Marek Szyprowski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.