dri-devel.lists.freedesktop.org archive mirror
* Re: [PATCH 0/1] drm/i915: Fix a deadlock that only affects 5.4
       [not found] <20200407062622.6443-1-sultan@kerneltoast.com>
@ 2020-04-07  6:52 ` Greg KH
       [not found]   ` <20200407071809.3148-1-sultan@kerneltoast.com>
       [not found] ` <20200407062622.6443-2-sultan@kerneltoast.com>
  1 sibling, 1 reply; 16+ messages in thread
From: Greg KH @ 2020-04-07  6:52 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: dri-devel, David Airlie, intel-gfx, Chris Wilson, stable, Rodrigo Vivi

On Mon, Apr 06, 2020 at 11:26:21PM -0700, Sultan Alsawaf wrote:
> From: Sultan Alsawaf <sultan@kerneltoast.com>
> 
> Hi,
> 
> There's a mutex lock deadlock in i915 that only affects 5.4, but was fixed in
> 5.5. Normally, I would send a backport of the fix from 5.5, but the patch set
> that fixes the deadlock involves massive changes that are neither feasible nor
> desirable for backporting [1][2][3]. Therefore, I've made a small patch that
> only addresses the deadlock specifically for 5.4.

This paragraph needs to go into the patch itself, otherwise just looking
at that doesn't make any sense.

And you do not need a cover letter for a single patch.

Please fix up and resend.

thanks,

greg k-h
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
       [not found]   ` <20200407071809.3148-1-sultan@kerneltoast.com>
@ 2020-04-10  9:08     ` Greg KH
  2020-04-10 14:15       ` Sultan Alsawaf
  2020-04-10 14:17       ` Sultan Alsawaf
  2020-04-10 11:46     ` Patch "drm/i915: Fix ref->mutex deadlock in i915_active_wait()" has been added to the 5.4-stable tree gregkh
  1 sibling, 2 replies; 16+ messages in thread
From: Greg KH @ 2020-04-10  9:08 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: dri-devel, David Airlie, intel-gfx, Chris Wilson, stable, Rodrigo Vivi

On Tue, Apr 07, 2020 at 12:18:09AM -0700, Sultan Alsawaf wrote:
> From: Sultan Alsawaf <sultan@kerneltoast.com>
> 
> The following deadlock exists in i915_active_wait() due to a double lock
> on ref->mutex (call chain listed in order from top to bottom):
>  i915_active_wait();
>  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
>  i915_active_request_retire();
>  node_retire();
>  active_retire();
>  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> 
> Fix the deadlock by skipping the second ref->mutex lock when
> active_retire() is called through i915_active_request_retire().
> 
> Note that this bug only affects 5.4 and has since been fixed in 5.5.
> Normally, a backport of the fix from 5.5 would be in order, but the
> patch set that fixes this deadlock involves massive changes that are
> neither feasible nor desirable for backporting [1][2][3]. Therefore,
> this small patch was made to address the deadlock specifically for 5.4.
> 
> [1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
> [2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree")
> [3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")
> 
> Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> Cc: <stable@vger.kernel.org> # 5.4.x
> Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> ---
>  drivers/gpu/drm/i915/i915_active.c | 27 +++++++++++++++++++++++----
>  drivers/gpu/drm/i915/i915_active.h |  4 ++--
>  2 files changed, 25 insertions(+), 6 deletions(-)

Now queued up, thanks.

greg k-h

* Patch "drm/i915: Fix ref->mutex deadlock in i915_active_wait()" has been added to the 5.4-stable tree
       [not found]   ` <20200407071809.3148-1-sultan@kerneltoast.com>
  2020-04-10  9:08     ` [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait() Greg KH
@ 2020-04-10 11:46     ` gregkh
  1 sibling, 0 replies; 16+ messages in thread
From: gregkh @ 2020-04-10 11:46 UTC (permalink / raw)
  To: airlied, chris, daniel, dri-devel, greg, gregkh, intel-gfx,
	jani.nikula, joonas.lahtinen, rodrigo.vivi, sultan
  Cc: stable-commits


This is a note to let you know that I've just added the patch titled

    drm/i915: Fix ref->mutex deadlock in i915_active_wait()

to the 5.4-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     drm-i915-fix-ref-mutex-deadlock-in-i915_active_wait.patch
and it can be found in the queue-5.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.


From sultan@kerneltoast.com  Fri Apr 10 11:07:34 2020
From: Sultan Alsawaf <sultan@kerneltoast.com>
Date: Tue,  7 Apr 2020 00:18:09 -0700
Subject: drm/i915: Fix ref->mutex deadlock in i915_active_wait()
To: Greg KH <greg@kroah.com>
Cc: stable@vger.kernel.org, Jani Nikula <jani.nikula@linux.intel.com>, Joonas Lahtinen <joonas.lahtinen@linux.intel.com>, Rodrigo Vivi <rodrigo.vivi@intel.com>, David Airlie <airlied@linux.ie>, Daniel Vetter <daniel@ffwll.ch>, Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, Sultan Alsawaf <sultan@kerneltoast.com>
Message-ID: <20200407071809.3148-1-sultan@kerneltoast.com>

From: Sultan Alsawaf <sultan@kerneltoast.com>

The following deadlock exists in i915_active_wait() due to a double lock
on ref->mutex (call chain listed in order from top to bottom):
 i915_active_wait();
 mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
 i915_active_request_retire();
 node_retire();
 active_retire();
 mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK

Fix the deadlock by skipping the second ref->mutex lock when
active_retire() is called through i915_active_request_retire().

Note that this bug only affects 5.4 and has since been fixed in 5.5.
Normally, a backport of the fix from 5.5 would be in order, but the
patch set that fixes this deadlock involves massive changes that are
neither feasible nor desirable for backporting [1][2][3]. Therefore,
this small patch was made to address the deadlock specifically for 5.4.

[1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
[2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree")
[3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")

Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
Cc: <stable@vger.kernel.org> # 5.4.x
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/gpu/drm/i915/i915_active.c |   27 +++++++++++++++++++++++----
 drivers/gpu/drm/i915/i915_active.h |    4 ++--
 2 files changed, 25 insertions(+), 6 deletions(-)

--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -120,13 +120,17 @@ static inline void debug_active_assert(s
 
 #endif
 
+#define I915_ACTIVE_RETIRE_NOLOCK BIT(0)
+
 static void
 __active_retire(struct i915_active *ref)
 {
 	struct active_node *it, *n;
 	struct rb_root root;
 	bool retire = false;
+	unsigned long bits;
 
+	ref = ptr_unpack_bits(ref, &bits, 2);
 	lockdep_assert_held(&ref->mutex);
 
 	/* return the unused nodes to our slabcache -- flushing the allocator */
@@ -138,7 +142,8 @@ __active_retire(struct i915_active *ref)
 		retire = true;
 	}
 
-	mutex_unlock(&ref->mutex);
+	if (!(bits & I915_ACTIVE_RETIRE_NOLOCK))
+		mutex_unlock(&ref->mutex);
 	if (!retire)
 		return;
 
@@ -155,13 +160,18 @@ __active_retire(struct i915_active *ref)
 static void
 active_retire(struct i915_active *ref)
 {
+	struct i915_active *ref_packed = ref;
+	unsigned long bits;
+
+	ref = ptr_unpack_bits(ref, &bits, 2);
 	GEM_BUG_ON(!atomic_read(&ref->count));
 	if (atomic_add_unless(&ref->count, -1, 1))
 		return;
 
 	/* One active may be flushed from inside the acquire of another */
-	mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING);
-	__active_retire(ref);
+	if (!(bits & I915_ACTIVE_RETIRE_NOLOCK))
+		mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING);
+	__active_retire(ref_packed);
 }
 
 static void
@@ -170,6 +180,14 @@ node_retire(struct i915_active_request *
 	active_retire(node_from_active(base)->ref);
 }
 
+static void
+node_retire_nolock(struct i915_active_request *base, struct i915_request *rq)
+{
+	struct i915_active *ref = node_from_active(base)->ref;
+
+	active_retire(ptr_pack_bits(ref, I915_ACTIVE_RETIRE_NOLOCK, 2));
+}
+
 static struct i915_active_request *
 active_instance(struct i915_active *ref, struct intel_timeline *tl)
 {
@@ -421,7 +439,8 @@ int i915_active_wait(struct i915_active
 			break;
 		}
 
-		err = i915_active_request_retire(&it->base, BKL(ref));
+		err = i915_active_request_retire(&it->base, BKL(ref),
+						 node_retire_nolock);
 		if (err)
 			break;
 	}
--- a/drivers/gpu/drm/i915/i915_active.h
+++ b/drivers/gpu/drm/i915/i915_active.h
@@ -309,7 +309,7 @@ i915_active_request_isset(const struct i
  */
 static inline int __must_check
 i915_active_request_retire(struct i915_active_request *active,
-			   struct mutex *mutex)
+			   struct mutex *mutex, i915_active_retire_fn retire)
 {
 	struct i915_request *request;
 	long ret;
@@ -327,7 +327,7 @@ i915_active_request_retire(struct i915_a
 	list_del_init(&active->link);
 	RCU_INIT_POINTER(active->request, NULL);
 
-	active->retire(active, request);
+	retire(active, request);
 
 	return 0;
 }


Patches currently in stable-queue which might be from sultan@kerneltoast.com are

queue-5.4/drm-i915-fix-ref-mutex-deadlock-in-i915_active_wait.patch

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-10  9:08     ` [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait() Greg KH
@ 2020-04-10 14:15       ` Sultan Alsawaf
  2020-04-10 14:17       ` Sultan Alsawaf
  1 sibling, 0 replies; 16+ messages in thread
From: Sultan Alsawaf @ 2020-04-10 14:15 UTC (permalink / raw)
  To: Greg KH
  Cc: dri-devel, David Airlie, intel-gfx, Chris Wilson, stable, Rodrigo Vivi

On Fri, Apr 10, 2020 at 11:08:38AM +0200, Greg KH wrote:
> On Tue, Apr 07, 2020 at 12:18:09AM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf <sultan@kerneltoast.com>
> > 
> > The following deadlock exists in i915_active_wait() due to a double lock
> > on ref->mutex (call chain listed in order from top to bottom):
> >  i915_active_wait();
> >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> >  i915_active_request_retire();
> >  node_retire();
> >  active_retire();
> >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > 
> > Fix the deadlock by skipping the second ref->mutex lock when
> > active_retire() is called through i915_active_request_retire().
> > 
> > Note that this bug only affects 5.4 and has since been fixed in 5.5.
> > Normally, a backport of the fix from 5.5 would be in order, but the
> > patch set that fixes this deadlock involves massive changes that are
> > neither feasible nor desirable for backporting [1][2][3]. Therefore,
> > this small patch was made to address the deadlock specifically for 5.4.
> > 
> > [1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
> > [2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree")
> > [3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")
> > 
> > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > Cc: <stable@vger.kernel.org> # 5.4.x
> > Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> > ---
> >  drivers/gpu/drm/i915/i915_active.c | 27 +++++++++++++++++++++++----
> >  drivers/gpu/drm/i915/i915_active.h |  4 ++--
> >  2 files changed, 25 insertions(+), 6 deletions(-)
> 
> Now queued up, thanks.
> 
> greg k-h

Hi Greg,

Thanks for queuing this! However, you queued the v1 version of the patch instead
of v2. Please pick the v2 version instead.

Thanks,
Sultan

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-10  9:08     ` [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait() Greg KH
  2020-04-10 14:15       ` Sultan Alsawaf
@ 2020-04-10 14:17       ` Sultan Alsawaf
  2020-04-11 11:39         ` Greg KH
  1 sibling, 1 reply; 16+ messages in thread
From: Sultan Alsawaf @ 2020-04-10 14:17 UTC (permalink / raw)
  To: Greg KH
  Cc: dri-devel, David Airlie, intel-gfx, Chris Wilson, stable, Rodrigo Vivi

On Fri, Apr 10, 2020 at 11:08:38AM +0200, Greg KH wrote:
> On Tue, Apr 07, 2020 at 12:18:09AM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf <sultan@kerneltoast.com>
> > 
> > The following deadlock exists in i915_active_wait() due to a double lock
> > on ref->mutex (call chain listed in order from top to bottom):
> >  i915_active_wait();
> >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> >  i915_active_request_retire();
> >  node_retire();
> >  active_retire();
> >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > 
> > Fix the deadlock by skipping the second ref->mutex lock when
> > active_retire() is called through i915_active_request_retire().
> > 
> > Note that this bug only affects 5.4 and has since been fixed in 5.5.
> > Normally, a backport of the fix from 5.5 would be in order, but the
> > patch set that fixes this deadlock involves massive changes that are
> > neither feasible nor desirable for backporting [1][2][3]. Therefore,
> > this small patch was made to address the deadlock specifically for 5.4.
> > 
> > [1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
> > [2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree")
> > [3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")
> > 
> > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > Cc: <stable@vger.kernel.org> # 5.4.x
> > Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> > ---
> >  drivers/gpu/drm/i915/i915_active.c | 27 +++++++++++++++++++++++----
> >  drivers/gpu/drm/i915/i915_active.h |  4 ++--
> >  2 files changed, 25 insertions(+), 6 deletions(-)
> 
> Now queued up, thanks.
> 
> greg k-h

I'm sorry, I meant the v3 [1]. Apologies for the confusion. The v3 was picked
into Ubuntu so that's what we're rolling with.

Thanks,
Sultan

[1] https://lore.kernel.org/stable/20200407203222.2493-1-sultan@kerneltoast.com

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-10 14:17       ` Sultan Alsawaf
@ 2020-04-11 11:39         ` Greg KH
  2020-04-14  8:15           ` Chris Wilson
  0 siblings, 1 reply; 16+ messages in thread
From: Greg KH @ 2020-04-11 11:39 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: dri-devel, David Airlie, intel-gfx, Chris Wilson, stable, Rodrigo Vivi

On Fri, Apr 10, 2020 at 07:17:38AM -0700, Sultan Alsawaf wrote:
> On Fri, Apr 10, 2020 at 11:08:38AM +0200, Greg KH wrote:
> > On Tue, Apr 07, 2020 at 12:18:09AM -0700, Sultan Alsawaf wrote:
> > > From: Sultan Alsawaf <sultan@kerneltoast.com>
> > > 
> > > The following deadlock exists in i915_active_wait() due to a double lock
> > > on ref->mutex (call chain listed in order from top to bottom):
> > >  i915_active_wait();
> > >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> > >  i915_active_request_retire();
> > >  node_retire();
> > >  active_retire();
> > >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > > 
> > > Fix the deadlock by skipping the second ref->mutex lock when
> > > active_retire() is called through i915_active_request_retire().
> > > 
> > > Note that this bug only affects 5.4 and has since been fixed in 5.5.
> > > Normally, a backport of the fix from 5.5 would be in order, but the
> > > patch set that fixes this deadlock involves massive changes that are
> > > neither feasible nor desirable for backporting [1][2][3]. Therefore,
> > > this small patch was made to address the deadlock specifically for 5.4.
> > > 
> > > [1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
> > > [2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree")
> > > [3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")
> > > 
> > > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > > Cc: <stable@vger.kernel.org> # 5.4.x
> > > Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_active.c | 27 +++++++++++++++++++++++----
> > >  drivers/gpu/drm/i915/i915_active.h |  4 ++--
> > >  2 files changed, 25 insertions(+), 6 deletions(-)
> > 
> > Now queued up, thanks.
> > 
> > greg k-h
> 
> I'm sorry, I meant the v3 [1]. Apologies for the confusion. The v3 was picked
> into Ubuntu so that's what we're rolling with.

Ok, thanks, hopefully now I picked up the right one...

greg k-h

* Re: [PATCH 1/1] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
       [not found] ` <20200407062622.6443-2-sultan@kerneltoast.com>
@ 2020-04-14  8:13   ` Chris Wilson
  2020-04-14 14:52     ` Sultan Alsawaf
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2020-04-14  8:13 UTC (permalink / raw)
  To: Sultan Alsawaf, stable
  Cc: David Airlie, intel-gfx, dri-devel, Rodrigo Vivi, Sultan Alsawaf

Quoting Sultan Alsawaf (2020-04-07 07:26:22)
> From: Sultan Alsawaf <sultan@kerneltoast.com>
> 
> The following deadlock exists in i915_active_wait() due to a double lock
> on ref->mutex (call chain listed in order from top to bottom):
>  i915_active_wait();
>  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
>  i915_active_request_retire();
>  node_retire();
>  active_retire();
>  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> 
> Fix the deadlock by skipping the second ref->mutex lock when
> active_retire() is called through i915_active_request_retire().
> 
> Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> Cc: <stable@vger.kernel.org> # 5.4.x
> Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>

Incorrect. 

You missed that it cannot retire from inside the wait due to the active
reference held on the i915_active for the wait.

The only point it can enter retire from inside i915_active_wait() is via
the terminal __active_retire() which releases the mutex in doing so.
-Chris

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-11 11:39         ` Greg KH
@ 2020-04-14  8:15           ` Chris Wilson
  2020-04-14  8:23             ` Greg KH
  2020-04-14 14:35             ` Sultan Alsawaf
  0 siblings, 2 replies; 16+ messages in thread
From: Chris Wilson @ 2020-04-14  8:15 UTC (permalink / raw)
  To: Greg KH, Sultan Alsawaf
  Cc: dri-devel, David Airlie, intel-gfx, stable, Rodrigo Vivi

Quoting Greg KH (2020-04-11 12:39:57)
> On Fri, Apr 10, 2020 at 07:17:38AM -0700, Sultan Alsawaf wrote:
> > On Fri, Apr 10, 2020 at 11:08:38AM +0200, Greg KH wrote:
> > > On Tue, Apr 07, 2020 at 12:18:09AM -0700, Sultan Alsawaf wrote:
> > > > From: Sultan Alsawaf <sultan@kerneltoast.com>
> > > > 
> > > > The following deadlock exists in i915_active_wait() due to a double lock
> > > > on ref->mutex (call chain listed in order from top to bottom):
> > > >  i915_active_wait();
> > > >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> > > >  i915_active_request_retire();
> > > >  node_retire();
> > > >  active_retire();
> > > >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > > > 
> > > > Fix the deadlock by skipping the second ref->mutex lock when
> > > > active_retire() is called through i915_active_request_retire().
> > > > 
> > > > Note that this bug only affects 5.4 and has since been fixed in 5.5.
> > > > Normally, a backport of the fix from 5.5 would be in order, but the
> > > > patch set that fixes this deadlock involves massive changes that are
> > > > neither feasible nor desirable for backporting [1][2][3]. Therefore,
> > > > this small patch was made to address the deadlock specifically for 5.4.
> > > > 
> > > > [1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
> > > > [2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree")
> > > > [3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")
> > > > 
> > > > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > > > Cc: <stable@vger.kernel.org> # 5.4.x
> > > > Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_active.c | 27 +++++++++++++++++++++++----
> > > >  drivers/gpu/drm/i915/i915_active.h |  4 ++--
> > > >  2 files changed, 25 insertions(+), 6 deletions(-)
> > > 
> > > Now queued up, thanks.
> > > 
> > > greg k-h
> > 
> > I'm sorry, I meant the v3 [1]. Apologies for the confusion. The v3 was picked
> > into Ubuntu so that's what we're rolling with.
> 
> Ok, thanks, hopefully now I picked upthe right one...

The patch does not fix a deadlock. Greg, this patch is not a backport of
a bugfix, why is it in stable?
-Chris

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-14  8:15           ` Chris Wilson
@ 2020-04-14  8:23             ` Greg KH
  2020-04-20  9:02               ` Joonas Lahtinen
  2020-04-14 14:35             ` Sultan Alsawaf
  1 sibling, 1 reply; 16+ messages in thread
From: Greg KH @ 2020-04-14  8:23 UTC (permalink / raw)
  To: Chris Wilson
  Cc: dri-devel, David Airlie, intel-gfx, stable, Rodrigo Vivi, Sultan Alsawaf

On Tue, Apr 14, 2020 at 09:15:07AM +0100, Chris Wilson wrote:
> Quoting Greg KH (2020-04-11 12:39:57)
> > On Fri, Apr 10, 2020 at 07:17:38AM -0700, Sultan Alsawaf wrote:
> > > On Fri, Apr 10, 2020 at 11:08:38AM +0200, Greg KH wrote:
> > > > On Tue, Apr 07, 2020 at 12:18:09AM -0700, Sultan Alsawaf wrote:
> > > > > From: Sultan Alsawaf <sultan@kerneltoast.com>
> > > > > 
> > > > > The following deadlock exists in i915_active_wait() due to a double lock
> > > > > on ref->mutex (call chain listed in order from top to bottom):
> > > > >  i915_active_wait();
> > > > >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> > > > >  i915_active_request_retire();
> > > > >  node_retire();
> > > > >  active_retire();
> > > > >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > > > > 
> > > > > Fix the deadlock by skipping the second ref->mutex lock when
> > > > > active_retire() is called through i915_active_request_retire().
> > > > > 
> > > > > Note that this bug only affects 5.4 and has since been fixed in 5.5.
> > > > > Normally, a backport of the fix from 5.5 would be in order, but the
> > > > > patch set that fixes this deadlock involves massive changes that are
> > > > > neither feasible nor desirable for backporting [1][2][3]. Therefore,
> > > > > this small patch was made to address the deadlock specifically for 5.4.
> > > > > 
> > > > > [1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
> > > > > [2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree")
> > > > > [3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")
> > > > > 
> > > > > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > > > > Cc: <stable@vger.kernel.org> # 5.4.x
> > > > > Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/i915_active.c | 27 +++++++++++++++++++++++----
> > > > >  drivers/gpu/drm/i915/i915_active.h |  4 ++--
> > > > >  2 files changed, 25 insertions(+), 6 deletions(-)
> > > > 
> > > > Now queued up, thanks.
> > > > 
> > > > greg k-h
> > > 
> > > I'm sorry, I meant the v3 [1]. Apologies for the confusion. The v3 was picked
> > > into Ubuntu so that's what we're rolling with.
> > 
> > Ok, thanks, hopefully now I picked upthe right one...
> 
> The patch does not fix a deadlock. Greg, this patch is not a backport of
> a bugfix, why is it in stable?

Because it says it can't be a backport as the 3 above mentioned patches
do the same thing instead?

I will be glad to drop this, but I need some kind of direction here, and
given that at least one distro is already shipping this, it felt like
the correct thing to do.

So, what do you want me to do?

thanks,

greg k-h

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-14  8:15           ` Chris Wilson
  2020-04-14  8:23             ` Greg KH
@ 2020-04-14 14:35             ` Sultan Alsawaf
  1 sibling, 0 replies; 16+ messages in thread
From: Sultan Alsawaf @ 2020-04-14 14:35 UTC (permalink / raw)
  To: Chris Wilson
  Cc: dri-devel, David Airlie, Greg KH, intel-gfx, stable, Rodrigo Vivi

On Tue, Apr 14, 2020 at 09:15:07AM +0100, Chris Wilson wrote:
> The patch does not fix a deadlock. Greg, this patch is not a backport of
> a bugfix, why is it in stable?
> -Chris

Here's the deadlock this supposedly doesn't fix:
INFO: task kswapd0:178 blocked for more than 122 seconds.
      Tainted: G     U            5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0         D    0   178      2 0x80004000
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 ? i915_request_wait+0x25b/0x370
 active_retire+0x26/0x30
 i915_active_wait+0xa3/0x1a0
 i915_vma_unbind+0xe2/0x1c0
 i915_gem_object_unbind+0x111/0x140
 i915_gem_shrink+0x21b/0x530
 i915_gem_shrinker_scan+0xfd/0x120
 do_shrink_slab+0x154/0x2c0
 shrink_slab+0xd0/0x2f0
 shrink_node+0xdf/0x420
 balance_pgdat+0x2e3/0x540
 kswapd+0x200/0x3c0
 ? __wake_up_common_lock+0xc0/0xc0
 kthread+0xfb/0x130
 ? balance_pgdat+0x540/0x540
 ? __kthread_parkme+0x60/0x60
 ret_from_fork+0x1f/0x40
INFO: task kworker/u32:5:222 blocked for more than 122 seconds.
      Tainted: G     U            5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u32:5   D    0   222      2 0x80004000
Workqueue: i915 idle_work_handler
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 idle_work_handler+0x34/0x120
 process_one_work+0x1ea/0x3a0
 worker_thread+0x4d/0x3f0
 kthread+0xfb/0x130
 ? process_one_work+0x3a0/0x3a0
 ? __kthread_parkme+0x60/0x60
 ret_from_fork+0x1f/0x40
INFO: task mpv:1535 blocked for more than 122 seconds.
      Tainted: G     U            5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mpv             D    0  1535      1 0x00000000
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 __i915_gem_free_objects+0x68/0x190
 i915_gem_create_ioctl+0x18/0x30
 ? i915_gem_dumb_create+0xa0/0xa0
 drm_ioctl_kernel+0xb2/0x100
 drm_ioctl+0x209/0x360
 ? i915_gem_dumb_create+0xa0/0xa0
 do_vfs_ioctl+0x43f/0x6c0
 ksys_ioctl+0x5e/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4e/0x140
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb49f1b32eb
Code: Bad RIP value.
RSP: 002b:00007ffef9eb0948 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffef9eb09c0 RCX: 00007fb49f1b32eb
RDX: 00007ffef9eb09c0 RSI: 00000000c010645b RDI: 0000000000000008
RBP: 00000000c010645b R08: 000055fdb80c1370 R09: 000055fdb80c14e0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb4781e56b0
R13: 0000000000000008 R14: 00007fb4781e5560 R15: 00007fb4781e56b0
INFO: task kswapd0:178 blocked for more than 245 seconds.
      Tainted: G     U            5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0         D    0   178      2 0x80004000
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 ? i915_request_wait+0x25b/0x370
 active_retire+0x26/0x30
 i915_active_wait+0xa3/0x1a0
 i915_vma_unbind+0xe2/0x1c0
 i915_gem_object_unbind+0x111/0x140
 i915_gem_shrink+0x21b/0x530
 i915_gem_shrinker_scan+0xfd/0x120
 do_shrink_slab+0x154/0x2c0
 shrink_slab+0xd0/0x2f0
 shrink_node+0xdf/0x420
 balance_pgdat+0x2e3/0x540
 kswapd+0x200/0x3c0
 ? __wake_up_common_lock+0xc0/0xc0
 kthread+0xfb/0x130
 ? balance_pgdat+0x540/0x540
 ? __kthread_parkme+0x60/0x60
 ret_from_fork+0x1f/0x40
INFO: task kworker/u32:5:222 blocked for more than 245 seconds.
      Tainted: G     U            5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u32:5   D    0   222      2 0x80004000
Workqueue: i915 idle_work_handler
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 idle_work_handler+0x34/0x120
 process_one_work+0x1ea/0x3a0
 worker_thread+0x4d/0x3f0
 kthread+0xfb/0x130
 ? process_one_work+0x3a0/0x3a0
 ? __kthread_parkme+0x60/0x60
 ret_from_fork+0x1f/0x40
INFO: task mpv:1535 blocked for more than 245 seconds.
      Tainted: G     U            5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mpv             D    0  1535      1 0x00000000
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 __i915_gem_free_objects+0x68/0x190
 i915_gem_create_ioctl+0x18/0x30
 ? i915_gem_dumb_create+0xa0/0xa0
 drm_ioctl_kernel+0xb2/0x100
 drm_ioctl+0x209/0x360
 ? i915_gem_dumb_create+0xa0/0xa0
 do_vfs_ioctl+0x43f/0x6c0
 ksys_ioctl+0x5e/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4e/0x140
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb49f1b32eb
Code: Bad RIP value.
RSP: 002b:00007ffef9eb0948 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffef9eb09c0 RCX: 00007fb49f1b32eb
RDX: 00007ffef9eb09c0 RSI: 00000000c010645b RDI: 0000000000000008
RBP: 00000000c010645b R08: 000055fdb80c1370 R09: 000055fdb80c14e0
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb4781e56b0
R13: 0000000000000008 R14: 00007fb4781e5560 R15: 00007fb4781e56b0

Dead inside the shrinker, and very easy to reproduce.

Sultan

* Re: [PATCH 1/1] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-14  8:13   ` [PATCH 1/1] drm/i915: Fix ref->mutex deadlock in i915_active_wait() Chris Wilson
@ 2020-04-14 14:52     ` Sultan Alsawaf
  0 siblings, 0 replies; 16+ messages in thread
From: Sultan Alsawaf @ 2020-04-14 14:52 UTC (permalink / raw)
  To: Chris Wilson; +Cc: dri-devel, David Airlie, intel-gfx, stable, Rodrigo Vivi

On Tue, Apr 14, 2020 at 09:13:28AM +0100, Chris Wilson wrote:
> Quoting Sultan Alsawaf (2020-04-07 07:26:22)
> > From: Sultan Alsawaf <sultan@kerneltoast.com>
> > 
> > The following deadlock exists in i915_active_wait() due to a double lock
> > on ref->mutex (call chain listed in order from top to bottom):
> >  i915_active_wait();
> >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> >  i915_active_request_retire();
> >  node_retire();
> >  active_retire();
> >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > 
> > Fix the deadlock by skipping the second ref->mutex lock when
> > active_retire() is called through i915_active_request_retire().
> > 
> > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > Cc: <stable@vger.kernel.org> # 5.4.x
> > Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> 
> Incorrect. 
> 
> You missed that it cannot retire from inside the wait due to the active
> reference held on the i915_active for the wait.
> 
> The only point it can enter retire from inside i915_active_wait() is via
> the terminal __active_retire() which releases the mutex in doing so.
> -Chris

The terminal __active_retire() and rbtree_postorder_for_each_entry_safe() loop
retire different objects, so this isn't true.

Sultan

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-14  8:23             ` Greg KH
@ 2020-04-20  9:02               ` Joonas Lahtinen
  2020-04-20 15:42                 ` Sultan Alsawaf
  0 siblings, 1 reply; 16+ messages in thread
From: Joonas Lahtinen @ 2020-04-20  9:02 UTC (permalink / raw)
  To: Chris Wilson, Greg KH
  Cc: dri-devel, David Airlie, intel-gfx, stable, Rodrigo Vivi, Sultan Alsawaf

Quoting Greg KH (2020-04-14 11:23:44)
> On Tue, Apr 14, 2020 at 09:15:07AM +0100, Chris Wilson wrote:
> > Quoting Greg KH (2020-04-11 12:39:57)
> > > On Fri, Apr 10, 2020 at 07:17:38AM -0700, Sultan Alsawaf wrote:
> > > > On Fri, Apr 10, 2020 at 11:08:38AM +0200, Greg KH wrote:
> > > > > On Tue, Apr 07, 2020 at 12:18:09AM -0700, Sultan Alsawaf wrote:
> > > > > > From: Sultan Alsawaf <sultan@kerneltoast.com>
> > > > > > 
> > > > > > The following deadlock exists in i915_active_wait() due to a double lock
> > > > > > on ref->mutex (call chain listed in order from top to bottom):
> > > > > >  i915_active_wait();
> > > > > >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> > > > > >  i915_active_request_retire();
> > > > > >  node_retire();
> > > > > >  active_retire();
> > > > > >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > > > > > 
> > > > > > Fix the deadlock by skipping the second ref->mutex lock when
> > > > > > active_retire() is called through i915_active_request_retire().
> > > > > > 
> > > > > > Note that this bug only affects 5.4 and has since been fixed in 5.5.
> > > > > > Normally, a backport of the fix from 5.5 would be in order, but the
> > > > > > patch set that fixes this deadlock involves massive changes that are
> > > > > > neither feasible nor desirable for backporting [1][2][3]. Therefore,
> > > > > > this small patch was made to address the deadlock specifically for 5.4.
> > > > > > 
> > > > > > [1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
> > > > > > [2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree")
> > > > > > [3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")
> > > > > > 
> > > > > > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > > > > > Cc: <stable@vger.kernel.org> # 5.4.x
> > > > > > Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/i915_active.c | 27 +++++++++++++++++++++++----
> > > > > >  drivers/gpu/drm/i915/i915_active.h |  4 ++--
> > > > > >  2 files changed, 25 insertions(+), 6 deletions(-)
> > > > > 
> > > > > Now queued up, thanks.
> > > > > 
> > > > > greg k-h
> > > > 
> > > > I'm sorry, I meant the v3 [1]. Apologies for the confusion. The v3 was picked
> > > > into Ubuntu so that's what we're rolling with.
> > > 
> > > Ok, thanks, hopefully now I picked up the right one...
> > 
> > The patch does not fix a deadlock. Greg, this patch is not a backport of
> > a bugfix, why is it in stable?
> 
> Because it says it can't be a backport as the 3 above mentioned patches
> do the same thing instead?

Apologies for jumping late to the thread.

> I will be glad to drop this, but I need some kind of direction here, and
> given that at least one distro is already shipping this, it felt like
> the correct thing to do.
>
> So, what do you want me to do?

I think the patch should be dropped for now before the issue is
properly addressed. Either by backporting the mainline fixes or if
those are too big and there indeed is a smaller alternative patch
that is properly reviewed. But the above patch is not, at least yet.

There is another similar thread where there's jumping to
conclusions and doing ad-hoc patches for already fixed issues:

https://lore.kernel.org/dri-devel/20200414144309.GB2082@sultan-box.localdomain/

I appreciate enthusiasm to provide fixes to i915 but we should
continue to do the regular due diligence to make sure we're properly
fixing bugs in upstream kernels. And when fixing them, to make
sure we're not simply papering over them for a single use case.

It would be preferred to file a bug for the seen issues,
describing how to reproduce them with vanilla upstream kernels:

https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs

That way we have enough information to assess the severity of the
bug and to prioritize the work accordingly.

Best Regards, Joonas

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-20  9:02               ` Joonas Lahtinen
@ 2020-04-20 15:42                 ` Sultan Alsawaf
  2020-04-21  8:04                   ` Joonas Lahtinen
  0 siblings, 1 reply; 16+ messages in thread
From: Sultan Alsawaf @ 2020-04-20 15:42 UTC (permalink / raw)
  To: Joonas Lahtinen
  Cc: dri-devel, David Airlie, Greg KH, intel-gfx, Chris Wilson,
	stable, Rodrigo Vivi

On Mon, Apr 20, 2020 at 12:02:39PM +0300, Joonas Lahtinen wrote:
> I think the patch should be dropped for now before the issue is
> properly addressed. Either by backporting the mainline fixes or if
> those are too big and there indeed is a smaller alternative patch
> that is properly reviewed. But the above patch is not, at least yet.

Why should a fix for a bona-fide issue be dropped due to political reasons? This
doesn't make sense to me. This just hurts miserable i915 users even more. If my
patch is going to be dropped, it should be replaced by a different fix at the
same time.

Also, the mainline fixes just *happen* to fix this deadlock by removing the
mutex lock from the path in question and creating multiple other bugs in the
process that had to be addressed with "Fixes:" commits. The regression potential
was too high to include those patches for a "stable" kernel, so I made this
patch which fixes the issue in the simplest way possible. We put this patch into
Ubuntu now as well, because praying for a response from i915 maintainers while
the 20.04 release was on the horizon was not an option.

> There is another similar thread where there's jumping to
> conclusions and doing ad-hoc patches for already fixed issues:
> 
> https://lore.kernel.org/dri-devel/20200414144309.GB2082@sultan-box.localdomain/

Maybe this wouldn't have happened if I had received a proper response for that
issue on gitlab from the get-go... Instead I got the run-around from Chris
claiming that it wasn't an i915 bug:

https://gitlab.freedesktop.org/drm/intel/issues/1599

> I appreciate enthusiasm to provide fixes to i915 but we should
> continue to do the regular due diligence to make sure we're properly
> fixing bugs in upstream kernels. And when fixing them, to make
> sure we're not simply papering over them for a single use case.
> 
> It would be preferred to file a bug for the seen issues,
> describing how to reproduce them with vanilla upstream kernels:
> 
> https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs

gitlab.freedesktop.org/drm/intel is where bugs go to be neglected, as noted
above. I really see no reason to send anything there anymore, when the vast
majority of community-sourced bug reports go ignored. 

Sultan

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-20 15:42                 ` Sultan Alsawaf
@ 2020-04-21  8:04                   ` Joonas Lahtinen
  2020-04-21 16:38                     ` Sultan Alsawaf
  0 siblings, 1 reply; 16+ messages in thread
From: Joonas Lahtinen @ 2020-04-21  8:04 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: dri-devel, David Airlie, Greg KH, intel-gfx, Chris Wilson,
	stable, Rodrigo Vivi

Quoting Sultan Alsawaf (2020-04-20 18:42:16)
> On Mon, Apr 20, 2020 at 12:02:39PM +0300, Joonas Lahtinen wrote:
> > I think the patch should be dropped for now before the issue is
> > properly addressed. Either by backporting the mainline fixes or if
> > those are too big and there indeed is a smaller alternative patch
> > that is properly reviewed. But the above patch is not, at least yet.
> 
> Why should a fix for a bona-fide issue be dropped due to political reasons? This
> doesn't make sense to me. This just hurts miserable i915 users even more. If my
> patch is going to be dropped, it should be replaced by a different fix at the
> same time.

There's no politics involved. It's all about doing the due diligence
that we're fixing upstream bugs, and we're fixing them in a way that
does not cause regressions to other users.

Without being able to reproduce a bug against vanilla kernel, there's
too high of a risk that the patch that was developed will only work
on the downstream kernel it was developed for. That happens for the
best of the developers, and that is exactly why the process is in
place, to avoid human error. So no politics, just due diligence.

If you could provide bug reproduction instructions by filing a bug,
we can make forward progress in solving this issue. After assessing
the severity of the bug and the amount of users involved, it will
be prioritized accordingly. That is the most efficient way to get
attention to a bug.

> Also, the mainline fixes just *happen* to fix this deadlock by removing the
> mutex lock from the path in question and creating multiple other bugs in the
> process that had to be addressed with "Fixes:" commits. The regression potential
> was too high to include those patches for a "stable" kernel, so I made this
> patch which fixes the issue in the simplest way possible.

The thing is that it may be that the patch fixes the exact issue you
have at hand in the downstream kernel you are testing against. But
in doing so it may as well break other usecases for other users of
vanilla kernel. That is what we're trying to avoid.

With the reproduction instructions, it'll be possible to check which
kernel versions are affected, and after applying a fix to make sure
that the bug is gone from those version. And if the reproduction can
be trivialized to a test, we can introduce a regression check to CI.

A patch that claims to fix a deadlock in upstream kernel should
include that splat from upstream kernel, not a speculated chain.
Again, this is just the regular due diligence, because we have
made errors in the past. It is for those self-made errors we
know not to merge fixes too quickly before we are able to
reproduce the error and make sure it is gone.

It's not about where the patch came from, it's about avoiding
errors.

> We put this patch into
> Ubuntu now as well, because praying for a response from i915 maintainers while
> the 20.04 release was on the horizon was not an option.
> 
> > > There is another similar thread where there's jumping to
> > > conclusions and doing ad-hoc patches for already fixed issues:
> > 
> > https://lore.kernel.org/dri-devel/20200414144309.GB2082@sultan-box.localdomain/
> 
> Maybe this wouldn't have happened if I had received a proper response for that
> issue on gitlab from the get-go... Instead I got the run-around from Chris
> claiming that it wasn't an i915 bug:
> 
> https://gitlab.freedesktop.org/drm/intel/issues/1599
> 
> > I appreciate enthusiasm to provide fixes to i915 but we should
> > > continue to do the regular due diligence to make sure we're properly
> > fixing bugs in upstream kernels. And when fixing them, to make
> > sure we're not simply papering over them for a single use case.
> > 
> > It would be preferred to file a bug for the seen issues,
> > describing how to reproduce them with vanilla upstream kernels:
> > 
> > https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs
> 
> gitlab.freedesktop.org/drm/intel is where bugs go to be neglected, as noted
> above. I really see no reason to send anything there anymore, when the vast
> majority of community-sourced bug reports go ignored.

In the above bug, you claim to be booting vanilla kernel but the splat
clearly says "5.4.28-00007-g64bb42e80256-dirty", so the developer correctly
requested to bisect the error between 5.4.27 and 5.4.28 vanilla kernels, which
you seem to have ignored and simply jumped to provide a patch.

Apologies if it feels like the bugs do not get enough attention, but we
do our best to act on the reported bugs. You can best guarantee that
your bug is getting the attention by providing all the details requested
in the above link.

Without that information, it'll be hard to assess the severity of the
bug. Above bug is missing critical pieces of information which help us
in assessing the severity: 1. Is the bug reproducible on drm-tip?
2. How to reproduce? 3. How often does it reproduce? 4. Which hardware?

If that information is missing, it means that some of our
developers need to find out all those bits of information before
we can even assess the severity of the bug. And as we also have
bugs where the information is present, those are often acted on
first.

Again, no politics involved and no praying needed. We just have a
process to follow to make sure we don't repeat our past mistakes
as it's only humans who work on the bugs. At times it may feel
rigid and not suited for the specific case where you feel there
is a shorter route to produce a fix, but following the bug process
helps us understand the problem and avoid trivial mistakes.

Regards, Joonas

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-21  8:04                   ` Joonas Lahtinen
@ 2020-04-21 16:38                     ` Sultan Alsawaf
  2020-04-21 20:55                       ` Jason A. Donenfeld
  0 siblings, 1 reply; 16+ messages in thread
From: Sultan Alsawaf @ 2020-04-21 16:38 UTC (permalink / raw)
  To: Joonas Lahtinen
  Cc: dri-devel, David Airlie, Greg KH, intel-gfx, Chris Wilson,
	stable, Rodrigo Vivi

On Tue, Apr 21, 2020 at 11:04:13AM +0300, Joonas Lahtinen wrote:
> Quoting Sultan Alsawaf (2020-04-20 18:42:16)
> > On Mon, Apr 20, 2020 at 12:02:39PM +0300, Joonas Lahtinen wrote:
> > > I think the patch should be dropped for now before the issue is
> > > properly addressed. Either by backporting the mainline fixes or if
> > > those are too big and there indeed is a smaller alternative patch
> > > that is properly reviewed. But the above patch is not, at least yet.
> > 
> > Why should a fix for a bona-fide issue be dropped due to political reasons? This
> > doesn't make sense to me. This just hurts miserable i915 users even more. If my
> > patch is going to be dropped, it should be replaced by a different fix at the
> > same time.
> 
> There's no politics involved. It's all about doing the due diligence
> that we're fixing upstream bugs, and we're fixing them in a way that
> does not cause regressions to other users.

Due diligence is the furthest thing from what I'd associate with i915. My
cherryview laptop doesn't work anymore in 5.6 (it gets many GPU hangs; a RedHat
engineer is working on fixing it according to the gitlab issue), my colleagues'
laptops with 4k screens still suffer from GPU hangs (with fully documented
gitlab reports that have gone completely unanswered), and I still get occasional
FIFO underrun messages in dmesg accompanying graphical glitches on my own
laptops. On top of that, PSR doesn't work correctly on any of my laptops, and
it's enabled by default in i915. But when I reported the issue on gitlab, I got
claims that my laptops' panels were bad, and that my reports were "not enough
reason to disable PSR by default." [1]

[1] https://gitlab.freedesktop.org/drm/intel/-/issues/425#note_434182

> Without being able to reproduce a bug against vanilla kernel, there's
> too high of a risk that the patch that was developed will only work
> on the downstream kernel it was developed for. That happens for the
> best of the developers, and that is exactly why the process is in
> place, to avoid human error. So no politics, just due diligence.

You could have reviewed the patch in the same amount of time it took you to
write this email. It's very simple and obvious. That's why this sounds more like
politics to me than "due diligence."

> > Also, the mainline fixes just *happen* to fix this deadlock by removing the
> > mutex lock from the path in question and creating multiple other bugs in the
> > process that had to be addressed with "Fixes:" commits. The regression potential
> > was too high to include those patches for a "stable" kernel, so I made this
> > patch which fixes the issue in the simplest way possible.
> 
> The thing is that it may be that the patch fixes the exact issue you
> have at hand in the downstream kernel you are testing against. But
> in doing so it may as well break other usecases for other users of
> vanilla kernel. That is what we're trying to avoid.

I don't know of any usecase that needs a double mutex lock (except for drm code
that uses mutex_trylock_recursive of course), but alright.

> A patch that claims to fix a deadlock in upstream kernel should
> include that splat from upstream kernel, not a speculated chain.
> Again, this is just the regular due diligence, because we have
> made errors in the past. It is for those self-made errors we
> know not to merge fixes too quickly before we are able to
> reproduce the error and make sure it is gone.

I sent the splat in a previous email. And you lot are all making many errors in
the present, far more so than in the past because i915 used to work at some
point. The sheer number of Chris' commits with subsequent "Fixes:" for his
changes is mind numbing. For Ubuntu, we've had to backport a massive amount of
i915 fixes to our 5.4 kernel thus far, and the major bugs still aren't all
fixed (probably since those bugs still exist in Linus' tree). And we can't
backport all of the relevant fixes we find because they're either contained in
massive refactor commits or they only apply to i915 code that's been massively
refactored since the last kernel release.

> > We put this patch into
> > Ubuntu now as well, because praying for a response from i915 maintainers while
> > the 20.04 release was on the horizon was not an option.
> > 
> > > There is another similar thread where there's jumping to
> > > conclusions and doing ad-hoc patches for already fixed issues:
> > > 
> > > https://lore.kernel.org/dri-devel/20200414144309.GB2082@sultan-box.localdomain/
> > 
> > Maybe this wouldn't have happened if I had received a proper response for that
> > issue on gitlab from the get-go... Instead I got the run-around from Chris
> > claiming that it wasn't an i915 bug:
> > 
> > https://gitlab.freedesktop.org/drm/intel/issues/1599
> > 
> > > I appreciate enthusiasm to provide fixes to i915 but we should
> > > continue to do the regular due diligence to make sure we're properly
> > > fixing bugs in upstream kernels. And when fixing them, to make
> > > sure we're not simply papering over them for a single use case.
> > > 
> > > It would be preferred to file a bug for the seen issues,
> > > describing how to reproduce them with vanilla upstream kernels:
> > > 
> > > https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs
> > 
> > gitlab.freedesktop.org/drm/intel is where bugs go to be neglected, as noted
> > above. I really see no reason to send anything there anymore, when the vast
> > majority of community-sourced bug reports go ignored.
> 
> In the above bug, you claim to be booting vanilla kernel but the splat
> clearly says "5.4.28-00007-g64bb42e80256-dirty", so the developer correctly
> requested to bisect the error between 5.4.27 and 5.4.28 vanilla kernels, which
> you seem to have ignored and simply jumped to provide a patch.

Telling me to bisect between 5.4.27 and 5.4.28 after claiming it's not an i915
bug is a total waste of time. There weren't even any i915 changes in 5.4.28, and
the assumption that it's "not i915" is a load of crap too because the bug in
question had a very sane crash: `ring->vaddr` was clearly zero, like I said in
my description. I "jumped to provide a patch" because we have customers to
answer to when i915 doesn't work, and I wasn't going to waste time running in
circles to appease the maintainer when I could fix the problem myself.

> Apologies if it feels like the bugs do not get enough attention, but we
> do our best to act on the reported bugs. You can best guarantee that
> your bug is getting the attention by providing all the details requested
> in the above link.

Why hasn't this bug gotten any attention:
https://gitlab.freedesktop.org/drm/intel/issues/1412

That seems like a showstopper.

What about this bug?
https://gitlab.freedesktop.org/drm/intel/issues/1605

That's the bug that makes my cherryview machine unusable. A RedHat engineer is
working on fixing that instead of an i915 maintainer... that doesn't seem right.

Someone else reported that bug here:
https://gitlab.freedesktop.org/drm/intel/issues/1585

And Chris gave a very helpful, "Despite all that it is still probably userspace
fouling the memory" in response. Even though the problem doesn't occur in older
kernels.

> Without that information, it'll be hard to assess the severity of the
> bug. Above bug is missing critical pieces of information which help us
> in assessing the severity: 1. Is the bug reproducible on drm-tip?
> 2. How to reproduce? 3. How often does it reproduce? 4. Which hardware?
> 
> If that information is missing, it means that some of our
> developers need to find out all those bits of information before
> we can even assess the severity of the bug. And as we also have
> bugs where the information is present, those are often acted on
> first.

Why not ask for that? I was never asked for that information. A splat should've
been enough to identify the problem anyway.

> Again, no politics involved and no praying needed.

I don't think I'll stop praying for i915 to work based on the current state of
affairs, thanks.

Sultan

* Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()
  2020-04-21 16:38                     ` Sultan Alsawaf
@ 2020-04-21 20:55                       ` Jason A. Donenfeld
  0 siblings, 0 replies; 16+ messages in thread
From: Jason A. Donenfeld @ 2020-04-21 20:55 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: dri-devel, David Airlie, Greg KH, intel-gfx, Chris Wilson,
	stable, Rodrigo Vivi

On Tue, Apr 21, 2020 at 09:38:09AM -0700, Sultan Alsawaf wrote:
> Why hasn't this bug got any attention:
> https://gitlab.freedesktop.org/drm/intel/issues/1412
> 
> That seems like a showstopper.

Indeed, pretty shocking. It's worth mentioning that the reporter of said
bug, after significant time had elapsed, removed i915 from the kernel
entirely and now daily drives the NVIDIA binary blob. Having to fall
back to closed source blobs indicates a pretty awful state of affairs.
It might otherwise be hard to imagine how that could be preferable, but
this situation indicates how.

Joonas - More generally, it seems what's happening now is that i915
driver stability and quality is reaching a real low point. Back when
i915 was the only stable driver in town, the ivory tower attitude of the
maintainers seemed quasi-justifiable. The drivers kept being awesome, so
when they kicked the can back at users or gave them the cold shoulder on
reports, there was never much of an issue, since things always got fixed
in fairly short order. This is no longer the case. Bugs are piling up.
Users are unhappy. So it's only natural that some users will just wind
up fixing it themselves, especially with responses from Intel like "not
guilty!" in response to i915 bug reports. And these users who try to fix
these bugs will do so without access to your proprietary debuggers,
top-secret data sheets, and NDA'd hardware engineers down the hall, and
thus you'll always be able to accuse them of something or another about
"due-diligence", since nobody is better suited than you to find and fix
these issues in the best way possible. But it hasn't been happening in a
satisfactory way from your end, and users need their laptops to actually
work, and so things will gradually get fixed (hopefully, anyhow) through
other means. Even on the stable front, if fixes to bugs are intermingled
into massive refactoring patches, and your team isn't taking charge to
separate them out and send them to stable@, then backporting to stable
kernels is going to result in a lot of custom work from other people. In
other words, you can shun your users' bug reports and fellow developers
and get away with that while your driver quality remains tip-top, but if
you let things fall, as they have as of late, then expect your ivory
tower to be shaken a bit.

end of thread, other threads:[~2020-04-22  6:55 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200407062622.6443-1-sultan@kerneltoast.com>
2020-04-07  6:52 ` [PATCH 0/1] drm/i915: Fix a deadlock that only affects 5.4 Greg KH
     [not found]   ` <20200407071809.3148-1-sultan@kerneltoast.com>
2020-04-10  9:08     ` [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait() Greg KH
2020-04-10 14:15       ` Sultan Alsawaf
2020-04-10 14:17       ` Sultan Alsawaf
2020-04-11 11:39         ` Greg KH
2020-04-14  8:15           ` Chris Wilson
2020-04-14  8:23             ` Greg KH
2020-04-20  9:02               ` Joonas Lahtinen
2020-04-20 15:42                 ` Sultan Alsawaf
2020-04-21  8:04                   ` Joonas Lahtinen
2020-04-21 16:38                     ` Sultan Alsawaf
2020-04-21 20:55                       ` Jason A. Donenfeld
2020-04-14 14:35             ` Sultan Alsawaf
2020-04-10 11:46     ` Patch "drm/i915: Fix ref->mutex deadlock in i915_active_wait()" has been added to the 5.4-stable tree gregkh
     [not found] ` <20200407062622.6443-2-sultan@kerneltoast.com>
2020-04-14  8:13   ` [PATCH 1/1] drm/i915: Fix ref->mutex deadlock in i915_active_wait() Chris Wilson
2020-04-14 14:52     ` Sultan Alsawaf
