* [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
@ 2013-05-24 10:45 Chris Wilson
From: Chris Wilson @ 2013-05-24 10:45 UTC
  To: intel-gfx; +Cc: Daniel Vetter, stable

If reset fails, the GPU is declared wedged. This ideally should never
happen, but very rarely it does. After the GPU is declared wedged, we
must allow userspace to continue to use its mapping of bo in order to
recover its data (and in some cases in order for memory management to
continue unabated). Obviously after the GPU is wedged, no bo are
currently accessed by the GPU and so we can complete any waits or domain
transitions away from the GPU. Currently, we fail this essential task
and instead report EIO and send a SIGBUS to the affected process -
causing major loss of data (by killing X or compiz).

Fixes regression from
commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c [v3.9]
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Nov 15 17:17:22 2012 +0100

    drm/i915: clear up wedged transitions

v2: Add comments.

References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Damien Lespiau <damien.lespiau@intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_gem.c |   33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 44da25e..ac05845 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -95,9 +95,17 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 	if (EXIT_COND)
 		return 0;
 
-	/* GPU is already declared terminally dead, give up. */
+	/* GPU is already declared terminally dead, nothing to wait for.
+	 * Return and let the ioctl continue. If we bail out here, then
+	 * we report EIO back to userspace (or worse SIGBUS through a
+	 * pagefault) when the caller is not necessarily interacting with
+	 * the device but is instead performing memory management. If the
+	 * application does instead want (or need) to submit a GPU
+	 * command, then we will report the hung GPU (EIO) when we try
+	 * to acquire space on the ring.
+	 */
 	if (i915_terminally_wedged(error))
-		return -EIO;
+		return 0;
 
 	/*
 	 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
@@ -109,13 +117,17 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 					       10*HZ);
 	if (ret == 0) {
 		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
-		return -EIO;
-	} else if (ret < 0) {
-		return ret;
-	}
+		/* The impossible happened, mark the device as terminally
+		 * wedged so that we fail quicker next time. If the reset
+		 * does eventually complete, the terminally wedged status
+		 * will be confirmed, or the counter reset.
+		 */
+		atomic_set(&error->reset_counter, I915_WEDGED);
+	} else if (ret > 0)
+		ret = 0;
 #undef EXIT_COND
 
-	return 0;
+	return ret;
 }
 
 int i915_mutex_lock_interruptible(struct drm_device *dev)
@@ -1211,10 +1223,13 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 
 	/* Try to flush the object off the GPU without holding the lock.
 	 * We will repeat the flush holding the lock in the normal manner
-	 * to catch cases where we are gazumped.
+	 * to catch cases where we are gazumped. Also because it is unlocked,
+	 * it is possible for a spurious GPU hang to occur whilst we wait.
+	 * In that event, just continue on and see if it is confirmed by the
+	 * locked wait.
 	 */
 	ret = i915_gem_object_wait_rendering__nonblocking(obj, !write_domain);
-	if (ret)
+	if (ret && ret != -EIO)
 		goto unref;
 
 	if (read_domains & I915_GEM_DOMAIN_GTT) {
-- 
1.7.10.4


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-24  9:03           ` Daniel Vetter
@ 2013-05-24  9:36             ` Chris Wilson
From: Chris Wilson @ 2013-05-24  9:36 UTC
  To: Daniel Vetter; +Cc: intel-gfx

On Fri, May 24, 2013 at 11:03:06AM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 6:23 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> > The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
> > with our accounting. But in general I really prefer an angry user with
> > a stuck-process backtrace to trying to paper over our own
> > shortcomings with hacks. And at least for the gpu hang vs.
> > pageflip/set_base we should now be fairly well-covered with tests. And
> > I haven't yet seen a report indicating that there's another issue
> > looming ...
> >
> > I guess an obnoxious WARN with a return 0; would also work for the
> > time out case. But I prefer to just ditch the timeout.
> 
> Poke ... I still think setting wedged here is a bit risky.

The alternative is that every ioctl then takes 10s. But if we do set
wedged and it recovers, it is reset to 0. If the reset fails, it is also
set to wedged. It's a fugly thing to do, but at that level of paranoia,
I think it is the right thing to do.
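
A rough model of those semantics, for illustration only -- the flag value
and the helper names below are assumptions, not the driver's actual
definitions:

#include <stdatomic.h>
#include <stdbool.h>

/* Toy model: bit 31 of the counter marks the device terminally wedged. */
#define WEDGED (1u << 31)

static atomic_uint reset_counter;

static bool terminally_wedged(void)
{
	return atomic_load(&reset_counter) & WEDGED;
}

/* wait_for_error timing out: assume the worst so later ioctls fail fast. */
static void mark_wedged_on_timeout(void)
{
	atomic_store(&reset_counter, WEDGED);
}

/* The reset handler then either clears the wedge (recovery) or confirms it. */
static void reset_complete(bool success)
{
	atomic_store(&reset_counter, success ? 0 : WEDGED);
}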

> And maybe
> add a comment why wait_for_error eats the -EIO (and why
> intel_ring_begin does not eat the -EIO)?

Sure. The patch today is slightly different anyway...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 16:23         ` Daniel Vetter
@ 2013-05-24  9:03           ` Daniel Vetter
  2013-05-24  9:36             ` Chris Wilson
From: Daniel Vetter @ 2013-05-24  9:03 UTC
  To: Chris Wilson, Daniel Vetter, intel-gfx

On Wed, May 1, 2013 at 6:23 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, May 1, 2013 at 3:01 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>>> >> > must allow userspace to continue to use its mapping of bo in order to
>>> >> > recover its data (and in some cases in order for memory management to
>>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>>> >> > currently accessed by the GPU and so we can complete any waits or domain
>>> >> > transitions away from the GPU. Currently, we fail this essential task
>>> >> > and instead report EIO and send a SIGBUS to the affected process -
>>> >> > causing major loss of data (by killing X or compiz).
>>> >> >
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>>> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> >>
>>> >> So I've read again through the reset code and I still don't see how
>>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>>> >> -EIO eating after wait_rendering looks really suspicious to me.
>>> >
>>> > That's always been a layer of defense against the driver. Alternatively,
>>> > we can throw a warn into the wait as we shouldn't enter there with
>>> > a seqno and a wedged GPU.
>>>
>>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>>> outstanding request while the gpu is wedged (ignoring dangerous changes
>>> to the wedged state through debugfs).
>>
>> Except it is only true for a locked wait. :(
>
> Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
> least nonblocking) waits. We'd get that by simply dropping all the
> check_wedge calls from the wait functions. That would leave us with a
> check_wedge in throttle and intel_ring_begin, which I think are the
> two places we really want that.
>
>>> >> Now the other thing is i915_gem_object_wait_rendering, that thing loves
>>> >> to throw an -EIO at us. And on a quick check your patch misses the one
>>> >> in set_domain_ioctl. We probably need to do the same with
>>> >> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
>>> >> or similar to explicitly annotate the few places we don't want to hear
>>> >> about a dead gpu?
>>> >
>>> > In fact, it turns out that we hardly ever want
>>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>>> > that EIO was only ever returned when we tried to issue a command on the
>>> > GPU whilst it was terminally wedged - in order to preserve data
>>> > integrity.
>>>
>>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>>> userspace has no idea ever that something is amiss ...
>>
>> It will catch the EIO as soon as it tries to emit the command, but
>> adding the explicit check will save a ton of work in reserving buffers.
>
> Right, I've forgotten about the check in intel_ring_begin.
>
>>> Otherwise I think this approach would also make sense, and feels more
>>> future-proof. Applications that want real robustness simply need to
>>> query the kernel through the new interface whether anything ugly has
>>> happened to them. And if we want more synchronous feedback we can
>>> always improve wait/set_domain_ioctl to check whether the given bo
>>> suffered through a little calamity.
>>
>> The only explicit point we have to remember is to preserve the
>> throttle() semantics of reporting EIO when wedged.
>
> I think plugging the leak in execbuf is still good to avoid queueing
> up tons of crap (since we really don't want to do that). But I think
> with the -EIO check in intel_ring_begin we are covered.
>
>>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> > index 2bd8d7a..f243e32 100644
>>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >
>>> >         /* GPU is already declared terminally dead, give up. */
>>> >         if (i915_terminally_wedged(error))
>>> > -               return -EIO;
>>> > +               return 0;
>>> >
>>> >         /*
>>> >          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >                                                10*HZ);
>>> >         if (ret == 0) {
>>> >                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>>> > -               return -EIO;
>>> > -       } else if (ret < 0) {
>>> > -               return ret;
>>> > -       }
>>> > +               atomic_set(&error->reset_counter, I915_WEDGED);
>>> > +       } else if (ret > 0)
>>> > +               ret = 0;
>>>
>>> Less convinced about this hunk here. I think the right approach would
>>> be to simply kill the timeout. Unlucky users can always get out of
>>> here with a signal, and it's imo more robust if we have all relevant
>>> timeouts for the reset process in the reset work. Separate patch
>>> though. Also we now have quite good coverage of all the reset handling
>>> in igt, so I don't think we need to go overboard in defending against
>>> our own logic screw-ups ... Reset really should work nowadays.
>>
>> Except that #63291 said otherwise (iirc)... I think several layers of
>> paranoia over handling GPU and driver hangs are justified.
>
> The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
> with our accounting. But in general I really prefer an angry user with
> a stuck-process backtrace to trying to paper over our own
> shortcomings with hacks. And at least for the gpu hang vs.
> pageflip/set_base we should now be fairly well-covered with tests. And
> I haven't yet seen a report indicating that there's another issue
> looming ...
>
> I guess an obnoxious WARN with a return 0; would also work for the
> time out case. But I prefer to just ditch the timeout.

Poke ... I still think setting wedged here is a bit risky. And maybe
add a comment why wait_for_error eats the -EIO (and why
intel_ring_begin does not eat the -EIO)?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 13:01       ` Chris Wilson
@ 2013-05-01 16:23         ` Daniel Vetter
  2013-05-24  9:03           ` Daniel Vetter
From: Daniel Vetter @ 2013-05-01 16:23 UTC
  To: Chris Wilson, Daniel Vetter, intel-gfx

On Wed, May 1, 2013 at 3:01 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>> >> > must allow userspace to continue to use its mapping of bo in order to
>> >> > recover its data (and in some cases in order for memory management to
>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>> >> > currently accessed by the GPU and so we can complete any waits or domain
>> >> > transitions away from the GPU. Currently, we fail this essential task
>> >> > and instead report EIO and send a SIGBUS to the affected process -
>> >> > causing major loss of data (by killing X or compiz).
>> >> >
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> >>
>> >> So I've read again through the reset code and I still don't see how
>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>> >> -EIO eating after wait_rendering looks really suspicious to me.
>> >
>> > That's always been a layer of defense against the driver. Alternatively,
>> > we can throw a warn into the wait as we shouldn't enter there with
>> > a seqno and a wedged GPU.
>>
>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>> outstanding request while the gpu is wedged (ignoring dangerous changes
>> to the wedged state through debugfs).
>
> Except it is only true for a locked wait. :(

Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
least nonblocking) waits. We'd get that by simply dropping all the
check_wedge calls from the wait functions. That would leave us with a
check_wedge in throttle and intel_ring_begin, which I think are the
two places we really want that.
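
As a minimal sketch of that effect, done at the call site rather than by
dropping the check_wedge calls inside the waits themselves (the wrapper
name is an assumption, not code from the patch):

static int
wait_rendering_restartable(struct drm_i915_gem_object *obj, bool readonly)
{
	int ret = i915_gem_object_wait_rendering__nonblocking(obj, readonly);

	/* A dead GPU is not this ioctl's problem: ask userspace to restart
	 * the call and leave the -EIO reporting to intel_ring_begin. */
	if (ret == -EIO)
		ret = -ERESTARTSYS;
	return ret;
}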

>> >> Now the other thing is i915_gem_object_wait_rendering, that thing loves
>> >> to throw an -EIO at us. And on a quick check your patch misses the one
>> >> in set_domain_ioctl. We probably need to do the same with
>> >> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
>> >> or similar to explicitly annotate the few places we don't want to hear
>> >> about a dead gpu?
>> >
>> > In fact, it turns out that we hardly ever want
>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>> > that EIO was only ever returned when we tried to issue a command on the
>> > GPU whilst it was terminally wedged - in order to preserve data
>> > integrity.
>>
>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>> userspace has no idea ever that something is amiss ...
>
> It will catch the EIO as soon as it tries to emit the command, but
> adding the explicit check will save a ton of work in reserving buffers.

Right, I've forgotten about the check in intel_ring_begin.

>> Otherwise I think this approach would also make sense, and feels more
>> future-proof. Applications that want real robustness simply need to
>> query the kernel through the new interface whether anything ugly has
>> happened to them. And if we want more synchronous feedback we can
>> always improve wait/set_domain_ioctl to check whether the given bo
>> suffered through a little calamity.
>
> The only explicit point we have to remember is to preserve the
> throttle() semantics of reporting EIO when wedged.

I think plugging the leak in execbuf is still good to avoid queueing
up tons of crap (since we really don't want to do that). But I think
with the -EIO check in intel_ring_begin we are covered.
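
For illustration, that execbuf plug could be as small as a helper like the
one below, called before the buffer reservation loop (the helper name and
the placement are assumptions):

static int eb_check_wedged(struct drm_device *dev)
{
	struct drm_i915_private *dev_priv = dev->dev_private;

	/* Don't bother reserving buffers for a command we can never run;
	 * intel_ring_begin stays the authoritative source of -EIO. */
	if (i915_terminally_wedged(&dev_priv->gpu_error))
		return -EIO;

	return 0;
}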

>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> > index 2bd8d7a..f243e32 100644
>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >
>> >         /* GPU is already declared terminally dead, give up. */
>> >         if (i915_terminally_wedged(error))
>> > -               return -EIO;
>> > +               return 0;
>> >
>> >         /*
>> >          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >                                                10*HZ);
>> >         if (ret == 0) {
>> >                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>> > -               return -EIO;
>> > -       } else if (ret < 0) {
>> > -               return ret;
>> > -       }
>> > +               atomic_set(&error->reset_counter, I915_WEDGED);
>> > +       } else if (ret > 0)
>> > +               ret = 0;
>>
>> Less convinced about this hunk here. I think the right approach would
>> be to simply kill the timeout. Unlucky users can always get out of
>> here with a signal, and it's imo more robust if we have all relevant
>> timeouts for the reset process in the reset work. Separate patch
>> though. Also we now have quite good coverage of all the reset handling
>> in igt, so I don't think we need to go overboard in defending against
>> our own logic screw-ups ... Reset really should work nowadays.
>
> Except that #63291 said otherwise (iirc)... I think several layers of
> paranoia over handling GPU and driver hangs are justified.

The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
with our accounting. But in general I really prefer an angry user with
a stuck-process backtrace to trying to paper over our own
shortcomings with hacks. And at least for the gpu hang vs.
pageflip/set_base we should now be fairly well-covered with tests. And
I haven't yet seen a report indicating that there's another issue
looming ...

I guess an obnoxious WARN with a return 0; would also work for the
time out case. But I prefer to just ditch the timeout.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 12:40     ` Daniel Vetter
@ 2013-05-01 13:01       ` Chris Wilson
  2013-05-01 16:23         ` Daniel Vetter
From: Chris Wilson @ 2013-05-01 13:01 UTC
  To: Daniel Vetter; +Cc: intel-gfx

On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> > If reset fails, the GPU is declared wedged. This ideally should never
> >> > happen, but very rarely it does. After the GPU is declared wedged, we
> >> > must allow userspace to continue to use its mapping of bo in order to
> >> > recover its data (and in some cases in order for memory management to
> >> > continue unabated). Obviously after the GPU is wedged, no bo are
> >> > currently accessed by the GPU and so we can complete any waits or domain
> >> > transitions away from the GPU. Currently, we fail this essential task
> >> > and instead report EIO and send a SIGBUS to the affected process -
> >> > causing major loss of data (by killing X or compiz).
> >> >
> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>
> >> So I've read again through the reset code and I still don't see how
> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
> >> -EIO eating after wait_rendering looks really suspicious to me.
> >
> > That's always been a layer of defense against the driver. Alternatively,
> > we can throw a warn into the wait as we shouldn't enter there with
> > a seqno and a wedged GPU.
> 
> Yeah, that sounds like a good idea. We really shouldn't ever have an
> outstanding request while the gpu is wedged (ignoring dangerous changes
> to the wedged state through debugfs).

Except it is only true for a locked wait. :(

> >> Now the other thing is i915_gem_object_wait_rendering, that thing loves
> >> to throw an -EIO at us. And on a quick check your patch misses the one
> >> in set_domain_ioctl. We probably need to do the same with
> >> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
> >> or similar to explicitly annotate the few places we don't want to hear
> >> about a dead gpu?
> >
> > In fact, it turns out that we hardly ever want
> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
> > that EIO was only ever returned when we tried to issue a command on the
> > GPU whilst it was terminally wedged - in order to preserve data
> > integrity.
> 
> Don't we need to re-add a check in execbuf then? Otherwise I guess
> userspace has no idea ever that something is amiss ...

It will catch the EIO as soon as it tries to emit the command, but
adding the explicit check will save a ton of work in reserving buffers.
 
> Otherwise I think this approach would also make sense, and feels more
> future-proof. Applications that want real robustness simply need to
> query the kernel through the new interface whether anything ugly has
> happened to them. And if we want more synchronous feedback we can
> always improve wait/set_domain_ioctl to check whether the given bo
> suffered through a little calamity.

The only explicit point we have to remember is to preserve the
throttle() semantics of reporting EIO when wedged.
 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 2bd8d7a..f243e32 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >
> >         /* GPU is already declared terminally dead, give up. */
> >         if (i915_terminally_wedged(error))
> > -               return -EIO;
> > +               return 0;
> >
> >         /*
> >          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >                                                10*HZ);
> >         if (ret == 0) {
> >                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> > -               return -EIO;
> > -       } else if (ret < 0) {
> > -               return ret;
> > -       }
> > +               atomic_set(&error->reset_counter, I915_WEDGED);
> > +       } else if (ret > 0)
> > +               ret = 0;
> 
> Less convinced about this hunk here. I think the right approach would
> be to simply kill the timeout. Unlucky users can always get out of
> here with a signal, and it's imo more robust if we have all relevant
> timeouts for the reset process in the reset work. Separate patch
> though. Also we now have quite good coverage of all the reset handling
> in igt, so I don't think we need to go overboard in defending against
> our own logic screw-ups ... Reset really should work nowadays.

Except that #63291 said otherwise (iirc)... I think several layers of
paranoia over handling GPU and driver hangs are justified.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 11:06   ` Chris Wilson
@ 2013-05-01 12:40     ` Daniel Vetter
  2013-05-01 13:01       ` Chris Wilson
From: Daniel Vetter @ 2013-05-01 12:40 UTC
  To: Chris Wilson, Daniel Vetter, intel-gfx

On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> > If reset fails, the GPU is declared wedged. This ideally should never
>> > happen, but very rarely it does. After the GPU is declared wedged, we
>> > must allow userspace to continue to use its mapping of bo in order to
>> > recover its data (and in some cases in order for memory management to
>> > continue unabated). Obviously after the GPU is wedged, no bo are
>> > currently accessed by the GPU and so we can complete any waits or domain
>> > transitions away from the GPU. Currently, we fail this essential task
>> > and instead report EIO and send a SIGBUS to the affected process -
>> > causing major loss of data (by killing X or compiz).
>> >
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>
>> So I've read again through the reset code and I still don't see how
>> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>> -EIO eating after wait_rendering looks really suspicious to me.
>
> That's always been a layer of defense against the driver. Alternatively,
> we can throw a warn into the wait as we shouldn't enter there with
> a seqno and a wedged GPU.

Yeah, that sounds like a good idea. We really shouldn't ever have an
outstanding request while the gpu is wedged (ignoring dangerous changes
to the wedged state through debugfs).

>> Now the other thing is i915_gem_object_wait_rendering, that thing loves
>> to throw an -EIO at us. And on a quick check your patch misses the one
>> in set_domain_ioctl. We probably need to do the same with
>> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
>> or similar to explicitly annotate the few places we don't want to hear
>> about a dead gpu?
>
> In fact, it turns out that we hardly ever want
> i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
> that EIO was only ever returned when we tried to issue a command on the
> GPU whilst it was terminally wedged - in order to preserve data
> integrity.

Don't we need to re-add a check in execbuf then? Otherwise I guess
userspace has no idea ever that something is amiss ...

Otherwise I think this approach would also make sense, and feels more
future-proof. Applications that want real robustness simply need to
query the kernel through the new interface whether anything ugly has
happened to them. And if we want more synchronous feedback we can
always improve wait/set_domain_ioctl to check whether the given bo
suffered through a little calamity.

> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 2bd8d7a..f243e32 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>
>         /* GPU is already declared terminally dead, give up. */
>         if (i915_terminally_wedged(error))
> -               return -EIO;
> +               return 0;
>
>         /*
>          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
> @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>                                                10*HZ);
>         if (ret == 0) {
>                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> -               return -EIO;
> -       } else if (ret < 0) {
> -               return ret;
> -       }
> +               atomic_set(&error->reset_counter, I915_WEDGED);
> +       } else if (ret > 0)
> +               ret = 0;

Less convinced about this hunk here. I think the right approach would
be to simply kill the timeout. Unlucky users can always get out of
here with a signal, and it's imo more robust if we have all relevant
timeouts for the reset process in the reset work. Separate patch
though. Also we now have quite good coverage of all the reset handling
in igt, so I don't think we need to go overboard in defending against
our own logic screw-ups ... Reset really should work nowadays.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 10:38 ` Daniel Vetter
@ 2013-05-01 11:06   ` Chris Wilson
  2013-05-01 12:40     ` Daniel Vetter
From: Chris Wilson @ 2013-05-01 11:06 UTC
  To: Daniel Vetter; +Cc: intel-gfx

On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > If reset fails, the GPU is declared wedged. This ideally should never
> > happen, but very rarely it does. After the GPU is declared wedged, we
> > must allow userspace to continue to use its mapping of bo in order to
> > recover its data (and in some cases in order for memory management to
> > continue unabated). Obviously after the GPU is wedged, no bo are
> > currently accessed by the GPU and so we can complete any waits or domain
> > transitions away from the GPU. Currently, we fail this essential task
> > and instead report EIO and send a SIGBUS to the affected process -
> > causing major loss of data (by killing X or compiz).
> >
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> So I've read again through the reset code and I still don't see how
> wait_rendering can ever give us -EIO once the gpu is dead. So all the
> -EIO eating after wait_rendering looks really suspicious to me.

That's always been a layer of defense against the driver. Alternatively,
we can throw a warn into the wait as we shouldn't enter there with
a seqno and a wedged GPU.
 
> Now the other thing is i915_gem_object_wait_rendering, that thing loves
> to throw an -EIO at us. And on a quick check your patch misses the one
> in set_domain_ioctl. We probably need to do the same with
> sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
> or similar to explicitly annotate the few places we don't want to hear
> about a dead gpu?

In fact, it turns out that we hardly ever want
i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
that EIO was only ever returned when we tried to issue a command on the
GPU whilst it was terminally wedged - in order to preserve data
integrity.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 2bd8d7a..f243e32 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 
        /* GPU is already declared terminally dead, give up. */
        if (i915_terminally_wedged(error))
-               return -EIO;
+               return 0;
 
        /*
         * Only wait 10 seconds for the gpu reset to complete to avoid hanging
@@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
                                               10*HZ);
        if (ret == 0) {
                DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
-               return -EIO;
-       } else if (ret < 0) {
-               return ret;
-       }
+               atomic_set(&error->reset_counter, I915_WEDGED);
+       } else if (ret > 0)
+               ret = 0;
 #undef EXIT_COND
 
-       return 0;
+       return ret;
 }
 
 int i915_mutex_lock_interruptible(struct drm_device *dev)

 
> And if the chances of us breaking bo waiting are too high we can
> always add a few crazy igts which manually wedge the gpu to test them
> and ensure they all work.

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
  2013-05-01 10:25 Chris Wilson
@ 2013-05-01 10:38 ` Daniel Vetter
  2013-05-01 11:06   ` Chris Wilson
From: Daniel Vetter @ 2013-05-01 10:38 UTC
  To: Chris Wilson; +Cc: intel-gfx

On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> If reset fails, the GPU is declared wedged. This ideally should never
> happen, but very rarely it does. After the GPU is declared wedged, we
> must allow userspace to continue to use its mapping of bo in order to
> recover its data (and in some cases in order for memory management to
> continue unabated). Obviously after the GPU is wedged, no bo are
> currently accessed by the GPU and so we can complete any waits or domain
> transitions away from the GPU. Currently, we fail this essential task
> and instead report EIO and send a SIGBUS to the affected process -
> causing major loss of data (by killing X or compiz).
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

So I've read again through the reset code and I still don't see how
wait_rendering can ever give us -EIO once the gpu is dead. So all the
-EIO eating after wait_rendering looks really suspicious to me.

Now the other thing is i915_gem_object_wait_rendering, that thing loves
to throw an -EIO at us. And on a quick check your patch misses the one
in set_domain_ioctl. We probably need to do the same with
sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
or similar to explicitly annotate the few places we don't want to hear
about a dead gpu?
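
For concreteness, one possible shape of that helper -- the name is just the
placeholder above, and the body is an assumption about how it might differ
from i915_mutex_lock_interruptible:

int i915_mutex_lock_interruptible_no_eio(struct drm_device *dev)
{
	struct drm_i915_private *dev_priv = dev->dev_private;
	int ret;

	/* Still wait for a reset in progress, but swallow the terminal
	 * -EIO so memory management paths never see a dead GPU here. */
	ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
	if (ret && ret != -EIO)
		return ret;

	return mutex_lock_interruptible(&dev->struct_mutex);
}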

And if the chances of us breaking bo waiting are too high we can
always add a few crazy igts which manually wedge the gpu to test them
and ensure they all work.
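
Such a test could force the wedge with something as small as the sketch
below, assuming the i915_wedged debugfs file at its usual path (the path
and the written value are assumptions, not igt library code):

#include <stdio.h>

static int wedge_gpu(void)
{
	/* Writing a non-zero value asks the driver to treat the GPU as hung. */
	FILE *f = fopen("/sys/kernel/debug/dri/0/i915_wedged", "w");

	if (!f)
		return -1;
	fprintf(f, "1\n");
	fclose(f);
	return 0;
}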

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
@ 2013-05-01 10:25 Chris Wilson
  2013-05-01 10:38 ` Daniel Vetter
From: Chris Wilson @ 2013-05-01 10:25 UTC
  To: intel-gfx

If reset fails, the GPU is declared wedged. This ideally should never
happen, but very rarely it does. After the GPU is declared wedged, we
must allow userspace to continue to use its mapping of bo in order to
recover its data (and in some cases in order for memory management to
continue unabated). Obviously after the GPU is wedged, no bo are
currently accessed by the GPU and so we can complete any waits or domain
transitions away from the GPU. Currently, we fail this essential task
and instead report EIO and send a SIGBUS to the affected process -
causing major loss of data (by killing X or compiz).

References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e0c3ada..2bd8d7a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1214,7 +1214,7 @@ i915_gem_set_domain_ioctl(struct drm_device *dev, void *data,
 	 * to catch cases where we are gazumped.
 	 */
 	ret = i915_gem_object_wait_rendering__nonblocking(obj, !write_domain);
-	if (ret)
+	if (ret && ret != -EIO)
 		goto unref;
 
 	if (read_domains & I915_GEM_DOMAIN_GTT) {
@@ -1334,6 +1334,8 @@ int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	bool write = !!(vmf->flags & FAULT_FLAG_WRITE);
 
 	ret = i915_mutex_lock_interruptible(dev);
+	if (ret == -EIO)
+		ret = mutex_lock_interruptible(&dev->struct_mutex);
 	if (ret)
 		goto out;
 
@@ -2732,7 +2734,7 @@ i915_gem_object_wait_fence(struct drm_i915_gem_object *obj)
 {
 	if (obj->last_fenced_seqno) {
 		int ret = i915_wait_seqno(obj->ring, obj->last_fenced_seqno);
-		if (ret)
+		if (ret && ret != -EIO)
 			return ret;
 
 		obj->last_fenced_seqno = 0;
@@ -3141,7 +3143,7 @@ i915_gem_object_set_to_gtt_domain(struct drm_i915_gem_object *obj, bool write)
 		return 0;
 
 	ret = i915_gem_object_wait_rendering(obj, !write);
-	if (ret)
+	if (ret && ret != -EIO)
 		return ret;
 
 	i915_gem_object_flush_cpu_write_domain(obj);
@@ -3382,7 +3384,7 @@ i915_gem_object_finish_gpu(struct drm_i915_gem_object *obj)
 		return 0;
 
 	ret = i915_gem_object_wait_rendering(obj, false);
-	if (ret)
+	if (ret && ret != -EIO)
 		return ret;
 
 	/* Ensure that we invalidate the GPU's caches and TLBs. */
@@ -3406,7 +3408,7 @@ i915_gem_object_set_to_cpu_domain(struct drm_i915_gem_object *obj, bool write)
 		return 0;
 
 	ret = i915_gem_object_wait_rendering(obj, !write);
-	if (ret)
+	if (ret && ret != -EIO)
 		return ret;
 
 	i915_gem_object_flush_gtt_write_domain(obj);
-- 
1.7.10.4

