* [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl
@ 2015-11-26 11:34 Daniel Vetter
  2015-11-26 11:34 ` [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast" Daniel Vetter
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Daniel Vetter @ 2015-11-26 11:34 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter, Daniel Vetter

So there are three competing proposals for what wait_ioctl should do
wrt -EIO:

- return -EIO when the gpu is wedged. Not terribly useful for
  userspace since it might race with a hang and then there's no
  guarantee that a subsequent execbuf won't end up in an -EIO.
  A terminal wedge really can only be reliably signalled at execbuf
  time, and userspace needs to cope with that (or decide not to
  bother).

- EIO for any obj that suffered from a reset. This means big internal
  reorganization in the kernel since currently we track reset stats
  per-ctx and not on the obj. That's also what arb robustness wants.
  We could do this, but this feels like new ABI territory with the
  usual userspace requirements and high hurdles.

- No -EIO at all. Consistent with set_domain_ioctl and simplest to
  implement. Which is what this patch does.

We can always opt to change this later on if there's a real need.

To make the test really exercise this, do a full wedged gpu hang to
make sure -EIO doesn't leak out at all.
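The three options amount to a small decision table. A minimal standalone sketch of the proposed contracts (illustrative only — the enum and function names are invented for this example, this is not kernel code):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the three competing wait_ioctl proposals. */
enum eio_policy {
	EIO_ON_WEDGED,    /* proposal 1: -EIO when the gpu is wedged */
	EIO_ON_RESET_OBJ, /* proposal 2: -EIO for any obj hit by a reset */
	EIO_NEVER,        /* proposal 3: no -EIO at all (this patch) */
};

/* Return value a wait would produce under each policy. */
static int wait_ioctl_result(enum eio_policy policy,
			     bool gpu_wedged, bool obj_was_reset)
{
	switch (policy) {
	case EIO_ON_WEDGED:
		return gpu_wedged ? -EIO : 0;
	case EIO_ON_RESET_OBJ:
		return obj_was_reset ? -EIO : 0;
	case EIO_NEVER:
	default:
		return 0; /* consistent with set_domain_ioctl */
	}
}
```

Under the third option, which the patch implements, even a wait on a wedged GPU completes with 0, which is exactly what the updated `igt_assert_eq(__gem_wait(...), 0)` below checks.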

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 tests/gem_eio.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tests/gem_eio.c b/tests/gem_eio.c
index a24c8f1c53b5..8345d1a7a429 100644
--- a/tests/gem_eio.c
+++ b/tests/gem_eio.c
@@ -161,10 +161,14 @@ static void test_wait(int fd)
 {
 	igt_hang_ring_t hang;
 
+	igt_require(i915_reset_control(false));
+
 	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
-	igt_assert_eq(__gem_wait(fd, hang.handle, -1), -EIO);
+	igt_assert_eq(__gem_wait(fd, hang.handle, -1), 0);
 	igt_post_hang_ring(fd, hang);
 
+	igt_assert(i915_reset_control(true));
+
 	trigger_reset(fd);
 }
 
-- 
2.1.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast"
  2015-11-26 11:34 [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl Daniel Vetter
@ 2015-11-26 11:34 ` Daniel Vetter
  2015-11-26 12:59   ` Chris Wilson
  2015-11-30 10:11 ` [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl Chris Wilson
  2015-12-16 13:48 ` Chris Wilson
  2 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-11-26 11:34 UTC (permalink / raw)
  To: Intel Graphics Development; +Cc: Daniel Vetter, Daniel Vetter

Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
hang, we need to be careful not to run into the "hanging too fast
check":

- don't restore the ban period, but instead keep it at 0.
- make sure we idle the gpu fully before hanging it again (wait
  subtest missed that).

With this, gem_eio now works reliably even when I don't run the
subtests individually.

Of course it's a bit fishy that the default ctx gets blamed for
essentially doing nothing, but until that's figured out in upstream
it's better to make the test work for now.
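For reference, the "hanging too fast" check amounts to banning a context that hangs again within its ban period; keeping the period at 0 disables banning entirely. A toy standalone model of that logic (all names invented for illustration; not the actual i915 implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the "hanging too fast" ban check described above.
 * All names are invented for this example. */
struct ctx_stats {
	long last_hang;  /* time of the previous hang, -1 if none yet */
	int ban_period;  /* seconds; 0 disables banning */
	bool banned;
};

/* A hang is attributed to the context: if a nonzero ban period is in
 * effect and the previous hang was recent, the context gets banned. */
static void record_hang(struct ctx_stats *s, long now)
{
	if (s->ban_period > 0 && s->last_hang >= 0 &&
	    now - s->last_hang < s->ban_period)
		s->banned = true; /* two hangs in quick succession */
	s->last_hang = now;
}
```

With `ban_period` left at 0, repeated simulated hangs in quick succession (as gem_eio now produces) never trip the ban, which is the workaround this patch applies.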

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 tests/gem_eio.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/tests/gem_eio.c b/tests/gem_eio.c
index 8345d1a7a429..6c07d9175b95 100644
--- a/tests/gem_eio.c
+++ b/tests/gem_eio.c
@@ -84,13 +84,17 @@ static void trigger_reset(int fd)
 
 static void wedge_gpu(int fd)
 {
+	igt_hang_ring_t hang;
+
 	/* First idle the GPU then disable GPU resets before injecting a hang */
 	gem_quiescent_gpu(fd);
 
 	igt_require(i915_reset_control(false));
 
 	igt_debug("Wedging GPU by injecting hang\n");
-	igt_post_hang_ring(fd, igt_hang_ring(fd, I915_EXEC_DEFAULT));
+	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
+	hang.ban = 0;
+	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));
 }
@@ -161,10 +165,14 @@ static void test_wait(int fd)
 {
 	igt_hang_ring_t hang;
 
+	/* First idle the GPU then disable GPU resets before injecting a hang */
+	gem_quiescent_gpu(fd);
+
 	igt_require(i915_reset_control(false));
 
 	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
 	igt_assert_eq(__gem_wait(fd, hang.handle, -1), 0);
+	hang.ban = 0;
 	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));
-- 
2.1.0


* Re: [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast"
  2015-11-26 11:34 ` [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast" Daniel Vetter
@ 2015-11-26 12:59   ` Chris Wilson
  2015-11-26 14:46     ` Daniel Vetter
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-11-26 12:59 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development

On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> hang we need to be careful to not run into the "hanging too fast
> check":
> 
> - don't restore the ban period, but instead keep it at 0.
> - make sure we idle the gpu fully before hanging it again (wait
>   subtest missed that).
> 
> With this, gem_eio now works reliably even when I don't run the
> subtests individually.
> 
> Of course it's a bit fishy that the default ctx gets blamed for
> essentially doing nothing, but until that's figured out in upstream
> it's better to make the test work for now.

This used to be reliable. And just disabling all banning in the kernel
forever more is silly.

During igt_post_hang_ring:
1. we wait upon the hanging batch
 - this returns when hangcheck fires
2. reset the ban period to normal
 - this takes mutex_lock_interruptible and so must wait for the reset
   handler to run before it can make the change,
 - ergo the hanging batch never triggers a ban for itself.
 - (a subsequent nonsimulated GPU hang may trigger the ban though)
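The ordering above can be demonstrated with a toy standalone model (all names invented for illustration; this is not IGT or kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the igt_post_hang_ring ordering described above. */
struct ctx_model {
	int ban_period;  /* 0 = banning disabled while the hang runs */
	bool banned;
};

/* Hangcheck fires while the batch is stuck: the context is only
 * banned if a nonzero ban period is in effect at that moment. */
static void hangcheck_fires(struct ctx_model *ctx)
{
	if (ctx->ban_period > 0)
		ctx->banned = true;
}

/* Post-hang teardown: the wait returns only after hangcheck has
 * fired, and only then is the ban period restored (the restore has
 * to take the struct mutex, so it runs after the reset handler). */
static void post_hang_ring(struct ctx_model *ctx, int normal_period)
{
	hangcheck_fires(ctx);            /* 1. wait until hangcheck fires */
	ctx->ban_period = normal_period; /* 2. restore ban period afterwards */
}
```

Because step 2 can only run after the reset handler has completed, the hanging batch itself is judged under a ban period of 0 and never bans its context; only restoring the period *before* the hangcheck (which the IGT helper does not do) would trigger the ban.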

Nak.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast"
  2015-11-26 12:59   ` Chris Wilson
@ 2015-11-26 14:46     ` Daniel Vetter
  2015-11-26 15:34       ` Chris Wilson
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-11-26 14:46 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Intel Graphics Development, Daniel Vetter

On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > hang we need to be careful to not run into the "hanging too fast
> > check":
> > 
> > - don't restore the ban period, but instead keep it at 0.
> > - make sure we idle the gpu fully before hanging it again (wait
> >   subtest missed that).
> > 
> > With this, gem_eio now works reliably even when I don't run the
> > subtests individually.
> > 
> > Of course it's a bit fishy that the default ctx gets blamed for
> > essentially doing nothing, but until that's figured out in upstream
> > it's better to make the test work for now.
> 
> This used to be reliable. And just disabling all banning in the kernel
> forever more is silly.
> 
> During igt_post_hang_ring:
> 1. we wait upon the hanging batch
>  - this returns when hangcheck fires
> 2. reset the ban period to normal
>  - this takes mutex_lock_interruptible and so must wait for the reset
>    handler to run before it can make the change,
>  - ergo the hanging batch never triggers a ban for itself.
>  - (a subsequent nonsimulated GPU hang may trigger the ban though)

This isn't where it dies. It dies when we do the echo 1 > i915_wedged. I
suspect quiescent_gpu or whatever is getting in the way, but I really only
wanted to get things to run first. And since i915_wedged is a developer
feature, and it does work perfectly if you don't intend to reuse contexts
I didn't see any point in first trying to fix it up.

So I still maintain that this is a good enough approach, at least if
there's no obvious fix in-flight already.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast"
  2015-11-26 14:46     ` Daniel Vetter
@ 2015-11-26 15:34       ` Chris Wilson
  2015-11-26 15:51         ` Daniel Vetter
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-11-26 15:34 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development, Daniel Vetter

On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> > On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > hang we need to be careful to not run into the "hanging too fast
> > > check":
> > > 
> > > - don't restore the ban period, but instead keep it at 0.
> > > - make sure we idle the gpu fully before hanging it again (wait
> > >   subtest missed that).
> > > 
> > > With this, gem_eio now works reliably even when I don't run the
> > > subtests individually.
> > > 
> > > Of course it's a bit fishy that the default ctx gets blamed for
> > > essentially doing nothing, but until that's figured out in upstream
> > > it's better to make the test work for now.
> > 
> > This used to be reliable. And just disabling all banning in the kernel
> > forever more is silly.
> > 
> > During igt_post_hang_ring:
> > 1. we wait upon the hanging batch
> >  - this returns when hangcheck fires
> > 2. reset the ban period to normal
> >  - this takes mutex_lock_interruptible and so must wait for the reset
> >    handler to run before it can make the change,
> >  - ergo the hanging batch never triggers a ban for itself.
> >  - (a subsequent nonsimulated GPU hang may trigger the ban though)
> 
> This isn't where it dies. It dies when we do the echo 1 > i915_wedged.

That is not where it dies.

> I suspect quiescent_gpu or whatever is getting in the way, but I really only
> wanted to get things to run first. And since i915_wedged is a developer
> feature, and it does work perfectly if you don't intend to reuse contexts
> I didn't see any point in first trying to fix it up.
> 
> So I still maintain that this is a good enough approach, at least if
> there's no obvious fix in-flight already.

No way. This is a kernel regression since 4.0, having just tested with
v4.0 on snb/ivb/hsw.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast"
  2015-11-26 15:34       ` Chris Wilson
@ 2015-11-26 15:51         ` Daniel Vetter
  2015-11-26 21:10           ` Chris Wilson
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-11-26 15:51 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Daniel Vetter,
	Intel Graphics Development, Daniel Vetter

On Thu, Nov 26, 2015 at 03:34:05PM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> > On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> > > On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > > hang we need to be careful to not run into the "hanging too fast
> > > > check":
> > > > 
> > > > - don't restore the ban period, but instead keep it at 0.
> > > > - make sure we idle the gpu fully before hanging it again (wait
> > > >   subtest missed that).
> > > > 
> > > > With this, gem_eio now works reliably even when I don't run the
> > > > subtests individually.
> > > > 
> > > > Of course it's a bit fishy that the default ctx gets blamed for
> > > > essentially doing nothing, but until that's figured out in upstream
> > > > it's better to make the test work for now.
> > > 
> > > This used to be reliable. And just disabling all banning in the kernel
> > > forever more is silly.
> > > 
> > > During igt_post_hang_ring:
> > > 1. we wait upon the hanging batch
> > >  - this returns when hangcheck fires
> > > 2. reset the ban period to normal
> > >  - this takes mutex_lock_interruptible and so must wait for the reset
> > >    handler to run before it can make the change,
> > >  - ergo the hanging batch never triggers a ban for itself.
> > >  - (a subsequent nonsimulated GPU hang may trigger the ban though)
> > 
> > This isn't where it dies. It dies when we do the echo 1 > i915_wedged.
> 
> That is not where it dies.

Well, at least it happens after we start the hang recovery from i915_wedged.

> > I suspect quiescent_gpu or whatever is getting in the way, but I really only
> > wanted to get things to run first. And since i915_wedged is a developer
> > feature, and it does work perfectly if you don't intend to reuse contexts
> > I didn't see any point in first trying to fix it up.
> > 
> > So I still maintain that this is a good enough approach, at least if
> > there's no obvious fix in-flight already.
> 
> No way. This is a kernel regression since 4.0, having just tested with
> v4.0 on snb/ivb/hsw.

Ok, I didn't realize that. I figured that since i915_wedged will return
-EAGAIN anyway when we are terminally wedged, and that seems to have been
the case ever since we started with reset_counter, this has been broken
forever. I guess I missed something.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast"
  2015-11-26 15:51         ` Daniel Vetter
@ 2015-11-26 21:10           ` Chris Wilson
  2015-11-30  8:25             ` Daniel Vetter
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-11-26 21:10 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development, Daniel Vetter

On Thu, Nov 26, 2015 at 04:51:13PM +0100, Daniel Vetter wrote:
> On Thu, Nov 26, 2015 at 03:34:05PM +0000, Chris Wilson wrote:
> > On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> > > On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> > > > On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > > > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > > > hang we need to be careful to not run into the "hanging too fast
> > > > > check":
> > > > > 
> > > > > - don't restore the ban period, but instead keep it at 0.
> > > > > - make sure we idle the gpu fully before hanging it again (wait
> > > > >   subtest missed that).
> > > > > 
> > > > > With this, gem_eio now works reliably even when I don't run the
> > > > > subtests individually.
> > > > > 
> > > > > Of course it's a bit fishy that the default ctx gets blamed for
> > > > > essentially doing nothing, but until that's figured out in upstream
> > > > > it's better to make the test work for now.
> > > > 
> > > > This used to be reliable. And just disabling all banning in the kernel
> > > > forever more is silly.
> > > > 
> > > > During igt_post_hang_ring:
> > > > 1. we wait upon the hanging batch
> > > >  - this returns when hangcheck fires
> > > > 2. reset the ban period to normal
> > > >  - this takes mutex_lock_interruptible and so must wait for the reset
> > > >    handler to run before it can make the change,
> > > >  - ergo the hanging batch never triggers a ban for itself.
> > > >  - (a subsequent nonsimulated GPU hang may trigger the ban though)
> > > 
> > > This isn't where it dies. It dies when we do the echo 1 > i915_wedged.
> > 
> > That is not where it dies.
> 
> Well, at least it happens after we start the hang recovery from i915_wedged.
> 
> > > I suspect quiescent_gpu or whatever is getting in the way, but I really only
> > > wanted to get things to run first. And since i915_wedged is a developer
> > > feature, and it does work perfectly if you don't intend to reuse contexts
> > > I didn't see any point in first trying to fix it up.
> > > 
> > > So I still maintain that this is a good enough approach, at least if
> > > there's no obvious fix in-flight already.
> > 
> > No way. This is a kernel regression since 4.0, having just tested with
> > v4.0 on snb/ivb/hsw.
> 
> Ok, I didn't realize that. I figured since i915_wedged will return -EAGAIN
> anyway when we are terminally wedged, and that seems to have been the case
> ever since we started with reset_counter this has been broken forever. I
> guess I missed something.

The bug I see is SNB specific, and introduced between v4.0 and v4.1.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast"
  2015-11-26 21:10           ` Chris Wilson
@ 2015-11-30  8:25             ` Daniel Vetter
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel Vetter @ 2015-11-30  8:25 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Daniel Vetter,
	Intel Graphics Development, Daniel Vetter

On Thu, Nov 26, 2015 at 09:10:57PM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 04:51:13PM +0100, Daniel Vetter wrote:
> > On Thu, Nov 26, 2015 at 03:34:05PM +0000, Chris Wilson wrote:
> > > On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> > > > On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> > > > > On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > > > > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > > > > hang we need to be careful to not run into the "hanging too fast
> > > > > > check":
> > > > > > 
> > > > > > - don't restore the ban period, but instead keep it at 0.
> > > > > > - make sure we idle the gpu fully before hanging it again (wait
> > > > > >   subtest missed that).
> > > > > > 
> > > > > > With this, gem_eio now works reliably even when I don't run the
> > > > > > subtests individually.
> > > > > > 
> > > > > > Of course it's a bit fishy that the default ctx gets blamed for
> > > > > > essentially doing nothing, but until that's figured out in upstream
> > > > > > it's better to make the test work for now.
> > > > > 
> > > > > This used to be reliable. And just disabling all banning in the kernel
> > > > > forever more is silly.
> > > > > 
> > > > > During igt_post_hang_ring:
> > > > > 1. we wait upon the hanging batch
> > > > >  - this returns when hangcheck fires
> > > > > 2. reset the ban period to normal
> > > > >  - this takes mutex_lock_interruptible and so must wait for the reset
> > > > >    handler to run before it can make the change,
> > > > >  - ergo the hanging batch never triggers a ban for itself.
> > > > >  - (a subsequent nonsimulated GPU hang may trigger the ban though)
> > > > 
> > > > This isn't where it dies. It dies when we do the echo 1 > i915_wedged.
> > > 
> > > That is not where it dies.
> > 
> > Well, at least it happens after we start the hang recovery from i915_wedged.
> > 
> > > > I suspect quiescent_gpu or whatever is getting in the way, but I really only
> > > > wanted to get things to run first. And since i915_wedged is a developer
> > > > feature, and it does work perfectly if you don't intend to reuse contexts
> > > > I didn't see any point in first trying to fix it up.
> > > > 
> > > > So I still maintain that this is a good enough approach, at least if
> > > > there's no obvious fix in-flight already.
> > > 
> > > No way. This is a kernel regression since 4.0, having just tested with
> > > v4.0 on snb/ivb/hsw.
> > 
> > Ok, I didn't realize that. I figured since i915_wedged will return -EAGAIN
> > anyway when we are terminally wedged, and that seems to have been the case
> > ever since we started with reset_counter this has been broken forever. I
> > guess I missed something.
> 
> The bug I see is SNB specific, and introduced between v4.0 and v4.1.

The irony: I hacked on gem_eio on snb too ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl
  2015-11-26 11:34 [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl Daniel Vetter
  2015-11-26 11:34 ` [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast" Daniel Vetter
@ 2015-11-30 10:11 ` Chris Wilson
  2015-12-01  8:28   ` Daniel Vetter
  2015-12-16 13:48 ` Chris Wilson
  2 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-11-30 10:11 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development

On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> So there's 3 competing proposals for what wait_ioctl should do wrt
> -EIO:
> 
> - return -EIO when the gpu is wedged. Not terribly useful for
>   userspace since it might race with a hang and then there's no
>   guarantee that a subsequent execbuf won't end up in an -EIO.
>   A terminal wedge really can only be reliably signalled at execbuf
>   time, and userspace needs to cope with that (or decide not to
>   bother).
> 
> - EIO for any obj that suffered from a reset. This means big internal
>   reorganization in the kernel since currently we track reset stats
>   per-ctx and not on the obj. That's also what arb robustness wants.
>   We could do this, but this feels like new ABI territory with the
>   usual userspace requirements and high hurdles.
> 
> - No -EIO at all. Consistent with set_domain_ioctl and simplest to
>   implement. Which is what this patch does.

Since no one else is weighing in on the ABI discussion, I'm happy with
losing EIO here. I thought it could be useful, but as no one is using it
or seems likely to start using it, begone.

> We can always opt to change this later on if there's a real need.
> 
> To make the test really exercise this do a full wedged gpu hang, to
> make sure -EIO doesn't leak out at all.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  tests/gem_eio.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> index a24c8f1c53b5..8345d1a7a429 100644
> --- a/tests/gem_eio.c
> +++ b/tests/gem_eio.c
> @@ -161,10 +161,14 @@ static void test_wait(int fd)
>  {
>  	igt_hang_ring_t hang;
>  
> +	igt_require(i915_reset_control(false));

However, this is not required to test the ABI change above as the wait
itself will still hang, whether or not it wedges the GPU.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl
  2015-11-30 10:11 ` [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl Chris Wilson
@ 2015-12-01  8:28   ` Daniel Vetter
  2015-12-01  9:04     ` Chris Wilson
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-12-01  8:28 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Intel Graphics Development, Daniel Vetter

On Mon, Nov 30, 2015 at 10:11:12AM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> > So there's 3 competing proposals for what wait_ioctl should do wrt
> > -EIO:
> > 
> > - return -EIO when the gpu is wedged. Not terribly useful for
> >   userspace since it might race with a hang and then there's no
> >   guarantee that a subsequent execbuf won't end up in an -EIO.
> >   A terminal wedge really can only be reliably signalled at execbuf
> >   time, and userspace needs to cope with that (or decide not to
> >   bother).
> > 
> > - EIO for any obj that suffered from a reset. This means big internal
> >   reorganization in the kernel since currently we track reset stats
> >   per-ctx and not on the obj. That's also what arb robustness wants.
> >   We could do this, but this feels like new ABI territory with the
> >   usual userspace requirements and high hurdles.
> > 
> > - No -EIO at all. Consistent with set_domain_ioctl and simplest to
> >   implement. Which is what this patch does.
> 
> Since no one else is weighing into the ABI discussion, I'm happy with
> losing EIO here. I thought it could be useful, but as no one is using or
> seems likely to start using it, begone.
> 
> > We can always opt to change this later on if there's a real need.
> > 
> > To make the test really exercise this do a full wedged gpu hang, to
> > make sure -EIO doesn't leak out at all.
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  tests/gem_eio.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> > index a24c8f1c53b5..8345d1a7a429 100644
> > --- a/tests/gem_eio.c
> > +++ b/tests/gem_eio.c
> > @@ -161,10 +161,14 @@ static void test_wait(int fd)
> >  {
> >  	igt_hang_ring_t hang;
> >  
> > +	igt_require(i915_reset_control(false));
> 
> However, this is not required to test the ABI change above as the wait
> itself will still hang, whether or not it wedges the GPU.

Yes, it's not strictly required, but without it the testcase is fairly
boring. If we move the check_wedge out of wait_request then a normal gpu
reset would always return 0 (after retrying a few times perhaps), so I
figured testing the wedged case is the only one that's worth it.

Maybe we should duplicate all the subtests and have wedged and non-wedged
cases for all of them? That would imo make more sense.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl
  2015-12-01  8:28   ` Daniel Vetter
@ 2015-12-01  9:04     ` Chris Wilson
  2015-12-01  9:13       ` Daniel Vetter
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-12-01  9:04 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development, Daniel Vetter

On Tue, Dec 01, 2015 at 09:28:08AM +0100, Daniel Vetter wrote:
> On Mon, Nov 30, 2015 at 10:11:12AM +0000, Chris Wilson wrote:
> > On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> > > So there's 3 competing proposals for what wait_ioctl should do wrt
> > > -EIO:
> > > 
> > > - return -EIO when the gpu is wedged. Not terribly useful for
> > >   userspace since it might race with a hang and then there's no
> > >   guarantee that a subsequent execbuf won't end up in an -EIO.
> > >   A terminal wedge really can only be reliably signalled at execbuf
> > >   time, and userspace needs to cope with that (or decide not to
> > >   bother).
> > > 
> > > - EIO for any obj that suffered from a reset. This means big internal
> > >   reorganization in the kernel since currently we track reset stats
> > >   per-ctx and not on the obj. That's also what arb robustness wants.
> > >   We could do this, but this feels like new ABI territory with the
> > >   usual userspace requirements and high hurdles.
> > > 
> > > - No -EIO at all. Consistent with set_domain_ioctl and simplest to
> > >   implement. Which is what this patch does.
> > 
> > Since no one else is weighing into the ABI discussion, I'm happy with
> > losing EIO here. I thought it could be useful, but as no one is using or
> > seems likely to start using it, begone.
> > 
> > > We can always opt to change this later on if there's a real need.
> > > 
> > > To make the test really exercise this do a full wedged gpu hang, to
> > > make sure -EIO doesn't leak out at all.
> > > 
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > ---
> > >  tests/gem_eio.c | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> > > index a24c8f1c53b5..8345d1a7a429 100644
> > > --- a/tests/gem_eio.c
> > > +++ b/tests/gem_eio.c
> > > @@ -161,10 +161,14 @@ static void test_wait(int fd)
> > >  {
> > >  	igt_hang_ring_t hang;
> > >  
> > > +	igt_require(i915_reset_control(false));
> > 
> > However, this is not required to test the ABI change above as the wait
> > itself will still hang, whether or not it wedges the GPU.
> 
> Yes it's not strictly required, but without it the testcase is fairly
> boring. If we move the check_wedge out of wait_request then a normal gpu
> reset would always return 0 (after retrying a few times perhaps), so I
> figured testing the wedged case is the only one that's worth it.

But wedging during the hang is also not interesting as we have no
opportunity to see the reset failure in the test case. Putting the GPU
into the wedged state before the wait should be a trivial test that the
object is idle after the reset.
 
> Maybe we should duplicate all the subtests and have wedged and non-wedged
> cases for all of them? That would imo make more sense.

For the others, what matters is how we handle the GPU being wedged before
we queue an execbuf or throttle. In terms of testing that no error is
reported for the hanging case, we should add tests for set-domain (so that
it is explicit and not reliant on implementation inside lib/), GTT faulting
(that would need both wedged and hanging cases, though we are not likely to
be able to cover all the cases, like waiting on a fence and other secondary
waits) and throttle.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl
  2015-12-01  9:04     ` Chris Wilson
@ 2015-12-01  9:13       ` Daniel Vetter
  2015-12-01  9:20         ` Chris Wilson
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-12-01  9:13 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Daniel Vetter,
	Intel Graphics Development, Daniel Vetter

On Tue, Dec 01, 2015 at 09:04:23AM +0000, Chris Wilson wrote:
> On Tue, Dec 01, 2015 at 09:28:08AM +0100, Daniel Vetter wrote:
> > On Mon, Nov 30, 2015 at 10:11:12AM +0000, Chris Wilson wrote:
> > > On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> > > > So there's 3 competing proposals for what wait_ioctl should do wrt
> > > > -EIO:
> > > > 
> > > > - return -EIO when the gpu is wedged. Not terribly useful for
> > > >   userspace since it might race with a hang and then there's no
> > > >   guarantee that a subsequent execbuf won't end up in an -EIO.
> > > >   A terminal wedge really can only be reliably signalled at execbuf
> > > >   time, and userspace needs to cope with that (or decide not to
> > > >   bother).
> > > > 
> > > > - EIO for any obj that suffered from a reset. This means big internal
> > > >   reorganization in the kernel since currently we track reset stats
> > > >   per-ctx and not on the obj. That's also what arb robustness wants.
> > > >   We could do this, but this feels like new ABI territory with the
> > > >   usual userspace requirements and high hurdles.
> > > > 
> > > > - No -EIO at all. Consistent with set_domain_ioctl and simplest to
> > > >   implement. Which is what this patch does.
> > > 
> > > Since no one else is weighing into the ABI discussion, I'm happy with
> > > losing EIO here. I thought it could be useful, but as no one is using or
> > > seems likely to start using it, begone.
> > > 
> > > > We can always opt to change this later on if there's a real need.
> > > > 
> > > > To make the test really exercise this do a full wedged gpu hang, to
> > > > make sure -EIO doesn't leak out at all.
> > > > 
> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > ---
> > > >  tests/gem_eio.c | 6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> > > > index a24c8f1c53b5..8345d1a7a429 100644
> > > > --- a/tests/gem_eio.c
> > > > +++ b/tests/gem_eio.c
> > > > @@ -161,10 +161,14 @@ static void test_wait(int fd)
> > > >  {
> > > >  	igt_hang_ring_t hang;
> > > >  
> > > > +	igt_require(i915_reset_control(false));
> > > 
> > > However, this is not required to test the ABI change above as the wait
> > > itself will still hang, whether or not it wedges the GPU.
> > 
> > Yes it's not strictly required, but without it the testcase is fairly
> > boring. If we move the check_wedge out of wait_request then a normal gpu
> > reset would always return 0 (after retrying a few times perhaps), so I
> > figured testing the wedged case is the only one that's worth it.
> 
> But wedging during the hang is also not interesting as we have no
> opportunity to see the reset failure in the test case. Putting the GPU
> into the wedged state before the wait should be a trivial test that the
> object is idle after the reset.

Right now (with current kernels) we see an -EIO with this testcase instead
of 0 in the wait. Without disabling reset we see 0 both on fixed and
broken kernels. So I don't really see why not testing this case is a good
idea. It's the one we're currently failing at, and it leaks -EIO to userspace.

Assuming ofc we still go with the "let's curb -EIO except for execbuf"
ABI.

> > Maybe we should dupe the subtests all and have wedged and non-wedged cases
> > for all of them? That would imo make more sense.
> 
> > For the others, what matters is how we handle the GPU being wedged before we
> queue an execbuf or throttling. In terms of testing no error is reported
> for the hanging case, we should add tests for set-domain (so that it is
> explicit and not reliant on implementation inside lib/), GTT faulting
> (that would need both wedged and hanging cases, but we are not likely to
> be able to cover all the cases like waiting on fence and other secondary
> waits) and throttle.

Yeah there's more gaps than just the one above.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl
  2015-12-01  9:13       ` Daniel Vetter
@ 2015-12-01  9:20         ` Chris Wilson
  2015-12-03  8:50           ` Daniel Vetter
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2015-12-01  9:20 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development, Daniel Vetter

On Tue, Dec 01, 2015 at 10:13:10AM +0100, Daniel Vetter wrote:
> On Tue, Dec 01, 2015 at 09:04:23AM +0000, Chris Wilson wrote:
> > On Tue, Dec 01, 2015 at 09:28:08AM +0100, Daniel Vetter wrote:
> > > On Mon, Nov 30, 2015 at 10:11:12AM +0000, Chris Wilson wrote:
> > > > On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> > > > > So there's 3 competing proposals for what wait_ioctl should do wrt
> > > > > -EIO:
> > > > > 
> > > > > - return -EIO when the gpu is wedged. Not terribly useful for
> > > > >   userspace since it might race with a hang and then there's no
> > > > >   guarantee that a subsequent execbuf won't end up in an -EIO.
> > > > >   A terminal wedge really can only be reliably signalled at execbuf
> > > > >   time, and userspace needs to cope with that (or decide not to
> > > > >   bother).
> > > > > 
> > > > > - EIO for any obj that suffered from a reset. This means big internal
> > > > >   reorganization in the kernel since currently we track reset stats
> > > > >   per-ctx and not on the obj. That's also what arb robustness wants.
> > > > >   We could do this, but this feels like new ABI territory with the
> > > > >   usual userspace requirements and high hurdles.
> > > > > 
> > > > > - No -EIO at all. Consistent with set_domain_ioctl and simplest to
> > > > >   implement. Which is what this patch does.
> > > > 
> > > > Since no one else is weighing into the ABI discussion, I'm happy with
> > > > losing EIO here. I thought it could be useful, but as no one is using or
> > > > seems likely to start using it, begone.
> > > > 
> > > > > We can always opt to change this later on if there's a real need.
> > > > > 
> > > > > To make the test really exercise this do a full wedged gpu hang, to
> > > > > make sure -EIO doesn't leak out at all.
> > > > > 
> > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > ---
> > > > >  tests/gem_eio.c | 6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> > > > > index a24c8f1c53b5..8345d1a7a429 100644
> > > > > --- a/tests/gem_eio.c
> > > > > +++ b/tests/gem_eio.c
> > > > > @@ -161,10 +161,14 @@ static void test_wait(int fd)
> > > > >  {
> > > > >  	igt_hang_ring_t hang;
> > > > >  
> > > > > +	igt_require(i915_reset_control(false));
> > > > 
> > > > However, this is not required to test the ABI change above as the wait
> > > > itself will still hang, whether or not it wedges the GPU.
> > > 
> > > Yes it's not strictly required, but without it the testcase is fairly
> > > boring. If we move the check_wedge out of wait_request then a normal gpu
> > > reset would always return 0 (after retrying a few times perhaps), so I
> > > figured testing the wedged case is the only one that's worth it.
> > 
> > But wedging during the hang is also not interesting as we have no
> > opportunity to see the reset failure in the test case. Putting the GPU
> > into the wedged state before the wait should be a trivial test that the
> > object is idle after the reset.
> 
> Right now (with current kernels) we see an -EIO with this testcase instead
> of 0 in the wait. Without disabling reset we see 0 both on fixed and
> broken kernels. So I don't really see why not testing this case is a good
> idea. It's the one we're currently failing at, and it leaks -EIO to userspace.

Because that wasn't the ABI I was trying to test here and expected to
maintain! I was just aiming for covering the ioctl calls with expected
EIO (to do the opposite would need something like trinity or a guided
fuzzer).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl
  2015-12-01  9:20         ` Chris Wilson
@ 2015-12-03  8:50           ` Daniel Vetter
  2015-12-03  9:00             ` Chris Wilson
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2015-12-03  8:50 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Daniel Vetter,
	Intel Graphics Development, Daniel Vetter

On Tue, Dec 01, 2015 at 09:20:02AM +0000, Chris Wilson wrote:
> On Tue, Dec 01, 2015 at 10:13:10AM +0100, Daniel Vetter wrote:
> > On Tue, Dec 01, 2015 at 09:04:23AM +0000, Chris Wilson wrote:
> > > On Tue, Dec 01, 2015 at 09:28:08AM +0100, Daniel Vetter wrote:
> > > > On Mon, Nov 30, 2015 at 10:11:12AM +0000, Chris Wilson wrote:
> > > > > On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> > > > > > So there's 3 competing proposals for what wait_ioctl should do wrt
> > > > > > -EIO:
> > > > > > 
> > > > > > - return -EIO when the gpu is wedged. Not terribly useful for
> > > > > >   userspace since it might race with a hang and then there's no
> > > > > >   guarantee that a subsequent execbuf won't end up in an -EIO.
> > > > > >   A terminal wedge really can only be reliably signalled at execbuf
> > > > > >   time, and userspace needs to cope with that (or decide not to
> > > > > >   bother).
> > > > > > 
> > > > > > - EIO for any obj that suffered from a reset. This means big internal
> > > > > >   reorganization in the kernel since currently we track reset stats
> > > > > >   per-ctx and not on the obj. That's also what arb robustness wants.
> > > > > >   We could do this, but this feels like new ABI territory with the
> > > > > >   usual userspace requirements and high hurdles.
> > > > > > 
> > > > > > - No -EIO at all. Consistent with set_domain_ioctl and simplest to
> > > > > >   implement. Which is what this patch does.
> > > > > 
> > > > > Since no one else is weighing into the ABI discussion, I'm happy with
> > > > > losing EIO here. I thought it could be useful, but as no one is using or
> > > > > seems likely to start using it, begone.
> > > > > 
> > > > > > We can always opt to change this later on if there's a real need.
> > > > > > 
> > > > > > To make the test really exercise this do a full wedged gpu hang, to
> > > > > > make sure -EIO doesn't leak out at all.
> > > > > > 
> > > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > ---
> > > > > >  tests/gem_eio.c | 6 +++++-
> > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> > > > > > index a24c8f1c53b5..8345d1a7a429 100644
> > > > > > --- a/tests/gem_eio.c
> > > > > > +++ b/tests/gem_eio.c
> > > > > > @@ -161,10 +161,14 @@ static void test_wait(int fd)
> > > > > >  {
> > > > > >  	igt_hang_ring_t hang;
> > > > > >  
> > > > > > +	igt_require(i915_reset_control(false));
> > > > > 
> > > > > However, this is not required to test the ABI change above as the wait
> > > > > itself will still hang, whether or not it wedges the GPU.
> > > > 
> > > > Yes it's not strictly required, but without it the testcase is fairly
> > > > boring. If we move the check_wedge out of wait_request then a normal gpu
> > > > reset would always return 0 (after retrying a few times perhaps), so I
> > > > figured testing the wedged case is the only one that's worth it.
> > > 
> > > But wedging during the hang is also not interesting as we have no
> > > opportunity to see the reset failure in the test case. Putting the GPU
> > > into the wedged state before the wait should be a trivial test that the
> > > object is idle after the reset.
> > 
> > Right now (with current kernels) we see an -EIO with this testcase instead
> > of 0 in the wait. Without disabling reset we see 0 both on fixed and
> > broken kernels. So I don't really see why not testing this case is a good
> > idea. It's the one we're currently failing at, and it leaks -EIO to userspace.
> 
> Because that wasn't the ABI I was trying to test here and expected to
> maintain! I was just aiming for covering the ioctl calls with expected
> EIO (to do the opposite would need something like trinity or a guided
> fuzzer).

Well yeah, but I thought we decided to change it? So Ack or not?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl
  2015-12-03  8:50           ` Daniel Vetter
@ 2015-12-03  9:00             ` Chris Wilson
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Wilson @ 2015-12-03  9:00 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development, Daniel Vetter

On Thu, Dec 03, 2015 at 09:50:25AM +0100, Daniel Vetter wrote:
> On Tue, Dec 01, 2015 at 09:20:02AM +0000, Chris Wilson wrote:
> > On Tue, Dec 01, 2015 at 10:13:10AM +0100, Daniel Vetter wrote:
> > > On Tue, Dec 01, 2015 at 09:04:23AM +0000, Chris Wilson wrote:
> > > > On Tue, Dec 01, 2015 at 09:28:08AM +0100, Daniel Vetter wrote:
> > > > > On Mon, Nov 30, 2015 at 10:11:12AM +0000, Chris Wilson wrote:
> > > > > > On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> > > > > > > So there's 3 competing proposals for what wait_ioctl should do wrt
> > > > > > > -EIO:
> > > > > > > 
> > > > > > > - return -EIO when the gpu is wedged. Not terribly useful for
> > > > > > >   userspace since it might race with a hang and then there's no
> > > > > > >   guarantee that a subsequent execbuf won't end up in an -EIO.
> > > > > > >   A terminal wedge really can only be reliably signalled at execbuf
> > > > > > >   time, and userspace needs to cope with that (or decide not to
> > > > > > >   bother).
> > > > > > > 
> > > > > > > - EIO for any obj that suffered from a reset. This means big internal
> > > > > > >   reorganization in the kernel since currently we track reset stats
> > > > > > >   per-ctx and not on the obj. That's also what arb robustness wants.
> > > > > > >   We could do this, but this feels like new ABI territory with the
> > > > > > >   usual userspace requirements and high hurdles.
> > > > > > > 
> > > > > > > - No -EIO at all. Consistent with set_domain_ioctl and simplest to
> > > > > > >   implement. Which is what this patch does.
> > > > > > 
> > > > > > Since no one else is weighing into the ABI discussion, I'm happy with
> > > > > > losing EIO here. I thought it could be useful, but as no one is using or
> > > > > > seems likely to start using it, begone.
> > > > > > 
> > > > > > > We can always opt to change this later on if there's a real need.
> > > > > > > 
> > > > > > > To make the test really exercise this do a full wedged gpu hang, to
> > > > > > > make sure -EIO doesn't leak out at all.
> > > > > > > 
> > > > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > > ---
> > > > > > >  tests/gem_eio.c | 6 +++++-
> > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> > > > > > > index a24c8f1c53b5..8345d1a7a429 100644
> > > > > > > --- a/tests/gem_eio.c
> > > > > > > +++ b/tests/gem_eio.c
> > > > > > > @@ -161,10 +161,14 @@ static void test_wait(int fd)
> > > > > > >  {
> > > > > > >  	igt_hang_ring_t hang;
> > > > > > >  
> > > > > > > +	igt_require(i915_reset_control(false));
> > > > > > 
> > > > > > However, this is not required to test the ABI change above as the wait
> > > > > > itself will still hang, whether or not it wedges the GPU.
> > > > > 
> > > > > Yes it's not strictly required, but without it the testcase is fairly
> > > > > boring. If we move the check_wedge out of wait_request then a normal gpu
> > > > > reset would always return 0 (after retrying a few times perhaps), so I
> > > > > figured testing the wedged case is the only one that's worth it.
> > > > 
> > > > But wedging during the hang is also not interesting as we have no
> > > > opportunity to see the reset failure in the test case. Putting the GPU
> > > > into the wedged state before the wait should be a trivial test that the
> > > > object is idle after the reset.
> > > 
> > > Right now (with current kernels) we see an -EIO with this testcase instead
> > > of 0 in the wait. Without disabling reset we see 0 both on fixed and
> > > broken kernels. So I don't really see why not testing this case is a good
> > > idea. It's the one we're currently failing at, and it leaks -EIO to userspace.
> > 
> > Because that wasn't the ABI I was trying to test here and expected to
> > maintain! I was just aiming for covering the ioctl calls with expected
> > EIO (to do the opposite would need something like trinity or a guided
> > fuzzer).
> 
> Well yeah, but I thought we decided to change it? So Ack or not?

If you change the subtest, it should cover both: the more likely case
where there is a GPU hang whilst we wait for a request, and the second
where the GPU hang leads to a failed reset. There's actually another
pair where the GPU hangs (and the reset succeeds or not) before we call
the wait_ioctl. And then we can use the guts of that function to do the
set-domain and gtt-fault testing.

As for the wait-ioctl reporting 0 if the request is completed due to a
reset, that's an ACK.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl
  2015-11-26 11:34 [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl Daniel Vetter
  2015-11-26 11:34 ` [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast" Daniel Vetter
  2015-11-30 10:11 ` [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl Chris Wilson
@ 2015-12-16 13:48 ` Chris Wilson
  2 siblings, 0 replies; 16+ messages in thread
From: Chris Wilson @ 2015-12-16 13:48 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Intel Graphics Development

On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> So there's 3 competing proposals for what wait_ioctl should do wrt
> -EIO:
> 
> - return -EIO when the gpu is wedged. Not terribly useful for
>   userspace since it might race with a hang and then there's no
>   guarantee that a subsequent execbuf won't end up in an -EIO.
>   A terminal wedge really can only be reliably signalled at execbuf
>   time, and userspace needs to cope with that (or decide not to
>   bother).
> 
> - EIO for any obj that suffered from a reset. This means big internal
>   reorganization in the kernel since currently we track reset stats
>   per-ctx and not on the obj. That's also what arb robustness wants.
>   We could do this, but this feels like new ABI territory with the
>   usual userspace requirements and high hurdles.
> 
> - No -EIO at all. Consistent with set_domain_ioctl and simplest to
>   implement. Which is what this patch does.
> 
> We can always opt to change this later on if there's a real need.
> 
> To make the test really exercise this do a full wedged gpu hang, to
> make sure -EIO doesn't leak out at all.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

I changed the test to cover both i915.reset=true and i915.reset=false
possibilities and pushed.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

end of thread, other threads:[~2015-12-16 13:48 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
2015-11-26 11:34 [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl Daniel Vetter
2015-11-26 11:34 ` [PATCH 2/2] tests/gem_eio: Resilience against "hanging too fast" Daniel Vetter
2015-11-26 12:59   ` Chris Wilson
2015-11-26 14:46     ` Daniel Vetter
2015-11-26 15:34       ` Chris Wilson
2015-11-26 15:51         ` Daniel Vetter
2015-11-26 21:10           ` Chris Wilson
2015-11-30  8:25             ` Daniel Vetter
2015-11-30 10:11 ` [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl Chris Wilson
2015-12-01  8:28   ` Daniel Vetter
2015-12-01  9:04     ` Chris Wilson
2015-12-01  9:13       ` Daniel Vetter
2015-12-01  9:20         ` Chris Wilson
2015-12-03  8:50           ` Daniel Vetter
2015-12-03  9:00             ` Chris Wilson
2015-12-16 13:48 ` Chris Wilson
