* [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
@ 2017-01-03 11:05 Chris Wilson
  2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Chris Wilson @ 2017-01-03 11:05 UTC (permalink / raw)
  To: dri-devel; +Cc: intel-gfx

fence->status is an optional field that may be set before
dma_fence_signal() is called to convey that the fence completed with an
error. We therefore have to ensure it is always zeroed on
initialisation, so that the typical use (i.e. status left unset) always
flags a successful completion.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/dma-buf/dma-fence.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 3444f293ad4a..9130f790ebf3 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
 	fence->context = context;
 	fence->seqno = seqno;
 	fence->flags = 0UL;
+	fence->status = 0;
 
 	trace_dma_fence_init(fence);
 }
-- 
2.11.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

* [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
  2017-01-03 11:05 [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() Chris Wilson
@ 2017-01-03 11:05 ` Chris Wilson
  2017-01-03 11:34   ` Tvrtko Ursulin
  2017-01-03 11:53 ` ✗ Fi.CI.BAT: failure for series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init() Patchwork
  2017-01-03 14:04 ` [PATCH 1/2] " Tvrtko Ursulin
  2 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2017-01-03 11:05 UTC (permalink / raw)
  To: dri-devel; +Cc: Mika Kuoppala, intel-gfx, Tvrtko Ursulin

The struct dma_fence carries a status field exposed to userspace by
sync_file. This is inspected after the fence is signaled and can convey
whether or not the request completed successfully, or in our case if we
detected a hang during the request (signaled via -EIO in
SYNC_IOC_FILE_INFO).

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 204c4a673bf3..bc99c0e292d8 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
 		ring_hung = false;
 	}
 
-	if (ring_hung)
+	if (ring_hung) {
 		i915_gem_context_mark_guilty(request->ctx);
-	else
+		request->fence.status = -EIO;
+	} else {
 		i915_gem_context_mark_innocent(request->ctx);
+	}
 
 	if (!ring_hung)
 		return;
-- 
2.11.0


* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
  2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson
@ 2017-01-03 11:34   ` Tvrtko Ursulin
  2017-01-03 11:46     ` [Intel-gfx] " Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2017-01-03 11:34 UTC (permalink / raw)
  To: Chris Wilson, dri-devel; +Cc: intel-gfx


On 03/01/2017 11:05, Chris Wilson wrote:
> The struct dma_fence carries a status field exposed to userspace by
> sync_file. This is inspected after the fence is signaled and can convey
> whether or not the request completed successfully, or in our case if we
> detected a hang during the request (signaled via -EIO in
> SYNC_IOC_FILE_INFO).
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 204c4a673bf3..bc99c0e292d8 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>  		ring_hung = false;
>  	}
>
> -	if (ring_hung)
> +	if (ring_hung) {
>  		i915_gem_context_mark_guilty(request->ctx);
> -	else
> +		request->fence.status = -EIO;
> +	} else {
>  		i915_gem_context_mark_innocent(request->ctx);
> +	}
>
>  	if (!ring_hung)
>  		return;
>

Reading what happens later in this function, should we set the status of 
all the other requests we are about to clear?

However one thing I don't understand is how this scheme interacts with 
the current userspace. We will clear/no-nop some of the submitted 
requests since the state is corrupt. But how will userspace notice this 
before it submits more requests?

Regards,

Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

* Re: [Intel-gfx] [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
  2017-01-03 11:34   ` Tvrtko Ursulin
@ 2017-01-03 11:46     ` Chris Wilson
  2017-01-03 11:57       ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2017-01-03 11:46 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 11:05, Chris Wilson wrote:
> >The struct dma_fence carries a status field exposed to userspace by
> >sync_file. This is inspected after the fence is signaled and can convey
> >whether or not the request completed successfully, or in our case if we
> >detected a hang during the request (signaled via -EIO in
> >SYNC_IOC_FILE_INFO).
> >
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >---
> > drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >index 204c4a673bf3..bc99c0e292d8 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
> > 		ring_hung = false;
> > 	}
> >
> >-	if (ring_hung)
> >+	if (ring_hung) {
> > 		i915_gem_context_mark_guilty(request->ctx);
> >-	else
> >+		request->fence.status = -EIO;
> >+	} else {
> > 		i915_gem_context_mark_innocent(request->ctx);
> >+	}
> >
> > 	if (!ring_hung)
> > 		return;
> >
> 
> Reading what happens later in this function, should we set the
> status of all the other requests we are about to clear?
> 
> However one thing I don't understand is how this scheme interacts
> with the current userspace. We will clear/no-nop some of the
> submitted requests since the state is corrupt. But how will
> userspace notice this before it submits more requests?

There is no mechanism currently for user space to be able to detect a
hung request. (It can use the uevent for async notification of the
hang/reset, but that will not tell you who caused the hang.) Userspace
can track the number of hangs it caused, but the delay makes any
roundtripping impractical (i.e. you have to synchronise before all
rendering if you must detect the event immediately). Note also that we
do not want to give out interprocess information (i.e. to allow one
client to spy on another), which makes things harder to get right.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* ✗ Fi.CI.BAT: failure for series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init()
  2017-01-03 11:05 [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() Chris Wilson
  2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson
@ 2017-01-03 11:53 ` Patchwork
  2017-01-03 14:04 ` [PATCH 1/2] " Tvrtko Ursulin
  2 siblings, 0 replies; 19+ messages in thread
From: Patchwork @ 2017-01-03 11:53 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init()
URL   : https://patchwork.freedesktop.org/series/17403/
State : failure

== Summary ==

Series 17403v1 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/17403/revisions/1/mbox/

Test gem_busy:
        Subgroup basic-hang-default:
                fail       -> PASS       (fi-hsw-4770r)
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-b:
                pass       -> INCOMPLETE (fi-snb-2600)

fi-bdw-5557u     total:246  pass:232  dwarn:0   dfail:0   fail:0   skip:14 
fi-bsw-n3050     total:246  pass:207  dwarn:0   dfail:0   fail:0   skip:39 
fi-bxt-j4205     total:246  pass:224  dwarn:0   dfail:0   fail:0   skip:22 
fi-bxt-t5700     total:82   pass:69   dwarn:0   dfail:0   fail:0   skip:12 
fi-byt-j1900     total:246  pass:219  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-n2820     total:246  pass:215  dwarn:0   dfail:0   fail:0   skip:31 
fi-hsw-4770      total:246  pass:227  dwarn:0   dfail:0   fail:0   skip:19 
fi-hsw-4770r     total:246  pass:227  dwarn:0   dfail:0   fail:0   skip:19 
fi-ivb-3520m     total:246  pass:225  dwarn:0   dfail:0   fail:0   skip:21 
fi-ivb-3770      total:246  pass:225  dwarn:0   dfail:0   fail:0   skip:21 
fi-kbl-7500u     total:246  pass:225  dwarn:0   dfail:0   fail:0   skip:21 
fi-skl-6260u     total:246  pass:233  dwarn:0   dfail:0   fail:0   skip:13 
fi-skl-6700hq    total:246  pass:226  dwarn:0   dfail:0   fail:0   skip:20 
fi-skl-6700k     total:246  pass:222  dwarn:3   dfail:0   fail:0   skip:21 
fi-skl-6770hq    total:246  pass:233  dwarn:0   dfail:0   fail:0   skip:13 
fi-snb-2520m     total:246  pass:215  dwarn:0   dfail:0   fail:0   skip:31 
fi-snb-2600      total:209  pass:181  dwarn:0   dfail:0   fail:0   skip:27 

c882640c3f6f9571df175774c5609a798e859fe2 drm-tip: 2017y-01m-03d-10h-16m-34s UTC integration manifest
a0752fa drm/i915: Set guilty-flag on fence after detecting a hang
0269bc0 dma-fence: Clear fence->status during dma_fence_init()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3421/

* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
  2017-01-03 11:46     ` [Intel-gfx] " Chris Wilson
@ 2017-01-03 11:57       ` Tvrtko Ursulin
  2017-01-03 12:13         ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2017-01-03 11:57 UTC (permalink / raw)
  To: Chris Wilson, dri-devel, intel-gfx


On 03/01/2017 11:46, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 11:05, Chris Wilson wrote:
>>> The struct dma_fence carries a status field exposed to userspace by
>>> sync_file. This is inspected after the fence is signaled and can convey
>>> whether or not the request completed successfully, or in our case if we
>>> detected a hang during the request (signaled via -EIO in
>>> SYNC_IOC_FILE_INFO).
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 204c4a673bf3..bc99c0e292d8 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>>> 		ring_hung = false;
>>> 	}
>>>
>>> -	if (ring_hung)
>>> +	if (ring_hung) {
>>> 		i915_gem_context_mark_guilty(request->ctx);
>>> -	else
>>> +		request->fence.status = -EIO;
>>> +	} else {
>>> 		i915_gem_context_mark_innocent(request->ctx);
>>> +	}
>>>
>>> 	if (!ring_hung)
>>> 		return;
>>>
>>
>> Reading what happens later in this function, should we set the
>> status of all the other requests we are about to clear?
>>
>> However one thing I don't understand is how this scheme interacts
>> with the current userspace. We will clear/no-nop some of the
>> submitted requests since the state is corrupt. But how will
>> userspace notice this before it submits more requests?
>
> There is no mechanism currently for user space to be able to detect a
> hung request. (It can use the uevent for async notification of the
> hang/reset, but that will not tell you who caused the hang.) Userspace
> can track the number of hangs it caused, but the delay makes any
> roundtripping impractical (i.e. you have to synchronise before all
> rendering if you must detect the event immediately). Note also that we
> do not want to give out interprocess information (i.e. to allow one
> client to spy on another), which makes things harder to get right.

So the idea is to clear already submitted requests _if_ the userspace is 
synchronising before all rendering and looking at reset stats, to make 
it theoretically possible to detect the corrupt state?

Still, with the fences, do you agree the error status needs to be set on 
those as well?

Regards,

Tvrtko







* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
  2017-01-03 11:57       ` Tvrtko Ursulin
@ 2017-01-03 12:13         ` Chris Wilson
  2017-01-03 12:34           ` [Intel-gfx] " Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2017-01-03 12:13 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 11:46, Chris Wilson wrote:
> >On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 03/01/2017 11:05, Chris Wilson wrote:
> >>>The struct dma_fence carries a status field exposed to userspace by
> >>>sync_file. This is inspected after the fence is signaled and can convey
> >>>whether or not the request completed successfully, or in our case if we
> >>>detected a hang during the request (signaled via -EIO in
> >>>SYNC_IOC_FILE_INFO).
> >>>
> >>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >>>---
> >>>drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
> >>>1 file changed, 4 insertions(+), 2 deletions(-)
> >>>
> >>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>index 204c4a673bf3..bc99c0e292d8 100644
> >>>--- a/drivers/gpu/drm/i915/i915_gem.c
> >>>+++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
> >>>		ring_hung = false;
> >>>	}
> >>>
> >>>-	if (ring_hung)
> >>>+	if (ring_hung) {
> >>>		i915_gem_context_mark_guilty(request->ctx);
> >>>-	else
> >>>+		request->fence.status = -EIO;
> >>>+	} else {
> >>>		i915_gem_context_mark_innocent(request->ctx);
> >>>+	}
> >>>
> >>>	if (!ring_hung)
> >>>		return;
> >>>
> >>
> >>Reading what happens later in this function, should we set the
> >>status of all the other requests we are about to clear?
> >>
> >>However one thing I don't understand is how this scheme interacts
> >>with the current userspace. We will clear/no-nop some of the
> >>submitted requests since the state is corrupt. But how will
> >>userspace notice this before it submits more requests?
> >
> >There is no mechanism currently for user space to be able to detect a
> >hung request. (It can use the uevent for async notification of the
> >hang/reset, but that will not tell you who caused the hang.) Userspace
> >can track the number of hangs it caused, but the delay makes any
> >roundtripping impractical (i.e. you have to synchronise before all
> >rendering if you must detect the event immediately). Note also that we
> >do not want to give out interprocess information (i.e. to allow one
> >client to spy on another), which makes things harder to get right.
> 
> So the idea is to clear already submitted requests _if_ the userspace is
> synchronising before all rendering and looking at reset stats, to
> make it theoretically possible to detect the corrupt state?

No, I just don't see a way that userspace can detect the hang without
testing after seeing the request signaled (either by waiting on the
batch or by waiting on the fence), i.e. by being completely synchronous
(or at least choosing its synchronous points very carefully, such as
around IPC). It can either poll reset-count or use sync_file (which
requires fence exporting).

The current robustness interface is a basic query on whether any reset
occurred within the context, not when.
 
> Still, with the fences, do you agree the error status needs to be set
> on those as well?

Yes.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [Intel-gfx] [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
  2017-01-03 12:13         ` Chris Wilson
@ 2017-01-03 12:34           ` Tvrtko Ursulin
  2017-01-03 12:38             ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2017-01-03 12:34 UTC (permalink / raw)
  To: Chris Wilson, dri-devel, intel-gfx


On 03/01/2017 12:13, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 11:46, Chris Wilson wrote:
>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/01/2017 11:05, Chris Wilson wrote:
>>>>> The struct dma_fence carries a status field exposed to userspace by
>>>>> sync_file. This is inspected after the fence is signaled and can convey
>>>>> whether or not the request completed successfully, or in our case if we
>>>>> detected a hang during the request (signaled via -EIO in
>>>>> SYNC_IOC_FILE_INFO).
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>>>> ---
>>>>> drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>> index 204c4a673bf3..bc99c0e292d8 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>>>>> 		ring_hung = false;
>>>>> 	}
>>>>>
>>>>> -	if (ring_hung)
>>>>> +	if (ring_hung) {
>>>>> 		i915_gem_context_mark_guilty(request->ctx);
>>>>> -	else
>>>>> +		request->fence.status = -EIO;
>>>>> +	} else {
>>>>> 		i915_gem_context_mark_innocent(request->ctx);
>>>>> +	}
>>>>>
>>>>> 	if (!ring_hung)
>>>>> 		return;
>>>>>
>>>>
>>>> Reading what happens later in this function, should we set the
>>>> status of all the other requests we are about to clear?
>>>>
>>>> However one thing I don't understand is how this scheme interacts
>>>> with the current userspace. We will clear/no-nop some of the
>>>> submitted requests since the state is corrupt. But how will
>>>> userspace notice this before it submits more requests?
>>>
>>> There is no mechanism currently for user space to be able to detect a
>>> hung request. (It can use the uevent for async notification of the
>>> hang/reset, but that will not tell you who caused the hang.) Userspace
>>> can track the number of hangs it caused, but the delay makes any
>>> roundtripping impractical (i.e. you have to synchronise before all
>>> rendering if you must detect the event immediately). Note also that we
>>> do not want to give out interprocess information (i.e. to allow one
>>> client to spy on another), which makes things harder to get right.
>>
>> So the idea is to clear already submitted requests _if_ the userspace is
>> synchronising before all rendering and looking at reset stats, to
>> make it theoretically possible to detect the corrupt state?
>
> >No, I just don't see a way that userspace can detect the hang without
> >testing after seeing the request signaled (either by waiting on the
> >batch or by waiting on the fence), i.e. by being completely synchronous
> >(or at least choosing its synchronous points very carefully, such as
> >around IPC). It can either poll reset-count or use sync_file (which
> >requires fence exporting).
> >
> >The current robustness interface is a basic query on whether any reset
> >occurred within the context, not when.

Why do we bother with clearing the submitted requests then?

Regards,

Tvrtko

* Re: [Intel-gfx] [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
  2017-01-03 12:34           ` [Intel-gfx] " Tvrtko Ursulin
@ 2017-01-03 12:38             ` Chris Wilson
  2017-01-03 13:17               ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2017-01-03 12:38 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 12:13, Chris Wilson wrote:
> >On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 03/01/2017 11:46, Chris Wilson wrote:
> >>>On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
> >>>>
> >>>>On 03/01/2017 11:05, Chris Wilson wrote:
> >>>>>The struct dma_fence carries a status field exposed to userspace by
> >>>>>sync_file. This is inspected after the fence is signaled and can convey
> >>>>>whether or not the request completed successfully, or in our case if we
> >>>>>detected a hang during the request (signaled via -EIO in
> >>>>>SYNC_IOC_FILE_INFO).
> >>>>>
> >>>>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >>>>>---
> >>>>>drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
> >>>>>1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>>
> >>>>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>index 204c4a673bf3..bc99c0e292d8 100644
> >>>>>--- a/drivers/gpu/drm/i915/i915_gem.c
> >>>>>+++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
> >>>>>		ring_hung = false;
> >>>>>	}
> >>>>>
> >>>>>-	if (ring_hung)
> >>>>>+	if (ring_hung) {
> >>>>>		i915_gem_context_mark_guilty(request->ctx);
> >>>>>-	else
> >>>>>+		request->fence.status = -EIO;
> >>>>>+	} else {
> >>>>>		i915_gem_context_mark_innocent(request->ctx);
> >>>>>+	}
> >>>>>
> >>>>>	if (!ring_hung)
> >>>>>		return;
> >>>>>
> >>>>
> >>>>Reading what happens later in this function, should we set the
> >>>>status of all the other requests we are about to clear?
> >>>>
> >>>>However one thing I don't understand is how this scheme interacts
> >>>>with the current userspace. We will clear/no-nop some of the
> >>>>submitted requests since the state is corrupt. But how will
> >>>>userspace notice this before it submits more requests?
> >>>
> >>>There is no mechanism currently for user space to be able to detect a
> >>>hung request. (It can use the uevent for async notification of the
> >>>hang/reset, but that will not tell you who caused the hang.) Userspace
> >>>can track the number of hangs it caused, but the delay makes any
> >>>roundtripping impractical (i.e. you have to synchronise before all
> >>>rendering if you must detect the event immediately). Note also that we
> >>>do not want to give out interprocess information (i.e. to allow one
> >>>client to spy on another), which makes things harder to get right.
> >>
> >>So the idea is to clear already submitted requests _if_ the userspace is
> >>synchronising before all rendering and looking at reset stats, to
> >>make it theoretically possible to detect the corrupt state?
> >
> >>No, I just don't see a way that userspace can detect the hang without
> >>testing after seeing the request signaled (either by waiting on the
> >>batch or by waiting on the fence), i.e. by being completely synchronous
> >>(or at least choosing its synchronous points very carefully, such as
> >>around IPC). It can either poll reset-count or use sync_file (which
> >>requires fence exporting).
> >>
> >>The current robustness interface is a basic query on whether any reset
> >>occurred within the context, not when.
> 
> Why do we bother with clearing the submitted requests then?

The same reason we ban processes from submitting new requests if they
cause repeated hangs. If before we ban that client, it has already
submitted 1000 hanging requests, it has successfully locked the machine
up for a couple of hours.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
  2017-01-03 12:38             ` Chris Wilson
@ 2017-01-03 13:17               ` Tvrtko Ursulin
  2017-01-03 13:25                 ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2017-01-03 13:17 UTC (permalink / raw)
  To: Chris Wilson, dri-devel, intel-gfx


On 03/01/2017 12:38, Chris Wilson wrote:
> On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/01/2017 12:13, Chris Wilson wrote:
>>> On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/01/2017 11:46, Chris Wilson wrote:
>>>>> On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 03/01/2017 11:05, Chris Wilson wrote:
>>>>>>> The struct dma_fence carries a status field exposed to userspace by
>>>>>>> sync_file. This is inspected after the fence is signaled and can convey
>>>>>>> whether or not the request completed successfully, or in our case if we
>>>>>>> detected a hang during the request (signaled via -EIO in
>>>>>>> SYNC_IOC_FILE_INFO).
>>>>>>>
>>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
>>>>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> index 204c4a673bf3..bc99c0e292d8 100644
>>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>> @@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>>>>>>> 		ring_hung = false;
>>>>>>> 	}
>>>>>>>
>>>>>>> -	if (ring_hung)
>>>>>>> +	if (ring_hung) {
>>>>>>> 		i915_gem_context_mark_guilty(request->ctx);
>>>>>>> -	else
>>>>>>> +		request->fence.status = -EIO;
>>>>>>> +	} else {
>>>>>>> 		i915_gem_context_mark_innocent(request->ctx);
>>>>>>> +	}
>>>>>>>
>>>>>>> 	if (!ring_hung)
>>>>>>> 		return;
>>>>>>>
>>>>>>
>>>>>> Reading what happens later in this function, should we set the
>>>>>> status of all the other requests we are about to clear?
>>>>>>
>>>>>> However one thing I don't understand is how this scheme interacts
>>>>>> with the current userspace. We will clear/no-nop some of the
>>>>>> submitted requests since the state is corrupt. But how will
>>>>>> userspace notice this before it submits more requests?
>>>>>
>>>>> There is no mechanism currently for user space to be able to detect a
>>>>> hung request. (It can use the uevent for async notification of the
>>>>> hang/reset, but that will not tell you who caused the hang.) Userspace
>>>>> can track the number of hangs it caused, but the delay makes any
>>>>> roundtripping impractical (i.e. you have to synchronise before all
>>>>> rendering if you must detect the event immediately). Note also that we
>>>>> do not want to give out interprocess information (i.e. to allow one
>>>>> client to spy on another), which makes things harder to get right.
>>>>
>>>> So the idea is to clear already submitted requests _if_ the userspace is
>>>> synchronising before all rendering and looking at reset stats, to
>>>> make it theoretically possible to detect the corrupt state?
>>>
>>> No, I just don't see a way that userspace can detect the hang without
>>> testing after seeing the request signaled (either by waiting on the
>>> batch or by waiting on the fence), i.e. by being completely synchronous
>>> (or at least choosing its synchronous points very carefully, such as
>>> around IPC). It can either poll reset-count or use sync_file (which
>>> requires fence exporting).
>>>
>>> The current robustness interface is a basic query on whether any reset
>>> occurred within the context, not when.
>>
>> Why do we bother with clearing the submitted requests then?
>
> The same reason we ban processes from submitting new requests if they
> cause repeated hangs. If before we ban that client, it has already
> submitted 1000 hanging requests, it has successfully locked the machine
> up for a couple of hours.

So we would need to gate clearing on the transition to the banned state, 
I think, because currently it does it unconditionally.

Regards,

Tvrtko



* Re: [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang
  2017-01-03 13:17               ` Tvrtko Ursulin
@ 2017-01-03 13:25                 ` Chris Wilson
  0 siblings, 0 replies; 19+ messages in thread
From: Chris Wilson @ 2017-01-03 13:25 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx, dri-devel

On Tue, Jan 03, 2017 at 01:17:19PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 12:38, Chris Wilson wrote:
> >On Tue, Jan 03, 2017 at 12:34:16PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 03/01/2017 12:13, Chris Wilson wrote:
> >>>On Tue, Jan 03, 2017 at 11:57:44AM +0000, Tvrtko Ursulin wrote:
> >>>>
> >>>>On 03/01/2017 11:46, Chris Wilson wrote:
> >>>>>On Tue, Jan 03, 2017 at 11:34:45AM +0000, Tvrtko Ursulin wrote:
> >>>>>>
> >>>>>>On 03/01/2017 11:05, Chris Wilson wrote:
> >>>>>>>The struct dma_fence carries a status field exposed to userspace by
> >>>>>>>sync_file. This is inspected after the fence is signaled and can convey
> >>>>>>>whether or not the request completed successfully, or in our case if we
> >>>>>>>detected a hang during the request (signaled via -EIO in
> >>>>>>>SYNC_IOC_FILE_INFO).
> >>>>>>>
> >>>>>>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>>>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>>>Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >>>>>>>---
> >>>>>>>drivers/gpu/drm/i915/i915_gem.c | 6 ++++--
> >>>>>>>1 file changed, 4 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>>>index 204c4a673bf3..bc99c0e292d8 100644
> >>>>>>>--- a/drivers/gpu/drm/i915/i915_gem.c
> >>>>>>>+++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>>>@@ -2757,10 +2757,12 @@ static void i915_gem_reset_engine(struct intel_engine_cs *engine)
> >>>>>>>		ring_hung = false;
> >>>>>>>	}
> >>>>>>>
> >>>>>>>-	if (ring_hung)
> >>>>>>>+	if (ring_hung) {
> >>>>>>>		i915_gem_context_mark_guilty(request->ctx);
> >>>>>>>-	else
> >>>>>>>+		request->fence.status = -EIO;
> >>>>>>>+	} else {
> >>>>>>>		i915_gem_context_mark_innocent(request->ctx);
> >>>>>>>+	}
> >>>>>>>
> >>>>>>>	if (!ring_hung)
> >>>>>>>		return;
> >>>>>>>
> >>>>>>
> >>>>>>Reading what happens later in this function, should we set the
> >>>>>>status of all the other requests we are about to clear?
> >>>>>>
> >>>>>>However one thing I don't understand is how this scheme interacts
> >>>>>>with the current userspace. We will clear/no-nop some of the
> >>>>>>submitted requests since the state is corrupt. But how will
> >>>>>>userspace notice this before it submits more requests?
> >>>>>
> >>>>>There is no mechanism currently for user space to be able to detect a
> >>>>>hung request. (It can use the uevent for async notification of the
> >>>>>hang/reset, but that will not tell you who caused the hang.) Userspace
> >>>>>can track the number of hangs it caused, but the delay makes any
> >>>>>roundtripping impractical (i.e. you have to synchronise before all
> >>>>>rendering if you must detect the event immediately). Note also that we
> >>>>>do not want to give out interprocess information (i.e. to allow one
> >>>>>client to spy on another), which makes things harder to get right.
> >>>>
> >>>>So idea is to clear already submitted requests _if_ the userspace is
> >>>>synchronising before all rendering and looking at reset stats, to
> >>>>make it theoretically possible to detect the corrupt state?
> >>>
> >>>No, I just don't see a way that userspace can detect the hang without
> >>>testing after seeing the request signaled (either by waiting on the
> >>>batch or by waiting on the fence), i.e. by being completely synchronous
> >>>(or at least choosing its synchronous points very carefully, such as
> >>>around IPC). It can either poll reset-count or use sync_file (which
> >>>requires fence exporting).
> >>>
> >>>The current robustness interface is a basic query on whether any reset
> >>>occurred within the context, not when.
> >>
> >>Why do we bother with clearing the submitted requests then?
> >
> >The same reason we ban processes from submitting new requests if they
> >cause repeated hangs. If before we ban that client, it has already
> >submitted 1000 hanging requests, it has successfully locked the machine
> >up for a couple of hours.
> 
> So we would need to gate clearing on the transition to banned state
> I think. Because currently it does so unconditionally.

Yes, see the other patch :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
  2017-01-03 11:05 [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init() Chris Wilson
  2017-01-03 11:05 ` [PATCH 2/2] drm/i915: Set guilty-flag on fence after detecting a hang Chris Wilson
  2017-01-03 11:53 ` ✗ Fi.CI.BAT: failure for series starting with [1/2] dma-fence: Clear fence->status during dma_fence_init() Patchwork
@ 2017-01-03 14:04 ` Tvrtko Ursulin
  2017-01-04  9:15   ` [Intel-gfx] " Daniel Vetter
  2 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2017-01-03 14:04 UTC (permalink / raw)
  To: Chris Wilson, dri-devel; +Cc: intel-gfx


On 03/01/2017 11:05, Chris Wilson wrote:
> As the fence->status is an optional field that may be set before
> dma_fence_signal() is called to convey that the fence completed with an
> error, we have to ensure that it is always set to zero on initialisation
> so that the typical use (i.e. unset) always flags a successful completion.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/dma-buf/dma-fence.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 3444f293ad4a..9130f790ebf3 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
>  	fence->context = context;
>  	fence->seqno = seqno;
>  	fence->flags = 0UL;
> +	fence->status = 0;
>
>  	trace_dma_fence_init(fence);
>  }
>

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

* Re: [Intel-gfx] [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
  2017-01-03 14:04 ` [PATCH 1/2] " Tvrtko Ursulin
@ 2017-01-04  9:15   ` Daniel Vetter
  2017-01-04  9:24     ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2017-01-04  9:15 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: gustavo.padovan, intel-gfx, dri-devel

On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote:
> 
> On 03/01/2017 11:05, Chris Wilson wrote:
> > As the fence->status is an optional field that may be set before
> > dma_fence_signal() is called to convey that the fence completed with an
> > error, we have to ensure that it is always set to zero on initialisation
> > so that the typical use (i.e. unset) always flags a successful completion.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >  drivers/dma-buf/dma-fence.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > index 3444f293ad4a..9130f790ebf3 100644
> > --- a/drivers/dma-buf/dma-fence.c
> > +++ b/drivers/dma-buf/dma-fence.c
> > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
> >  	fence->context = context;
> >  	fence->seqno = seqno;
> >  	fence->flags = 0UL;
> > +	fence->status = 0;
> > 
> >  	trace_dma_fence_init(fence);
> >  }
> > 
> 
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Looking at existing users there's only the sync_file stuff. And that gets
it wrong, because afaics no one ever sets fence->status to anything
useful. But sync_file blindly assumes it's correct.

Gustavo, Maarten?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
  2017-01-04  9:15   ` [Intel-gfx] " Daniel Vetter
@ 2017-01-04  9:24     ` Chris Wilson
  2017-01-04  9:37       ` [Intel-gfx] " Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2017-01-04  9:24 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: gustavo.padovan, intel-gfx, dri-devel

On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote:
> On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote:
> > 
> > On 03/01/2017 11:05, Chris Wilson wrote:
> > > As the fence->status is an optional field that may be set before
> > > dma_fence_signal() is called to convey that the fence completed with an
> > > error, we have to ensure that it is always set to zero on initialisation
> > > so that the typical use (i.e. unset) always flags a successful completion.
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > ---
> > >  drivers/dma-buf/dma-fence.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > index 3444f293ad4a..9130f790ebf3 100644
> > > --- a/drivers/dma-buf/dma-fence.c
> > > +++ b/drivers/dma-buf/dma-fence.c
> > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
> > >  	fence->context = context;
> > >  	fence->seqno = seqno;
> > >  	fence->flags = 0UL;
> > > +	fence->status = 0;
> > > 
> > >  	trace_dma_fence_init(fence);
> > >  }
> > > 
> > 
> > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Looking at existing users there's only the sync_file stuff. And that gets
> it wrong, because afaics no one ever sets fence->status to anything
> useful. But sync_file blindly assumes it's correct.

In terms of doc, sync_file is using it correctly, and dma-fence isn't
living up to its doc. The documented behaviour (sync_file) seems useful.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [Intel-gfx] [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
  2017-01-04  9:24     ` Chris Wilson
@ 2017-01-04  9:37       ` Daniel Vetter
  2017-01-04  9:43         ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2017-01-04  9:37 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Tvrtko Ursulin, dri-devel,
	intel-gfx, Maarten Lankhorst, gustavo.padovan

On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote:
> On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote:
> > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 03/01/2017 11:05, Chris Wilson wrote:
> > > > As the fence->status is an optional field that may be set before
> > > > dma_fence_signal() is called to convey that the fence completed with an
> > > > error, we have to ensure that it is always set to zero on initialisation
> > > > so that the typical use (i.e. unset) always flags a successful completion.
> > > > 
> > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > ---
> > > >  drivers/dma-buf/dma-fence.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > > index 3444f293ad4a..9130f790ebf3 100644
> > > > --- a/drivers/dma-buf/dma-fence.c
> > > > +++ b/drivers/dma-buf/dma-fence.c
> > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
> > > >  	fence->context = context;
> > > >  	fence->seqno = seqno;
> > > >  	fence->flags = 0UL;
> > > > +	fence->status = 0;
> > > > 
> > > >  	trace_dma_fence_init(fence);
> > > >  }
> > > > 
> > > 
> > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > 
> > Looking at existing users there's only the sync_file stuff. And that gets
> > it wrong, because afaics no one ever sets fence->status to anything
> > useful. But sync_file blindly assumes it's correct.
> 
> In terms of doc, sync_file is using it correctly, and dma-fence isn't
> living up to its doc. The documented behaviour (sync_file) seems useful.

Yeah, it looks like an oversight in merging sync_file and dma_fence
together, but instead of piecemeal fixing (like this patch does), fixing
it all&properly might be better.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
  2017-01-04  9:37       ` [Intel-gfx] " Daniel Vetter
@ 2017-01-04  9:43         ` Chris Wilson
  2017-01-04 10:18           ` Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2017-01-04  9:43 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: gustavo.padovan, intel-gfx, dri-devel

On Wed, Jan 04, 2017 at 10:37:32AM +0100, Daniel Vetter wrote:
> On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote:
> > On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote:
> > > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote:
> > > > 
> > > > On 03/01/2017 11:05, Chris Wilson wrote:
> > > > > As the fence->status is an optional field that may be set before
> > > > > dma_fence_signal() is called to convey that the fence completed with an
> > > > > error, we have to ensure that it is always set to zero on initialisation
> > > > > so that the typical use (i.e. unset) always flags a successful completion.
> > > > > 
> > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > ---
> > > > >  drivers/dma-buf/dma-fence.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > > > index 3444f293ad4a..9130f790ebf3 100644
> > > > > --- a/drivers/dma-buf/dma-fence.c
> > > > > +++ b/drivers/dma-buf/dma-fence.c
> > > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
> > > > >  	fence->context = context;
> > > > >  	fence->seqno = seqno;
> > > > >  	fence->flags = 0UL;
> > > > > +	fence->status = 0;
> > > > > 
> > > > >  	trace_dma_fence_init(fence);
> > > > >  }
> > > > > 
> > > > 
> > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > 
> > > Looking at existing users there's only the sync_file stuff. And that gets
> > > it wrong, because afaics no one ever sets fence->status to anything
> > > useful. But sync_file blindly assumes it's correct.
> > 
> > In terms of doc, sync_file is using it correctly, and dma-fence isn't
> > living up to its doc. The documented behaviour (sync_file) seems useful.
> 
> Yeah, it looks like an oversight in merging sync_file and dma_fence
> together, but instead of piecemeal fixing (like this patch does), fixing
> it all&properly might be better.

What's missing?

  * @status: Optional, only valid if < 0, must be set before calling
  * dma_fence_signal, indicates that the fence has completed with an
  * error.

dma-fence must then guarantee that it is zeroed during init, then the
drivers can supply the optional information prior to calling
dma_fence_signal()

A dma_fence_set_status(fence, status) {
	BUG_ON(test_bit(SIGNALED, &fence->flags));
	BUG_ON(!IS_ERR_VALUE(status));
	BUG_ON(fence->status);
	fence->status = status;
} ?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
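
The invariants of the helper sketched here can be sanity-checked with a plain userspace model. The struct and function names below are illustrative stand-ins for the real `struct dma_fence` in `<linux/dma-fence.h>`, not the kernel API, and the model returns `false` where the kernel sketch would BUG_ON:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in fence carrying only the two fields this thread discusses. */
struct fence_model {
	bool signaled;	/* models DMA_FENCE_FLAG_SIGNALED_BIT in fence->flags */
	int status;	/* 0 = unset/success, < 0 = errno-style error */
};

/*
 * Models the proposed dma_fence_set_status(): the error must be a
 * negative errno, may be recorded only once, and only before the
 * fence signals.
 */
static bool fence_set_status(struct fence_model *f, int status)
{
	if (f->signaled)	/* too late: waiters may already have read it */
		return false;
	if (status >= 0)	/* only error codes are valid here */
		return false;
	if (f->status)		/* already set once */
		return false;
	f->status = status;
	return true;
}
```

Note that with dma_fence_init() clearing status to zero (patch 1/2), the `if (f->status)` check is what makes the set-once guarantee meaningful.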

* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
  2017-01-04  9:43         ` Chris Wilson
@ 2017-01-04 10:18           ` Daniel Vetter
  2017-01-04 10:26             ` Chris Wilson
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2017-01-04 10:18 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Tvrtko Ursulin, dri-devel,
	intel-gfx, Maarten Lankhorst, gustavo.padovan

On Wed, Jan 04, 2017 at 09:43:51AM +0000, Chris Wilson wrote:
> On Wed, Jan 04, 2017 at 10:37:32AM +0100, Daniel Vetter wrote:
> > On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote:
> > > On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote:
> > > > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 03/01/2017 11:05, Chris Wilson wrote:
> > > > > > As the fence->status is an optional field that may be set before
> > > > > > dma_fence_signal() is called to convey that the fence completed with an
> > > > > > error, we have to ensure that it is always set to zero on initialisation
> > > > > > so that the typical use (i.e. unset) always flags a successful completion.
> > > > > > 
> > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > ---
> > > > > >  drivers/dma-buf/dma-fence.c | 1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > > 
> > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > > > > index 3444f293ad4a..9130f790ebf3 100644
> > > > > > --- a/drivers/dma-buf/dma-fence.c
> > > > > > +++ b/drivers/dma-buf/dma-fence.c
> > > > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
> > > > > >  	fence->context = context;
> > > > > >  	fence->seqno = seqno;
> > > > > >  	fence->flags = 0UL;
> > > > > > +	fence->status = 0;
> > > > > > 
> > > > > >  	trace_dma_fence_init(fence);
> > > > > >  }
> > > > > > 
> > > > > 
> > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > 
> > > > Looking at existing users there's only the sync_file stuff. And that gets
> > > > it wrong, because afaics no one ever sets fence->status to anything
> > > > useful. But sync_file blindly assumes it's correct.
> > > 
> > > In terms of doc, sync_file is using it correctly, and dma-fence isn't
> > > living up to its doc. The documented behaviour (sync_file) seems useful.
> > 
> > Yeah, it looks like an oversight in merging sync_file and dma_fence
> > together, but instead of piecemeal fixing (like this patch does), fixing
> > it all&properly might be better.
> 
> What's missing?
> 
>   * @status: Optional, only valid if < 0, must be set before calling
>   * dma_fence_signal, indicates that the fence has completed with an
>   * error.
> 
> > dma-fence must then guarantee that it is zeroed during init, then the
> drivers can supply the optional information prior to calling
> dma_fence_signal()
> 
> A dma_fence_set_status(fence, status) {
> 	BUG_ON(test_bit(SIGNALED, &fence->flags);
> 	BUG_ON(!IS_ERR_VALUE(status));
> 	BUG_ON(fence->status);
> 	fence->status = status;
> } ?

The trouble I'm seeing is that sync_file.c and sync_debug.c _only_ look at
fence->status, and assume that status > 0 means the fence is signalled.
That code doesn't check fence->flags at all ...

So maybe we need to rename status to error_status to make it clear it's
for failure modes only, and fix the sync_file code to look at flags, too.

Please maybe even some userspace tests to sweeten the deal? Perhaps in our
hangstats tests.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
  2017-01-04 10:18           ` Daniel Vetter
@ 2017-01-04 10:26             ` Chris Wilson
  2017-01-04 11:31               ` [Intel-gfx] " Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Wilson @ 2017-01-04 10:26 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: gustavo.padovan, intel-gfx, dri-devel

On Wed, Jan 04, 2017 at 11:18:58AM +0100, Daniel Vetter wrote:
> On Wed, Jan 04, 2017 at 09:43:51AM +0000, Chris Wilson wrote:
> > On Wed, Jan 04, 2017 at 10:37:32AM +0100, Daniel Vetter wrote:
> > > On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote:
> > > > On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote:
> > > > > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote:
> > > > > > 
> > > > > > On 03/01/2017 11:05, Chris Wilson wrote:
> > > > > > > As the fence->status is an optional field that may be set before
> > > > > > > dma_fence_signal() is called to convey that the fence completed with an
> > > > > > > error, we have to ensure that it is always set to zero on initialisation
> > > > > > > so that the typical use (i.e. unset) always flags a successful completion.
> > > > > > > 
> > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > > ---
> > > > > > >  drivers/dma-buf/dma-fence.c | 1 +
> > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > 
> > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > > > > > index 3444f293ad4a..9130f790ebf3 100644
> > > > > > > --- a/drivers/dma-buf/dma-fence.c
> > > > > > > +++ b/drivers/dma-buf/dma-fence.c
> > > > > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
> > > > > > >  	fence->context = context;
> > > > > > >  	fence->seqno = seqno;
> > > > > > >  	fence->flags = 0UL;
> > > > > > > +	fence->status = 0;
> > > > > > > 
> > > > > > >  	trace_dma_fence_init(fence);
> > > > > > >  }
> > > > > > > 
> > > > > > 
> > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > > 
> > > > > Looking at existing users there's only the sync_file stuff. And that gets
> > > > > it wrong, because afaics no one ever sets fence->status to anything
> > > > > useful. But sync_file blindly assumes it's correct.
> > > > 
> > > > In terms of doc, sync_file is using it correctly, and dma-fence isn't
> > > > living up to its doc. The documented behaviour (sync_file) seems useful.
> > > 
> > > Yeah, it looks like an oversight in merging sync_file and dma_fence
> > > together, but instead of piecemeal fixing (like this patch does), fixing
> > > it all&properly might be better.
> > 
> > What's missing?
> > 
> >   * @status: Optional, only valid if < 0, must be set before calling
> >   * dma_fence_signal, indicates that the fence has completed with an
> >   * error.
> > 
> > dma-fence must then guarantee that it is zeroed during init, then the
> > drivers can supply the optional information prior to calling
> > dma_fence_signal()
> > 
> > A dma_fence_set_status(fence, status) {
> > 	BUG_ON(test_bit(SIGNALED, &fence->flags));
> > 	BUG_ON(!IS_ERR_VALUE(status));
> > 	BUG_ON(fence->status);
> > 	fence->status = status;
> > } ?
> 
> The trouble I'm seeing is that sync_file.c and sync_debug.c _only_ look at
> fence->status, and assume that status > 0 means the fence is signalled.
> That code doesn't check fence->flags at all ...

sync_file:
sync_fill_fence_info() {
	if (dma_fence_is_signaled(fence))
		info->status = fence->status >= 0 ? 1 : fence->status;
	else
		info->status = 0;
}

The no_fences: path is confusing since it sets info.status, but that's a
different struct entirely.

sync_debug:
sync_print_fence() {
	status = 1;
	if (dma_fence_is_signaled_locked(fence))
		status = fence->status;

That's backwards...

> So maybe we need to rename status to error_status to make it clear it's
> for failure modes only, and fix the sync_file code to look at flags, too.

It is already ^^.
 
> Please maybe even some userspace tests to sweeten the deal? Perhaps in our
> hangstats tests.

I am already querying the error code in igt.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
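
The reporting rule that sync_fill_fence_info() implements, and that the quoted sync_debug logic inverts, can be condensed into one function. This is again a userspace model with stand-in types, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in fence: the signaled flag plus the optional error status. */
struct fence_model {
	bool signaled;
	int status;
};

/*
 * Models sync_fill_fence_info() as quoted above; userspace sees:
 *   0           while the fence is still pending,
 *   1           once it signals successfully,
 *   status < 0  once it signals with an error (e.g. -EIO on a GPU hang).
 */
static int fence_info_status(const struct fence_model *f)
{
	if (!f->signaled)
		return 0;
	return f->status >= 0 ? 1 : f->status;
}
```

By contrast, the quoted sync_print_fence() reports 1 for a pending fence and 0 for a successfully signaled one, exactly inverted relative to this convention; hence "backwards".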

* Re: [Intel-gfx] [PATCH 1/2] dma-fence: Clear fence->status during dma_fence_init()
  2017-01-04 10:26             ` Chris Wilson
@ 2017-01-04 11:31               ` Daniel Vetter
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Vetter @ 2017-01-04 11:31 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Tvrtko Ursulin, dri-devel,
	intel-gfx, Maarten Lankhorst, gustavo.padovan

On Wed, Jan 04, 2017 at 10:26:49AM +0000, Chris Wilson wrote:
> On Wed, Jan 04, 2017 at 11:18:58AM +0100, Daniel Vetter wrote:
> > On Wed, Jan 04, 2017 at 09:43:51AM +0000, Chris Wilson wrote:
> > > On Wed, Jan 04, 2017 at 10:37:32AM +0100, Daniel Vetter wrote:
> > > > On Wed, Jan 04, 2017 at 09:24:27AM +0000, Chris Wilson wrote:
> > > > > On Wed, Jan 04, 2017 at 10:15:01AM +0100, Daniel Vetter wrote:
> > > > > > On Tue, Jan 03, 2017 at 02:04:44PM +0000, Tvrtko Ursulin wrote:
> > > > > > > 
> > > > > > > On 03/01/2017 11:05, Chris Wilson wrote:
> > > > > > > > As the fence->status is an optional field that may be set before
> > > > > > > > dma_fence_signal() is called to convey that the fence completed with an
> > > > > > > > error, we have to ensure that it is always set to zero on initialisation
> > > > > > > > so that the typical use (i.e. unset) always flags a successful completion.
> > > > > > > > 
> > > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > > > ---
> > > > > > > >  drivers/dma-buf/dma-fence.c | 1 +
> > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > > > > > > index 3444f293ad4a..9130f790ebf3 100644
> > > > > > > > --- a/drivers/dma-buf/dma-fence.c
> > > > > > > > +++ b/drivers/dma-buf/dma-fence.c
> > > > > > > > @@ -534,6 +534,7 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
> > > > > > > >  	fence->context = context;
> > > > > > > >  	fence->seqno = seqno;
> > > > > > > >  	fence->flags = 0UL;
> > > > > > > > +	fence->status = 0;
> > > > > > > > 
> > > > > > > >  	trace_dma_fence_init(fence);
> > > > > > > >  }
> > > > > > > > 
> > > > > > > 
> > > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > > > 
> > > > > > Looking at existing users there's only the sync_file stuff. And that gets
> > > > > > it wrong, because afaics no one ever sets fence->status to anything
> > > > > > useful. But sync_file blindly assumes it's correct.
> > > > > 
> > > > > In terms of doc, sync_file is using it correctly, and dma-fence isn't
> > > > > living up to its doc. The documented behaviour (sync_file) seems useful.
> > > > 
> > > > Yeah, it looks like an oversight in merging sync_file and dma_fence
> > > > together, but instead of piecemeal fixing (like this patch does), fixing
> > > > it all&properly might be better.
> > > 
> > > What's missing?
> > > 
> > >   * @status: Optional, only valid if < 0, must be set before calling
> > >   * dma_fence_signal, indicates that the fence has completed with an
> > >   * error.
> > > 
> > > dma-fence must then guarantee that it is zeroed during init, then the
> > > drivers can supply the optional information prior to calling
> > > dma_fence_signal()
> > > 
> > > A dma_fence_set_status(fence, status) {
> > > 	BUG_ON(test_bit(SIGNALED, &fence->flags));
> > > 	BUG_ON(!IS_ERR_VALUE(status));
> > > 	BUG_ON(fence->status);
> > > 	fence->status = status;
> > > } ?
> > 
> > The trouble I'm seeing is that sync_file.c and sync_debug.c _only_ look at
> > fence->status, and assume that status > 0 means the fence is signalled.
> > That code doesn't check fence->flags at all ...
> 
> sync_file:
> sync_fill_fence_info() {
> 	if (dma_fence_is_signaled(fence))
> 		info->status = fence->status >= 0 ? 1 : fence->status;
> 	else
> 		info->status = 0;
> }
> 
> The no_fences: path is confusing since it sets info.status, but that's a
> different struct entirely.

Ah right, coffee didn't properly kick in this morning ;-)

> sync_debug:
> sync_print_fence() {
> 	status = 1;
> 	if (dma_fence_is_signaled_locked(fence))
> 		status = fence->status;
> 
> That's backwards...

Yeah, that's the one I meant. But it's for debug only

> > So maybe we need to rename status to error_status to make it clear it's
> > for failure modes only, and fix the sync_file code to look at flags, too.
> 
> It is already ^^.
>  
> > Please maybe even some userspace tests to sweeten the deal? Perhaps in our
> > hangstats tests.
> 
> I am already querying the error code in igt.

Excellent, and since sync_file works that explains why igt works ;-)

Can you pls spin some patch to fix up sync_debug, so I can pull it all in?
And I still think s/status/error_status/ would help clarity a lot. And
then maybe even the dma_fence_set_error_status helper from above.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
