* [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
@ 2017-07-21 16:20 Christian König
2017-07-24 8:33 ` Daniel Vetter
2017-07-24 8:34 ` zhoucm1
0 siblings, 2 replies; 13+ messages in thread
From: Christian König @ 2017-07-21 16:20 UTC (permalink / raw)
To: linux-media, dri-devel, linaro-mm-sig; +Cc: sumit.semwal
From: Christian König <christian.koenig@amd.com>
With hardware resets in mind it is possible that all shared fences are
signaled, but the exclusive fence isn't. Fix waiting for everything in this situation.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
drivers/dma-buf/reservation.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index e2eff86..ce3f9c1 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -461,7 +461,7 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
}
}
- if (!shared_count) {
+ if (!fence) {
struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
if (fence_excl &&
--
2.7.4
^ permalink raw reply related [flat|nested] 13+ messages in thread
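For context, the control flow around the changed condition can be modeled in userspace roughly as follows (a hypothetical simplification of reservation_object_wait_timeout_rcu(); RCU, locking, sequence counts and timeouts are omitted, and names are approximate):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: a fence is reduced to a signaled flag. */
struct fence { bool signaled; };

/*
 * Returns the fence the waiter would block on next, or NULL when there
 * is nothing left to wait for.  `fixed` selects between the old
 * "if (!shared_count)" test and the patched "if (!fence)" test.
 */
static struct fence *pick_wait_fence(struct fence *excl,
				     struct fence **shared,
				     size_t shared_count,
				     bool wait_all, bool fixed)
{
	struct fence *fence = NULL;

	if (wait_all) {
		for (size_t i = 0; i < shared_count; ++i) {
			if (!shared[i]->signaled) {
				fence = shared[i];
				break;
			}
		}
	}

	/*
	 * The old code only looked at the exclusive fence when there were
	 * no shared fences at all.  The fix also looks at it when all
	 * shared fences are already signaled (fence is still NULL).
	 */
	if (fixed ? !fence : !shared_count) {
		if (excl && !excl->signaled)
			fence = excl;
	}

	return fence;	/* the real code waits on this and then retries */
}
```

With one signaled shared fence and an unsignaled exclusive fence (the post-reset state the commit message describes), the old condition returns NULL and the wait ends too early, while the fixed condition returns the exclusive fence.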
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-21 16:20 [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly Christian König
@ 2017-07-24 8:33 ` Daniel Vetter
2017-07-24 9:51 ` Christian König
2017-07-24 8:34 ` zhoucm1
1 sibling, 1 reply; 13+ messages in thread
From: Daniel Vetter @ 2017-07-24 8:33 UTC (permalink / raw)
To: Christian König; +Cc: linux-media, dri-devel, linaro-mm-sig
On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
> From: Christian König <christian.koenig@amd.com>
>
> With hardware resets in mind it is possible that all shared fences are
> signaled, but the exclusive fence isn't. Fix waiting for everything in this situation.
How did you end up with both shared and exclusive fences on the same
reservation object? At least I thought the point of exclusive was that
it's exclusive (and has an implicit barrier on all previous shared
fences). Same for shared fences, they need to wait for the exclusive one
(and replace it).
Is this fallout from the amdgpu trickery where by default you do all
shared fences? I thought we've aligned semantics a while back ...
-Daniel
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
> drivers/dma-buf/reservation.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> index e2eff86..ce3f9c1 100644
> --- a/drivers/dma-buf/reservation.c
> +++ b/drivers/dma-buf/reservation.c
> @@ -461,7 +461,7 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
> }
> }
>
> - if (!shared_count) {
> + if (!fence) {
> struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
>
> if (fence_excl &&
> --
> 2.7.4
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-21 16:20 [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly Christian König
2017-07-24 8:33 ` Daniel Vetter
@ 2017-07-24 8:34 ` zhoucm1
2017-07-24 9:58 ` Christian König
1 sibling, 1 reply; 13+ messages in thread
From: zhoucm1 @ 2017-07-24 8:34 UTC (permalink / raw)
To: Christian König, linux-media, dri-devel, linaro-mm-sig
On 2017-07-22 00:20, Christian König wrote:
> From: Christian König <christian.koenig@amd.com>
>
> With hardware resets in mind it is possible that all shared fences are
> signaled, but the exclusive fence isn't. Fix waiting for everything in this situation.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
> drivers/dma-buf/reservation.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> index e2eff86..ce3f9c1 100644
> --- a/drivers/dma-buf/reservation.c
> +++ b/drivers/dma-buf/reservation.c
> @@ -461,7 +461,7 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
> }
> }
>
> - if (!shared_count) {
> + if (!fence) {
The previous code seems to be a bug: the exclusive fence isn't waited on
at all if shared_count != 0.
With your fix, there is still a case where the exclusive fence could be
skipped: when fobj->shared[shared_count-1] isn't signalled.
Regards,
David Zhou
> struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
>
> if (fence_excl &&
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-24 8:33 ` Daniel Vetter
@ 2017-07-24 9:51 ` Christian König
2017-07-24 11:57 ` Daniel Vetter
0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2017-07-24 9:51 UTC (permalink / raw)
To: Daniel Vetter; +Cc: linux-media, dri-devel, linaro-mm-sig
On 2017-07-24 10:33, Daniel Vetter wrote:
> On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
>> From: Christian König <christian.koenig@amd.com>
>>
>> With hardware resets in mind it is possible that all shared fences are
>> signaled, but the exclusive fence isn't. Fix waiting for everything in this situation.
> How did you end up with both shared and exclusive fences on the same
> reservation object? At least I thought the point of exclusive was that
> it's exclusive (and has an implicit barrier on all previous shared
> fences). Same for shared fences, they need to wait for the exclusive one
> (and replace it).
>
> Is this fallout from the amdgpu trickery where by default you do all
> shared fences? I thought we've aligned semantics a while back ...
No, that is perfectly normal even for other drivers. Take a look at the
reservation code.
The exclusive fence replaces all shared fences, but adding a shared
fence doesn't replace the exclusive fence. That actually makes sense,
cause when you want to add more shared fences those need to wait for the
last exclusive fence as well.
Now normally I would agree that when you have shared fences it is
sufficient to wait for all of them cause those operations can't start
before the exclusive one finishes. But with GPU reset and/or the ability
to abort already submitted operations it is perfectly possible that you
end up with an exclusive fence which isn't signaled and a shared fence
which is signaled in the same reservation object.
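The bookkeeping described above can be sketched as a toy model (hypothetical userspace code, not the kernel's reservation_object API; fences are reduced to integer ids):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SHARED 8

/* Toy reservation object: one exclusive slot plus a shared list. */
struct resv {
	int excl;		/* 0 means "no exclusive fence" */
	int shared[MAX_SHARED];
	size_t shared_count;
};

/* Adding an exclusive fence drops every shared fence. */
static void resv_add_excl(struct resv *r, int fence)
{
	r->excl = fence;
	r->shared_count = 0;
}

/*
 * Adding a shared fence leaves the exclusive fence in place: the new
 * shared work still has to wait for the last exclusive operation.
 */
static void resv_add_shared(struct resv *r, int fence)
{
	assert(r->shared_count < MAX_SHARED);
	r->shared[r->shared_count++] = fence;
}
```

So a reservation object can legitimately hold both an exclusive fence and shared fences at the same time, which is the state the question above asked about.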
Christian.
> -Daniel
>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>> drivers/dma-buf/reservation.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
>> index e2eff86..ce3f9c1 100644
>> --- a/drivers/dma-buf/reservation.c
>> +++ b/drivers/dma-buf/reservation.c
>> @@ -461,7 +461,7 @@ long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
>> }
>> }
>>
>> - if (!shared_count) {
>> + if (!fence) {
>> struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
>>
>> if (fence_excl &&
>> --
>> 2.7.4
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-24 8:34 ` zhoucm1
@ 2017-07-24 9:58 ` Christian König
0 siblings, 0 replies; 13+ messages in thread
From: Christian König @ 2017-07-24 9:58 UTC (permalink / raw)
To: zhoucm1, linux-media, dri-devel, linaro-mm-sig
On 2017-07-24 10:34, zhoucm1 wrote:
>
>
> On 2017-07-22 00:20, Christian König wrote:
>> From: Christian König <christian.koenig@amd.com>
>>
>> With hardware resets in mind it is possible that all shared fences are
>> signaled, but the exclusive fence isn't. Fix waiting for everything in this
>> situation.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>> drivers/dma-buf/reservation.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/reservation.c
>> b/drivers/dma-buf/reservation.c
>> index e2eff86..ce3f9c1 100644
>> --- a/drivers/dma-buf/reservation.c
>> +++ b/drivers/dma-buf/reservation.c
>> @@ -461,7 +461,7 @@ long reservation_object_wait_timeout_rcu(struct
>> reservation_object *obj,
>> }
>> }
>> - if (!shared_count) {
>> + if (!fence) {
> The previous code seems to be a bug: the exclusive fence isn't waited
> on at all if shared_count != 0.
>
> With your fix, there is still a case where the exclusive fence could be
> skipped: when fobj->shared[shared_count-1] isn't signalled.
Yeah, indeed that looks like it needs to be fixed as well.
I'm still completely jet lagged and need to work through tons of stuff
from last week. Do you have time to take care of fixing up this patch
and sending a v2?
Thanks in advance,
Christian.
>
> Regards,
> David Zhou
>> struct dma_fence *fence_excl =
>> rcu_dereference(obj->fence_excl);
>> if (fence_excl &&
>
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-24 9:51 ` Christian König
@ 2017-07-24 11:57 ` Daniel Vetter
2017-07-25 6:16 ` zhoucm1
2017-07-25 9:11 ` Christian König
0 siblings, 2 replies; 13+ messages in thread
From: Daniel Vetter @ 2017-07-24 11:57 UTC (permalink / raw)
To: Christian König; +Cc: linux-media, dri-devel, linaro-mm-sig
On Mon, Jul 24, 2017 at 11:51 AM, Christian König
<deathsimple@vodafone.de> wrote:
> On 2017-07-24 10:33, Daniel Vetter wrote:
>>
>> On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
>>>
>>> From: Christian König <christian.koenig@amd.com>
>>>
>>> With hardware resets in mind it is possible that all shared fences are
>>> signaled, but the exclusive fence isn't. Fix waiting for everything in this
>>> situation.
>>
>> How did you end up with both shared and exclusive fences on the same
>> reservation object? At least I thought the point of exclusive was that
>> it's exclusive (and has an implicit barrier on all previous shared
>> fences). Same for shared fences, they need to wait for the exclusive one
>> (and replace it).
>>
>> Is this fallout from the amdgpu trickery where by default you do all
>> shared fences? I thought we've aligned semantics a while back ...
>
>
> No, that is perfectly normal even for other drivers. Take a look at the
> reservation code.
>
> The exclusive fence replaces all shared fences, but adding a shared fence
> doesn't replace the exclusive fence. That actually makes sense, cause when
> you want to add more shared fences those need to wait for the last exclusive
> fence as well.
Hm right.
> Now normally I would agree that when you have shared fences it is sufficient
> to wait for all of them cause those operations can't start before the
> exclusive one finishes. But with GPU reset and/or the ability to abort
> already submitted operations it is perfectly possible that you end up with
> an exclusive fence which isn't signaled and a shared fence which is signaled
> in the same reservation object.
How does that work? The batch(es) with the shared fence are all
supposed to wait for the exclusive fence before they start, which
means even if you gpu reset and restart/cancel certain things, they
shouldn't be able to complete out of order.
If you outright cancel a fence then you're supposed to first call
dma_fence_set_error(-EIO) and then complete it. Note that atm that
part might be slightly overengineered and I'm not sure about how we
expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or
soon, has been) used by i915 for its internal book-keeping, which
might not be the best to leak to other consumers. But completing
fences (at least exported ones, where userspace or other drivers can
get at them) shouldn't be possible.
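The cancel-then-signal protocol above can be sketched with userspace stand-ins (hypothetical names mirroring dma_fence_set_error()/dma_fence_signal(), not the kernel API):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy fence carrying a completion flag and an error status. */
struct mfence {
	bool signaled;
	int error;	/* 0, or a negative errno set before signaling */
};

static void mfence_set_error(struct mfence *f, int error)
{
	assert(!f->signaled);	/* must happen before the fence signals */
	f->error = error;
}

static void mfence_signal(struct mfence *f)
{
	f->signaled = true;
}

/* Cancelling still completes the fence, it just completes it with -EIO. */
static void mfence_cancel(struct mfence *f)
{
	mfence_set_error(f, -EIO);
	mfence_signal(f);
}
```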
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-24 11:57 ` Daniel Vetter
@ 2017-07-25 6:16 ` zhoucm1
2017-07-25 6:50 ` Daniel Vetter
2017-07-25 9:11 ` Christian König
1 sibling, 1 reply; 13+ messages in thread
From: zhoucm1 @ 2017-07-25 6:16 UTC (permalink / raw)
To: Daniel Vetter, Christian König; +Cc: linaro-mm-sig, dri-devel, linux-media
On 2017-07-24 19:57, Daniel Vetter wrote:
> On Mon, Jul 24, 2017 at 11:51 AM, Christian König
> <deathsimple@vodafone.de> wrote:
>> On 2017-07-24 10:33, Daniel Vetter wrote:
>>> On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
>>>> From: Christian König <christian.koenig@amd.com>
>>>>
>>>> With hardware resets in mind it is possible that all shared fences are
>>>> signaled, but the exclusive fence isn't. Fix waiting for everything in this
>>>> situation.
>>> How did you end up with both shared and exclusive fences on the same
>>> reservation object? At least I thought the point of exclusive was that
>>> it's exclusive (and has an implicit barrier on all previous shared
>>> fences). Same for shared fences, they need to wait for the exclusive one
>>> (and replace it).
>>>
>>> Is this fallout from the amdgpu trickery where by default you do all
>>> shared fences? I thought we've aligned semantics a while back ...
>>
>> No, that is perfectly normal even for other drivers. Take a look at the
>> reservation code.
>>
>> The exclusive fence replaces all shared fences, but adding a shared fence
>> doesn't replace the exclusive fence. That actually makes sense, cause when
> >> you want to add more shared fences those need to wait for the last exclusive
>> fence as well.
> Hm right.
>
>> Now normally I would agree that when you have shared fences it is sufficient
>> to wait for all of them cause those operations can't start before the
>> exclusive one finishes. But with GPU reset and/or the ability to abort
>> already submitted operations it is perfectly possible that you end up with
>> an exclusive fence which isn't signaled and a shared fence which is signaled
>> in the same reservation object.
> How does that work? The batch(es) with the shared fence are all
> supposed to wait for the exclusive fence before they start, which
> means even if you gpu reset and restart/cancel certain things, they
> shouldn't be able to complete out of order.
Hi Daniel,
Do you mean the exclusive fence must be signalled before any shared fence?
Where could I find this restriction?
Thanks,
David Zhou
>
> If you outright cancel a fence then you're supposed to first call
> dma_fence_set_error(-EIO) and then complete it. Note that atm that
> part might be slightly overengineered and I'm not sure about how we
> expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or
> soon, has been) used by i915 for its internal book-keeping, which
> might not be the best to leak to other consumers. But completing
> fences (at least exported ones, where userspace or other drivers can
> get at them) shouldn't be possible.
> -Daniel
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-25 6:16 ` zhoucm1
@ 2017-07-25 6:50 ` Daniel Vetter
2017-07-25 6:55 ` zhoucm1
0 siblings, 1 reply; 13+ messages in thread
From: Daniel Vetter @ 2017-07-25 6:50 UTC (permalink / raw)
To: zhoucm1
Cc: Daniel Vetter, Christian König, linaro-mm-sig, dri-devel,
linux-media
On Tue, Jul 25, 2017 at 02:16:55PM +0800, zhoucm1 wrote:
>
>
> On 2017-07-24 19:57, Daniel Vetter wrote:
> > On Mon, Jul 24, 2017 at 11:51 AM, Christian König
> > <deathsimple@vodafone.de> wrote:
> > > On 2017-07-24 10:33, Daniel Vetter wrote:
> > > > On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
> > > > > From: Christian König <christian.koenig@amd.com>
> > > > >
> > > > > With hardware resets in mind it is possible that all shared fences are
> > > > > signaled, but the exclusive fence isn't. Fix waiting for everything in this
> > > > > situation.
> > > > How did you end up with both shared and exclusive fences on the same
> > > > reservation object? At least I thought the point of exclusive was that
> > > > it's exclusive (and has an implicit barrier on all previous shared
> > > > fences). Same for shared fences, they need to wait for the exclusive one
> > > > (and replace it).
> > > >
> > > > Is this fallout from the amdgpu trickery where by default you do all
> > > > shared fences? I thought we've aligned semantics a while back ...
> > >
> > > No, that is perfectly normal even for other drivers. Take a look at the
> > > reservation code.
> > >
> > > The exclusive fence replaces all shared fences, but adding a shared fence
> > > doesn't replace the exclusive fence. That actually makes sense, cause when
> > > you want to add more shared fences those need to wait for the last exclusive
> > > fence as well.
> > Hm right.
> >
> > > Now normally I would agree that when you have shared fences it is sufficient
> > > to wait for all of them cause those operations can't start before the
> > > exclusive one finishes. But with GPU reset and/or the ability to abort
> > > already submitted operations it is perfectly possible that you end up with
> > > an exclusive fence which isn't signaled and a shared fence which is signaled
> > > in the same reservation object.
> > How does that work? The batch(es) with the shared fence are all
> > supposed to wait for the exclusive fence before they start, which
> > means even if you gpu reset and restart/cancel certain things, they
> > shouldn't be able to complete out of order.
> Hi Daniel,
>
> Do you mean exclusive fence must be signalled before any shared fence? Where
> could I find this restriction?
Yes, Christian also described it above. Could be that we should have
better kerneldoc to document this ...
-Daniel
>
> Thanks,
> David Zhou
> >
> > If you outright cancel a fence then you're supposed to first call
> > dma_fence_set_error(-EIO) and then complete it. Note that atm that
> > part might be slightly overengineered and I'm not sure about how we
> > expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or
> > soon, has been) used by i915 for its internal book-keeping, which
> > might not be the best to leak to other consumers. But completing
> > fences (at least exported ones, where userspace or other drivers can
> > get at them) shouldn't be possible.
> > -Daniel
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-25 6:50 ` Daniel Vetter
@ 2017-07-25 6:55 ` zhoucm1
2017-07-25 7:02 ` Daniel Vetter
0 siblings, 1 reply; 13+ messages in thread
From: zhoucm1 @ 2017-07-25 6:55 UTC (permalink / raw)
To: Daniel Vetter; +Cc: Christian König, linaro-mm-sig, dri-devel, linux-media
On 2017-07-25 14:50, Daniel Vetter wrote:
> On Tue, Jul 25, 2017 at 02:16:55PM +0800, zhoucm1 wrote:
>>
>> On 2017-07-24 19:57, Daniel Vetter wrote:
>>> On Mon, Jul 24, 2017 at 11:51 AM, Christian König
>>> <deathsimple@vodafone.de> wrote:
>>>> On 2017-07-24 10:33, Daniel Vetter wrote:
>>>>> On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
>>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>>
>>>>>> With hardware resets in mind it is possible that all shared fences are
>>>>>> signaled, but the exclusive fence isn't. Fix waiting for everything in this
>>>>>> situation.
>>>>> How did you end up with both shared and exclusive fences on the same
>>>>> reservation object? At least I thought the point of exclusive was that
>>>>> it's exclusive (and has an implicit barrier on all previous shared
>>>>> fences). Same for shared fences, they need to wait for the exclusive one
>>>>> (and replace it).
>>>>>
>>>>> Is this fallout from the amdgpu trickery where by default you do all
>>>>> shared fences? I thought we've aligned semantics a while back ...
>>>> No, that is perfectly normal even for other drivers. Take a look at the
>>>> reservation code.
>>>>
>>>> The exclusive fence replaces all shared fences, but adding a shared fence
>>>> doesn't replace the exclusive fence. That actually makes sense, cause when
>>>> you want to add more shared fences those need to wait for the last exclusive
>>>> fence as well.
>>> Hm right.
>>>
>>>> Now normally I would agree that when you have shared fences it is sufficient
>>>> to wait for all of them cause those operations can't start before the
>>>> exclusive one finishes. But with GPU reset and/or the ability to abort
>>>> already submitted operations it is perfectly possible that you end up with
>>>> an exclusive fence which isn't signaled and a shared fence which is signaled
>>>> in the same reservation object.
>>> How does that work? The batch(es) with the shared fence are all
>>> supposed to wait for the exclusive fence before they start, which
>>> means even if you gpu reset and restart/cancel certain things, they
>>> shouldn't be able to complete out of order.
>> Hi Daniel,
>>
>> Do you mean exclusive fence must be signalled before any shared fence? Where
>> could I find this restriction?
> Yes, Christian also described it above. Could be that we should have
> better kerneldoc to document this ...
Is that a known assumption? If so, it doesn't matter even if we
always wait for the exclusive fence, right? Just one more line of checking.
Thanks,
David Zhou
> -Daniel
>
>> Thanks,
>> David Zhou
>>> If you outright cancel a fence then you're supposed to first call
>>> dma_fence_set_error(-EIO) and then complete it. Note that atm that
>>> part might be slightly overengineered and I'm not sure about how we
>>> expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or
>>> soon, has been) used by i915 for its internal book-keeping, which
>>> might not be the best to leak to other consumers. But completing
>>> fences (at least exported ones, where userspace or other drivers can
>>> get at them) shouldn't be possible.
>>> -Daniel
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-25 6:55 ` zhoucm1
@ 2017-07-25 7:02 ` Daniel Vetter
2017-07-25 7:15 ` zhoucm1
0 siblings, 1 reply; 13+ messages in thread
From: Daniel Vetter @ 2017-07-25 7:02 UTC (permalink / raw)
To: zhoucm1
Cc: Daniel Vetter, Christian König, linaro-mm-sig, dri-devel,
linux-media
On Tue, Jul 25, 2017 at 02:55:14PM +0800, zhoucm1 wrote:
>
>
> On 2017-07-25 14:50, Daniel Vetter wrote:
> > On Tue, Jul 25, 2017 at 02:16:55PM +0800, zhoucm1 wrote:
> > >
> > > On 2017-07-24 19:57, Daniel Vetter wrote:
> > > > On Mon, Jul 24, 2017 at 11:51 AM, Christian König
> > > > <deathsimple@vodafone.de> wrote:
> > > > > On 2017-07-24 10:33, Daniel Vetter wrote:
> > > > > > On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
> > > > > > > From: Christian König <christian.koenig@amd.com>
> > > > > > >
> > > > > > > With hardware resets in mind it is possible that all shared fences are
> > > > > > > signaled, but the exclusive fence isn't. Fix waiting for everything in this
> > > > > > > situation.
> > > > > > How did you end up with both shared and exclusive fences on the same
> > > > > > reservation object? At least I thought the point of exclusive was that
> > > > > > it's exclusive (and has an implicit barrier on all previous shared
> > > > > > fences). Same for shared fences, they need to wait for the exclusive one
> > > > > > (and replace it).
> > > > > >
> > > > > > Is this fallout from the amdgpu trickery where by default you do all
> > > > > > shared fences? I thought we've aligned semantics a while back ...
> > > > > No, that is perfectly normal even for other drivers. Take a look at the
> > > > > reservation code.
> > > > >
> > > > > The exclusive fence replaces all shared fences, but adding a shared fence
> > > > > doesn't replace the exclusive fence. That actually makes sense, cause when
> > > > > you want to add more shared fences those need to wait for the last exclusive
> > > > > fence as well.
> > > > Hm right.
> > > >
> > > > > Now normally I would agree that when you have shared fences it is sufficient
> > > > > to wait for all of them cause those operations can't start before the
> > > > > exclusive one finishes. But with GPU reset and/or the ability to abort
> > > > > already submitted operations it is perfectly possible that you end up with
> > > > > an exclusive fence which isn't signaled and a shared fence which is signaled
> > > > > in the same reservation object.
> > > > How does that work? The batch(es) with the shared fence are all
> > > > supposed to wait for the exclusive fence before they start, which
> > > > means even if you gpu reset and restart/cancel certain things, they
> > > > shouldn't be able to complete out of order.
> > > Hi Daniel,
> > >
> > > Do you mean exclusive fence must be signalled before any shared fence? Where
> > > could I find this restriction?
> > Yes, Christian also described it above. Could be that we should have
> > better kerneldoc to document this ...
> Is that a known assumption? If so, it doesn't matter even if we
> always wait for the exclusive fence, right? Just one more line of checking.
The problem is that amdgpu breaks that assumption over gpu reset, and that
might have implications _everywhere_, not just in this code here. Are you
sure this case won't pull the i915 driver over the table when sharing
dma-bufs with it? Did you audit the code (plus all the other drivers too
ofc).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-25 7:02 ` Daniel Vetter
@ 2017-07-25 7:15 ` zhoucm1
0 siblings, 0 replies; 13+ messages in thread
From: zhoucm1 @ 2017-07-25 7:15 UTC (permalink / raw)
To: Daniel Vetter; +Cc: Christian König, linaro-mm-sig, dri-devel, linux-media
On 2017-07-25 15:02, Daniel Vetter wrote:
> On Tue, Jul 25, 2017 at 02:55:14PM +0800, zhoucm1 wrote:
>>
>> On 2017-07-25 14:50, Daniel Vetter wrote:
>>> On Tue, Jul 25, 2017 at 02:16:55PM +0800, zhoucm1 wrote:
>>>> On 2017-07-24 19:57, Daniel Vetter wrote:
>>>>> On Mon, Jul 24, 2017 at 11:51 AM, Christian König
>>>>> <deathsimple@vodafone.de> wrote:
>>>>>> On 2017-07-24 10:33, Daniel Vetter wrote:
>>>>>>> On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
>>>>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>>>>
>>>>>>>> With hardware resets in mind it is possible that all shared fences are
>>>>>>>> signaled, but the exclusive fence isn't. Fix waiting for everything in this
>>>>>>>> situation.
>>>>>>> How did you end up with both shared and exclusive fences on the same
>>>>>>> reservation object? At least I thought the point of exclusive was that
>>>>>>> it's exclusive (and has an implicit barrier on all previous shared
>>>>>>> fences). Same for shared fences, they need to wait for the exclusive one
>>>>>>> (and replace it).
>>>>>>>
>>>>>>> Is this fallout from the amdgpu trickery where by default you do all
>>>>>>> shared fences? I thought we've aligned semantics a while back ...
>>>>>> No, that is perfectly normal even for other drivers. Take a look at the
>>>>>> reservation code.
>>>>>>
>>>>>> The exclusive fence replaces all shared fences, but adding a shared fence
>>>>>> doesn't replace the exclusive fence. That actually makes sense, cause when
>>>>>> you want to add more shared fences those need to wait for the last exclusive
>>>>>> fence as well.
>>>>> Hm right.
>>>>>
>>>>>> Now normally I would agree that when you have shared fences it is sufficient
>>>>>> to wait for all of them cause those operations can't start before the
>>>>>> exclusive one finishes. But with GPU reset and/or the ability to abort
>>>>>> already submitted operations it is perfectly possible that you end up with
>>>>>> an exclusive fence which isn't signaled and a shared fence which is signaled
>>>>>> in the same reservation object.
>>>>> How does that work? The batch(es) with the shared fence are all
>>>>> supposed to wait for the exclusive fence before they start, which
>>>>> means even if you gpu reset and restart/cancel certain things, they
>>>>> shouldn't be able to complete out of order.
>>>> Hi Daniel,
>>>>
>>>> Do you mean exclusive fence must be signalled before any shared fence? Where
>>>> could I find this restriction?
>>> Yes, Christian also described it above. Could be that we should have
>>> better kerneldoc to document this ...
>> Is that a known assumption? If so, it doesn't matter even if we
>> always wait for the exclusive fence, right? Just one more line of checking.
> The problem is that amdgpu breaks that assumption over gpu reset, and that
> might have implications _everywhere_, not just in this code here. Are you
> sure this case won't pull the i915 driver over the table when sharing
> dma-bufs with it?
Ah, I finally got your meaning. So as you mentioned before, we should at
least have a better description of this assumption; the best place is a
comment in reservation.c, so every newcomer would learn it when reading
the code for the first time.
Thanks,
David Zhou
> Did you audit the code (plus all the other drivers too
> ofc).
> -Daniel
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-24 11:57 ` Daniel Vetter
2017-07-25 6:16 ` zhoucm1
@ 2017-07-25 9:11 ` Christian König
2017-07-25 12:14 ` Daniel Vetter
1 sibling, 1 reply; 13+ messages in thread
From: Christian König @ 2017-07-25 9:11 UTC (permalink / raw)
To: Daniel Vetter; +Cc: linux-media, dri-devel, linaro-mm-sig
On 2017-07-24 13:57, Daniel Vetter wrote:
> On Mon, Jul 24, 2017 at 11:51 AM, Christian König
> <deathsimple@vodafone.de> wrote:
>> On 2017-07-24 10:33, Daniel Vetter wrote:
>>> On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
>>>> From: Christian König <christian.koenig@amd.com>
>>>>
>>>> With hardware resets in mind it is possible that all shared fences are
>>>> signaled, but the exclusive fence isn't. Fix waiting for everything in this
>>>> situation.
>>> How did you end up with both shared and exclusive fences on the same
>>> reservation object? At least I thought the point of exclusive was that
>>> it's exclusive (and has an implicit barrier on all previous shared
>>> fences). Same for shared fences, they need to wait for the exclusive one
>>> (and replace it).
>>>
>>> Is this fallout from the amdgpu trickery where by default you do all
>>> shared fences? I thought we've aligned semantics a while back ...
>>
>> No, that is perfectly normal even for other drivers. Take a look at the
>> reservation code.
>>
>> The exclusive fence replaces all shared fences, but adding a shared fence
>> doesn't replace the exclusive fence. That actually makes sense, cause when
>> you want to add more shared fences those need to wait for the last exclusive
>> fence as well.
> Hm right.
>
>> Now normally I would agree that when you have shared fences it is sufficient
>> to wait for all of them cause those operations can't start before the
>> exclusive one finishes. But with GPU reset and/or the ability to abort
>> already submitted operations it is perfectly possible that you end up with
>> an exclusive fence which isn't signaled and a shared fence which is signaled
>> in the same reservation object.
> How does that work? The batch(es) with the shared fence are all
> supposed to wait for the exclusive fence before they start, which
> means even if you gpu reset and restart/cancel certain things, they
> shouldn't be able to complete out of order.
Assume the following:
1. The exclusive fence is some move operation by the kernel which
executes on a DMA engine.
2. The shared fence is a 3D operation submitted by userspace which
executes on the 3D engine.
Now we find that the 3D engine is hung and needs a reset: all currently
submitted jobs are aborted, marked with an error code, and their fences are
put into the signaled state.
Since we only reset the 3D engine, the move operation (fortunately)
isn't affected by this.
I think this applies to all drivers and isn't something amdgpu specific.
Regards,
Christian.
>
> If you outright cancel a fence then you're supposed to first call
> dma_fence_set_error(-EIO) and then complete it. Note that atm that
> part might be slightly overengineered and I'm not sure about how we
> expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or
> soon, has been) used by i915 for its internal bookkeeping, which
> might not be the best to leak to other consumers. But completing
> fences (at least exported ones, where userspace or other drivers can
> get at them) shouldn't be possible.
> -Daniel
* Re: [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly
2017-07-25 9:11 ` Christian König
@ 2017-07-25 12:14 ` Daniel Vetter
0 siblings, 0 replies; 13+ messages in thread
From: Daniel Vetter @ 2017-07-25 12:14 UTC (permalink / raw)
To: Christian König; +Cc: Daniel Vetter, linux-media, dri-devel, linaro-mm-sig
On Tue, Jul 25, 2017 at 11:11:35AM +0200, Christian König wrote:
> On 24.07.2017 at 13:57, Daniel Vetter wrote:
> > On Mon, Jul 24, 2017 at 11:51 AM, Christian König
> > <deathsimple@vodafone.de> wrote:
> > > On 24.07.2017 at 10:33, Daniel Vetter wrote:
> > > > On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
> > > > > From: Christian König <christian.koenig@amd.com>
> > > > >
> > > > > With hardware resets in mind it is possible that all shared fences are
> > > > > signaled, but the exclusive isn't. Fix waiting for everything in this
> > > > > situation.
> > > > How did you end up with both shared and exclusive fences on the same
> > > > reservation object? At least I thought the point of exclusive was that
> > > > it's exclusive (and has an implicit barrier on all previous shared
> > > > fences). Same for shared fences, they need to wait for the exclusive one
> > > > (and replace it).
> > > >
> > > > Is this fallout from the amdgpu trickery where by default you do all
> > > > shared fences? I thought we've aligned semantics a while back ...
> > >
> > > No, that is perfectly normal even for other drivers. Take a look at the
> > > reservation code.
> > >
> > > The exclusive fence replaces all shared fences, but adding a shared fence
> > > doesn't replace the exclusive fence. That actually makes sense, because when
> > > you want to add more shared fences, those need to wait for the last exclusive
> > > fence as well.
> > Hm right.
> >
> > > Now normally I would agree that when you have shared fences it is sufficient
> > > to wait for all of them because those operations can't start before the
> > > exclusive one finishes. But with GPU reset and/or the ability to abort
> > > already submitted operations it is perfectly possible that you end up with
> > > an exclusive fence which isn't signaled and a shared fence which is signaled
> > > in the same reservation object.
> > How does that work? The batch(es) with the shared fence are all
> > supposed to wait for the exclusive fence before they start, which
> > means even if you gpu reset and restart/cancel certain things, they
> > shouldn't be able to complete out of order.
>
> Assume the following:
> 1. The exclusive fence is some move operation by the kernel which executes
> on a DMA engine.
> 2. The shared fence is a 3D operation submitted by userspace which executes
> on the 3D engine.
>
> Now we find that the 3D engine is hung and needs a reset: all currently
> submitted jobs are aborted, marked with an error code, and their fences are
> put into the signaled state.
>
> Since we only reset the 3D engine, the move operation (fortunately) isn't
> affected by this.
>
> I think this applies to all drivers and isn't something amdgpu specific.
Not i915 because:
- At first we only had a system-wide GPU reset that killed everything, which
means all requests will be completed, not just those on a single engine.
- Now we have per-engine reset, but combined with replaying them (the
offending one gets a no-op batch to avoid re-hanging), to make sure the
dependency tree doesn't fall apart.
Now I see that doing this isn't all that simple, and either way we still
have the case where one driver resets but not the other (in multi-gpu),
but I'm not exactly sure how to best handle this.
What exactly is the downside of not dropping this assumption, i.e. why do
you want this patch? What blows up?
-Daniel
>
> Regards,
> Christian.
>
> >
> > If you outright cancel a fence then you're supposed to first call
> > dma_fence_set_error(-EIO) and then complete it. Note that atm that
> > part might be slightly overengineered and I'm not sure about how we
> > expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or
> > soon, has been) used by i915 for its internal bookkeeping, which
> > might not be the best to leak to other consumers. But completing
> > fences (at least exported ones, where userspace or other drivers can
> > get at them) shouldn't be possible.
> > -Daniel
>
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Thread overview: 13+ messages
2017-07-21 16:20 [PATCH] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly Christian König
2017-07-24 8:33 ` Daniel Vetter
2017-07-24 9:51 ` Christian König
2017-07-24 11:57 ` Daniel Vetter
2017-07-25 6:16 ` zhoucm1
2017-07-25 6:50 ` Daniel Vetter
2017-07-25 6:55 ` zhoucm1
2017-07-25 7:02 ` Daniel Vetter
2017-07-25 7:15 ` zhoucm1
2017-07-25 9:11 ` Christian König
2017-07-25 12:14 ` Daniel Vetter
2017-07-24 8:34 ` zhoucm1
2017-07-24 9:58 ` Christian König