* Question on UAPI for fences
@ 2014-09-12 13:23 Christian König
  2014-09-12 14:09 ` Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2014-09-12 13:23 UTC (permalink / raw)
  To: dri-devel, Maarten Lankhorst, Jerome Glisse; +Cc: gpudriverdevsupport

Hello everyone,

To allow concurrent buffer access by different engines beyond the
multiple readers / single writer model that we currently use in radeon
and other drivers, we need some kind of synchronization object exposed
to userspace.

My initial patch set for this used (or rather abused) zero-sized GEM
buffers as fence handles. That is obviously not the best way of doing
it (too much overhead, rather ugly, etc.); Jerome commented on this
accordingly.

So what should a driver expose instead? Android sync points? Something else?

Please discuss and/or advise,
Christian.


* Re: Question on UAPI for fences
  2014-09-12 13:23 Question on UAPI for fences Christian König
@ 2014-09-12 14:09 ` Daniel Vetter
  2014-09-12 14:43   ` Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2014-09-12 14:09 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport

On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
> Hello everyone,
>
> to allow concurrent buffer access by different engines beyond the multiple
> readers/single writer model that we currently use in radeon and other
> drivers we need some kind of synchonization object exposed to userspace.
>
> My initial patch set for this used (or rather abused) zero sized GEM buffers
> as fence handles. This is obviously isn't the best way of doing this (to
> much overhead, rather ugly etc...), Jerome commented on this accordingly.
>
> So what should a driver expose instead? Android sync points? Something else?

I think actually exposing the struct fence objects as an fd, using Android
syncpts (or at least something compatible with them), is the way to go. The
problem is that it's super-hard to get the Android guys out of hiding for this :(

Adding a bunch of people in the hopes that something sticks.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: Question on UAPI for fences
  2014-09-12 14:09 ` Daniel Vetter
@ 2014-09-12 14:43   ` Daniel Vetter
  2014-09-12 14:50     ` Jerome Glisse
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Vetter @ 2014-09-12 14:43 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	John Harrison, gpudriverdevsupport

On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>> Hello everyone,
>>
>> to allow concurrent buffer access by different engines beyond the multiple
>> readers/single writer model that we currently use in radeon and other
>> drivers we need some kind of synchonization object exposed to userspace.
>>
>> My initial patch set for this used (or rather abused) zero sized GEM buffers
>> as fence handles. This is obviously isn't the best way of doing this (to
>> much overhead, rather ugly etc...), Jerome commented on this accordingly.
>>
>> So what should a driver expose instead? Android sync points? Something else?
>
> I think actually exposing the struct fence objects as a fd, using android
> syncpts (or at least something compatible to it) is the way to go. Problem
> is that it's super-hard to get the android guys out of hiding for this :(
>
> Adding a bunch of people in the hopes that something sticks.

More people.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: Question on UAPI for fences
  2014-09-12 14:43   ` Daniel Vetter
@ 2014-09-12 14:50     ` Jerome Glisse
  2014-09-12 15:13       ` Daniel Vetter
  2014-09-12 15:25       ` Alex Deucher
  0 siblings, 2 replies; 19+ messages in thread
From: Jerome Glisse @ 2014-09-12 14:50 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	John Harrison, Christian König, gpudriverdevsupport

On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
> >> Hello everyone,
> >>
> >> to allow concurrent buffer access by different engines beyond the multiple
> >> readers/single writer model that we currently use in radeon and other
> >> drivers we need some kind of synchonization object exposed to userspace.
> >>
> >> My initial patch set for this used (or rather abused) zero sized GEM buffers
> >> as fence handles. This is obviously isn't the best way of doing this (to
> >> much overhead, rather ugly etc...), Jerome commented on this accordingly.
> >>
> >> So what should a driver expose instead? Android sync points? Something else?
> >
> > I think actually exposing the struct fence objects as a fd, using android
> > syncpts (or at least something compatible to it) is the way to go. Problem
> > is that it's super-hard to get the android guys out of hiding for this :(
> >
> > Adding a bunch of people in the hopes that something sticks.
> 
> More people.

Just to reiterate, exposing such a thing while still using a command
stream ioctl that does implicit synchronization is a waste; you can only
get the lowest common denominator, which is implicit synchronization. So
I do not see the point of such an API if you are not also adding a new
CS ioctl with an explicit contract that it does not do any kind of
synchronization (it could be almost the exact same code, modulo not
waiting for previous commands to complete).

Also, one thing that the Android sync point does not have, AFAICT, is a
way to schedule synchronization as part of a CS ioctl, so the CPU never
has to be involved for command streams that deal with only one GPU
(assuming the driver and hw can do such a trick).

Cheers,
Jérôme

> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: Question on UAPI for fences
  2014-09-12 14:50     ` Jerome Glisse
@ 2014-09-12 15:13       ` Daniel Vetter
  2014-09-12 15:25       ` Alex Deucher
  1 sibling, 0 replies; 19+ messages in thread
From: Daniel Vetter @ 2014-09-12 15:13 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	John Harrison, Christian König, gpudriverdevsupport

On Fri, Sep 12, 2014 at 10:50:49AM -0400, Jerome Glisse wrote:
> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
> > On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
> > >> Hello everyone,
> > >>
> > >> to allow concurrent buffer access by different engines beyond the multiple
> > >> readers/single writer model that we currently use in radeon and other
> > >> drivers we need some kind of synchonization object exposed to userspace.
> > >>
> > >> My initial patch set for this used (or rather abused) zero sized GEM buffers
> > >> as fence handles. This is obviously isn't the best way of doing this (to
> > >> much overhead, rather ugly etc...), Jerome commented on this accordingly.
> > >>
> > >> So what should a driver expose instead? Android sync points? Something else?
> > >
> > > I think actually exposing the struct fence objects as a fd, using android
> > > syncpts (or at least something compatible to it) is the way to go. Problem
> > > is that it's super-hard to get the android guys out of hiding for this :(
> > >
> > > Adding a bunch of people in the hopes that something sticks.
> > 
> > More people.
> 
> Just to re-iterate, exposing such thing while still using command stream
> ioctl that use implicit synchronization is a waste and you can only get
> the lowest common denominator which is implicit synchronization. So i do
> not see the point of such api if you are not also adding a new cs ioctl
> with explicit contract that it does not do any kind of synchronization
> (it could be almost the exact same code modulo the do not wait for
> previous cmd to complete).

I don't think we should categorically exclude this, since without some
partial implicit/explicit world we'll never convert over to fences. Of
course adding fences without any way to at least partially forgo the
implicit syncing is pointless. But that user might be something else
entirely (e.g. a camera capture device) which needs explicit fences.

> Also one thing that the Android sync point does not have, AFAICT, is a
> way to schedule synchronization as part of a cs ioctl so cpu never have
> to be involve for cmd stream that deal only one gpu (assuming the driver
> and hw can do such trick).

You need to integrate the Android stuff with your (new) CS ioctl, with an
input parameter for the fence fd to wait on before executing the CS and
one that gets created to signal when it's all done.

The same goes for all the other places Android wants sync objects, e.g.
for synchronization before atomic flips and for signalling completion of
the same.
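
To make that concrete, here is a rough sketch of what the CS side could
look like; the struct and field names are made up for illustration, not
an existing UAPI:

#include <linux/types.h>

/* Hypothetical sketch only -- illustrative names, not a real driver UAPI. */
struct drm_foo_cs {
        __u64 chunks;      /* existing command stream chunks */
        __u32 num_chunks;
        __u32 flags;       /* e.g. FOO_CS_FENCE_IN | FOO_CS_FENCE_OUT */
        __s32 fence_in;    /* fence fd to wait on before executing the CS */
        __s32 fence_out;   /* fence fd created to signal CS completion */
};
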
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: Question on UAPI for fences
  2014-09-12 14:50     ` Jerome Glisse
  2014-09-12 15:13       ` Daniel Vetter
@ 2014-09-12 15:25       ` Alex Deucher
  2014-09-12 15:33         ` Jerome Glisse
  1 sibling, 1 reply; 19+ messages in thread
From: Alex Deucher @ 2014-09-12 15:25 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, Christian König, John Harrison

On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>> > On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>> >> Hello everyone,
>> >>
>> >> to allow concurrent buffer access by different engines beyond the multiple
>> >> readers/single writer model that we currently use in radeon and other
>> >> drivers we need some kind of synchonization object exposed to userspace.
>> >>
>> >> My initial patch set for this used (or rather abused) zero sized GEM buffers
>> >> as fence handles. This is obviously isn't the best way of doing this (to
>> >> much overhead, rather ugly etc...), Jerome commented on this accordingly.
>> >>
>> >> So what should a driver expose instead? Android sync points? Something else?
>> >
>> > I think actually exposing the struct fence objects as a fd, using android
>> > syncpts (or at least something compatible to it) is the way to go. Problem
>> > is that it's super-hard to get the android guys out of hiding for this :(
>> >
>> > Adding a bunch of people in the hopes that something sticks.
>>
>> More people.
>
> Just to re-iterate, exposing such thing while still using command stream
> ioctl that use implicit synchronization is a waste and you can only get
> the lowest common denominator which is implicit synchronization. So i do
> not see the point of such api if you are not also adding a new cs ioctl
> with explicit contract that it does not do any kind of synchronization
> (it could be almost the exact same code modulo the do not wait for
> previous cmd to complete).

Our thinking was to allow explicit sync within a single process, but
implicit sync between processes.

Alex

>
> Also one thing that the Android sync point does not have, AFAICT, is a
> way to schedule synchronization as part of a cs ioctl so cpu never have
> to be involve for cmd stream that deal only one gpu (assuming the driver
> and hw can do such trick).
>
> Cheers,
> Jérôme
>
>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: Question on UAPI for fences
  2014-09-12 15:25       ` Alex Deucher
@ 2014-09-12 15:33         ` Jerome Glisse
  2014-09-12 15:38           ` Alex Deucher
  2014-09-12 15:42           ` Christian König
  0 siblings, 2 replies; 19+ messages in thread
From: Jerome Glisse @ 2014-09-12 15:33 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, Christian König, John Harrison

On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
> >> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >> > On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
> >> >> Hello everyone,
> >> >>
> >> >> to allow concurrent buffer access by different engines beyond the multiple
> >> >> readers/single writer model that we currently use in radeon and other
> >> >> drivers we need some kind of synchonization object exposed to userspace.
> >> >>
> >> >> My initial patch set for this used (or rather abused) zero sized GEM buffers
> >> >> as fence handles. This is obviously isn't the best way of doing this (to
> >> >> much overhead, rather ugly etc...), Jerome commented on this accordingly.
> >> >>
> >> >> So what should a driver expose instead? Android sync points? Something else?
> >> >
> >> > I think actually exposing the struct fence objects as a fd, using android
> >> > syncpts (or at least something compatible to it) is the way to go. Problem
> >> > is that it's super-hard to get the android guys out of hiding for this :(
> >> >
> >> > Adding a bunch of people in the hopes that something sticks.
> >>
> >> More people.
> >
> > Just to re-iterate, exposing such thing while still using command stream
> > ioctl that use implicit synchronization is a waste and you can only get
> > the lowest common denominator which is implicit synchronization. So i do
> > not see the point of such api if you are not also adding a new cs ioctl
> > with explicit contract that it does not do any kind of synchronization
> > (it could be almost the exact same code modulo the do not wait for
> > previous cmd to complete).
> 
> Our thinking was to allow explicit sync from a single process, but
> implicitly sync between processes.

This is a BIG NAK if you are using the same ioctl, as it would mean you
are changing the userspace API, or at least userspace expectations.
Adding a new CS flag might do the trick, but it should not be about
inter-process or anything special; it's just implicit sync or no
synchronization. Converting userspace is not that much of a big deal
either; it can be broken into several steps, e.g. Mesa uses explicit
synchronization all the time while the DDX uses implicit.
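
For radeon that could be as small as a new bit carried in the existing
RADEON_CHUNK_ID_FLAGS chunk; the name and value below are purely
illustrative, not an existing flag:

/* Hypothetical: skip waiting on previous fences of the referenced BOs. */
#define RADEON_CS_NO_IMPLICIT_SYNC   0x08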

Cheers,
Jérôme

> 
> Alex
> 
> >
> > Also one thing that the Android sync point does not have, AFAICT, is a
> > way to schedule synchronization as part of a cs ioctl so cpu never have
> > to be involve for cmd stream that deal only one gpu (assuming the driver
> > and hw can do such trick).
> >
> > Cheers,
> > Jérôme
> >
> >> -Daniel
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: Question on UAPI for fences
  2014-09-12 15:33         ` Jerome Glisse
@ 2014-09-12 15:38           ` Alex Deucher
  2014-09-12 15:42           ` Christian König
  1 sibling, 0 replies; 19+ messages in thread
From: Alex Deucher @ 2014-09-12 15:38 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, Christian König, John Harrison

On Fri, Sep 12, 2014 at 11:33 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
>> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> > On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>> >> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>> >> > On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>> >> >> Hello everyone,
>> >> >>
>> >> >> to allow concurrent buffer access by different engines beyond the multiple
>> >> >> readers/single writer model that we currently use in radeon and other
>> >> >> drivers we need some kind of synchonization object exposed to userspace.
>> >> >>
>> >> >> My initial patch set for this used (or rather abused) zero sized GEM buffers
>> >> >> as fence handles. This is obviously isn't the best way of doing this (to
>> >> >> much overhead, rather ugly etc...), Jerome commented on this accordingly.
>> >> >>
>> >> >> So what should a driver expose instead? Android sync points? Something else?
>> >> >
>> >> > I think actually exposing the struct fence objects as a fd, using android
>> >> > syncpts (or at least something compatible to it) is the way to go. Problem
>> >> > is that it's super-hard to get the android guys out of hiding for this :(
>> >> >
>> >> > Adding a bunch of people in the hopes that something sticks.
>> >>
>> >> More people.
>> >
>> > Just to re-iterate, exposing such thing while still using command stream
>> > ioctl that use implicit synchronization is a waste and you can only get
>> > the lowest common denominator which is implicit synchronization. So i do
>> > not see the point of such api if you are not also adding a new cs ioctl
>> > with explicit contract that it does not do any kind of synchronization
>> > (it could be almost the exact same code modulo the do not wait for
>> > previous cmd to complete).
>>
>> Our thinking was to allow explicit sync from a single process, but
>> implicitly sync between processes.
>
> This is a BIG NAK if you are using the same ioctl as it would mean you are
> changing userspace API, well at least userspace expectation. Adding a new
> cs flag might do the trick but it should not be about inter-process, or any
> thing special, it's just implicit sync or no synchronization. Converting
> userspace is not that much of a big deal either, it can be broken into
> several step. Like mesa use explicit synchronization all time but ddx use
> implicit.

Right, you'd have to explicitly ask for it to avoid breaking old
userspace.  My point was just that within a single process, it's quite
easy to know exactly what you are doing and handle the synchronization
yourself, while for inter-process there is an assumed implicit sync.

Alex


>
> Cheers,
> Jérôme
>
>>
>> Alex
>>
>> >
>> > Also one thing that the Android sync point does not have, AFAICT, is a
>> > way to schedule synchronization as part of a cs ioctl so cpu never have
>> > to be involve for cmd stream that deal only one gpu (assuming the driver
>> > and hw can do such trick).
>> >
>> > Cheers,
>> > Jérôme
>> >
>> >> -Daniel
>> >> --
>> >> Daniel Vetter
>> >> Software Engineer, Intel Corporation
>> >> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>> > _______________________________________________
>> > dri-devel mailing list
>> > dri-devel@lists.freedesktop.org
>> > http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: Question on UAPI for fences
  2014-09-12 15:33         ` Jerome Glisse
  2014-09-12 15:38           ` Alex Deucher
@ 2014-09-12 15:42           ` Christian König
  2014-09-12 15:48             ` Jerome Glisse
  1 sibling, 1 reply; 19+ messages in thread
From: Christian König @ 2014-09-12 15:42 UTC (permalink / raw)
  To: Jerome Glisse, Alex Deucher
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, John Harrison

On 12.09.2014 at 17:33, Jerome Glisse wrote:
> On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
>> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>>>> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>> On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> to allow concurrent buffer access by different engines beyond the multiple
>>>>>> readers/single writer model that we currently use in radeon and other
>>>>>> drivers we need some kind of synchonization object exposed to userspace.
>>>>>>
>>>>>> My initial patch set for this used (or rather abused) zero sized GEM buffers
>>>>>> as fence handles. This is obviously isn't the best way of doing this (to
>>>>>> much overhead, rather ugly etc...), Jerome commented on this accordingly.
>>>>>>
>>>>>> So what should a driver expose instead? Android sync points? Something else?
>>>>> I think actually exposing the struct fence objects as a fd, using android
>>>>> syncpts (or at least something compatible to it) is the way to go. Problem
>>>>> is that it's super-hard to get the android guys out of hiding for this :(
>>>>>
>>>>> Adding a bunch of people in the hopes that something sticks.
>>>> More people.
>>> Just to re-iterate, exposing such thing while still using command stream
>>> ioctl that use implicit synchronization is a waste and you can only get
>>> the lowest common denominator which is implicit synchronization. So i do
>>> not see the point of such api if you are not also adding a new cs ioctl
>>> with explicit contract that it does not do any kind of synchronization
>>> (it could be almost the exact same code modulo the do not wait for
>>> previous cmd to complete).
>> Our thinking was to allow explicit sync from a single process, but
>> implicitly sync between processes.
> This is a BIG NAK if you are using the same ioctl as it would mean you are
> changing userspace API, well at least userspace expectation. Adding a new
> cs flag might do the trick but it should not be about inter-process, or any
> thing special, it's just implicit sync or no synchronization. Converting
> userspace is not that much of a big deal either, it can be broken into
> several step. Like mesa use explicit synchronization all time but ddx use
> implicit.

The thinking here is that we need to be backward compatible with DRI2/3
and support all kinds of different use cases, like old DDX and new Mesa,
or old Mesa and new DDX, etc.

So in my prototype, if the kernel sees any access to a BO from two
different clients, it falls back to the old behavior of implicitly
synchronizing access to that buffer object. That might not be the
fastest approach, but as far as I can see it is conservative and so
should work under all conditions.

Apart from that, the plan so far was to just hide this feature behind a
couple of command submission flags and new chunks.
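
In pseudo-C the heuristic is roughly the following (a simplified sketch
of the idea, not the actual patch; the type and names are invented):

/* Simplified sketch, not the real radeon code. */
struct sketch_bo {
        void *last_client;   /* last DRM client that touched this BO */
};

static int bo_needs_implicit_sync(struct sketch_bo *bo, void *client,
                                  int explicit_requested)
{
        if (!explicit_requested)
                return 1;    /* old userspace: keep implicit sync */

        if (bo->last_client && bo->last_client != client)
                return 1;    /* BO shared between clients: be conservative */

        bo->last_client = client;
        return 0;            /* single client that opted in: explicit sync */
}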

Regards,
Christian.

>
> Cheers,
> Jérôme
>
>> Alex
>>
>>> Also one thing that the Android sync point does not have, AFAICT, is a
>>> way to schedule synchronization as part of a cs ioctl so cpu never have
>>> to be involve for cmd stream that deal only one gpu (assuming the driver
>>> and hw can do such trick).
>>>
>>> Cheers,
>>> Jérôme
>>>
>>>> -Daniel
>>>> --
>>>> Daniel Vetter
>>>> Software Engineer, Intel Corporation
>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: Question on UAPI for fences
  2014-09-12 15:42           ` Christian König
@ 2014-09-12 15:48             ` Jerome Glisse
  2014-09-12 15:58               ` Christian König
  0 siblings, 1 reply; 19+ messages in thread
From: Jerome Glisse @ 2014-09-12 15:48 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, John Harrison

On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
> Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
> >On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
> >>On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >>>On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
> >>>>On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>>On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
> >>>>>>Hello everyone,
> >>>>>>
> >>>>>>to allow concurrent buffer access by different engines beyond the multiple
> >>>>>>readers/single writer model that we currently use in radeon and other
> >>>>>>drivers we need some kind of synchonization object exposed to userspace.
> >>>>>>
> >>>>>>My initial patch set for this used (or rather abused) zero sized GEM buffers
> >>>>>>as fence handles. This is obviously isn't the best way of doing this (to
> >>>>>>much overhead, rather ugly etc...), Jerome commented on this accordingly.
> >>>>>>
> >>>>>>So what should a driver expose instead? Android sync points? Something else?
> >>>>>I think actually exposing the struct fence objects as a fd, using android
> >>>>>syncpts (or at least something compatible to it) is the way to go. Problem
> >>>>>is that it's super-hard to get the android guys out of hiding for this :(
> >>>>>
> >>>>>Adding a bunch of people in the hopes that something sticks.
> >>>>More people.
> >>>Just to re-iterate, exposing such thing while still using command stream
> >>>ioctl that use implicit synchronization is a waste and you can only get
> >>>the lowest common denominator which is implicit synchronization. So i do
> >>>not see the point of such api if you are not also adding a new cs ioctl
> >>>with explicit contract that it does not do any kind of synchronization
> >>>(it could be almost the exact same code modulo the do not wait for
> >>>previous cmd to complete).
> >>Our thinking was to allow explicit sync from a single process, but
> >>implicitly sync between processes.
> >This is a BIG NAK if you are using the same ioctl as it would mean you are
> >changing userspace API, well at least userspace expectation. Adding a new
> >cs flag might do the trick but it should not be about inter-process, or any
> >thing special, it's just implicit sync or no synchronization. Converting
> >userspace is not that much of a big deal either, it can be broken into
> >several step. Like mesa use explicit synchronization all time but ddx use
> >implicit.
> 
> The thinking here is that we need to be backward compatible for DRI2/3 and
> support all kind of different use cases like old DDX and new Mesa, or old
> Mesa and new DDX etc...
> 
> So for my prototype if the kernel sees any access of a BO from two different
> clients it falls back to the old behavior of implicit synchronization of
> access to the same buffer object. That might not be the fastest approach,
> but is as far as I can see conservative and so should work under all
> conditions.
> 
> Apart from that the planning so far was that we just hide this feature
> behind a couple of command submission flags and new chunks.

Just to reproduce the IRC discussion, I think it's a lot simpler and not
that complex. For an explicit CS ioctl you do not wait for any previous
fence of any of the buffers referenced in the CS ioctl, but you still
associate a new fence with all the buffer objects referenced in it. So
if the next ioctl is an implicit sync ioctl, it will wait and
synchronize properly with the previous explicit CS ioctl. Hence you can
easily have a mix in userspace; the thing is you only get the benefit
once enough of your userspace is using explicit sync.

Note that you still need a way for an explicit CS ioctl to wait on a
previous "explicit" fence, so you need some API to expose a fence per CS
submission.
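
In other words, the submission path would do roughly this (sketch only,
with invented names; wait_fence() stands in for the real fence wait):

struct sketch_fence;                        /* stand-in for struct fence */
void wait_fence(struct sketch_fence *f);    /* stand-in for the fence wait */

struct sketch_cs_bo {
        struct sketch_fence *last_fence;    /* fence of the last CS touching it */
};

static void cs_submit(struct sketch_cs_bo **bos, int num_bos,
                      struct sketch_fence *new_fence, int explicit)
{
        int i;

        for (i = 0; i < num_bos; i++) {
                /* An implicit CS waits for whatever touched the BO before;
                 * an explicit CS skips the wait and trusts userspace. */
                if (!explicit && bos[i]->last_fence)
                        wait_fence(bos[i]->last_fence);

                /* Either way the new fence is attached, so a later implicit
                 * CS (e.g. from an old DDX) still synchronizes correctly. */
                bos[i]->last_fence = new_fence;
        }
}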

Cheers,
Jérôme

> 
> Regards,
> Christian.
> 
> >
> >Cheers,
> >Jérôme
> >
> >>Alex
> >>
> >>>Also one thing that the Android sync point does not have, AFAICT, is a
> >>>way to schedule synchronization as part of a cs ioctl so cpu never have
> >>>to be involve for cmd stream that deal only one gpu (assuming the driver
> >>>and hw can do such trick).
> >>>
> >>>Cheers,
> >>>Jérôme
> >>>
> >>>>-Daniel
> >>>>--
> >>>>Daniel Vetter
> >>>>Software Engineer, Intel Corporation
> >>>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
> >>>_______________________________________________
> >>>dri-devel mailing list
> >>>dri-devel@lists.freedesktop.org
> >>>http://lists.freedesktop.org/mailman/listinfo/dri-devel
> 


* Re: Question on UAPI for fences
  2014-09-12 15:48             ` Jerome Glisse
@ 2014-09-12 15:58               ` Christian König
  2014-09-12 16:03                 ` Jerome Glisse
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2014-09-12 15:58 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, John Harrison

On 12.09.2014 at 17:48, Jerome Glisse wrote:
> On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
>> Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
>>> On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
>>>> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>>>> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>>>>>> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>>> On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>>>>>>>> Hello everyone,
>>>>>>>>
>>>>>>>> to allow concurrent buffer access by different engines beyond the multiple
>>>>>>>> readers/single writer model that we currently use in radeon and other
>>>>>>>> drivers we need some kind of synchonization object exposed to userspace.
>>>>>>>>
>>>>>>>> My initial patch set for this used (or rather abused) zero sized GEM buffers
>>>>>>>> as fence handles. This is obviously isn't the best way of doing this (to
>>>>>>>> much overhead, rather ugly etc...), Jerome commented on this accordingly.
>>>>>>>>
>>>>>>>> So what should a driver expose instead? Android sync points? Something else?
>>>>>>> I think actually exposing the struct fence objects as a fd, using android
>>>>>>> syncpts (or at least something compatible to it) is the way to go. Problem
>>>>>>> is that it's super-hard to get the android guys out of hiding for this :(
>>>>>>>
>>>>>>> Adding a bunch of people in the hopes that something sticks.
>>>>>> More people.
>>>>> Just to re-iterate, exposing such thing while still using command stream
>>>>> ioctl that use implicit synchronization is a waste and you can only get
>>>>> the lowest common denominator which is implicit synchronization. So i do
>>>>> not see the point of such api if you are not also adding a new cs ioctl
>>>>> with explicit contract that it does not do any kind of synchronization
>>>>> (it could be almost the exact same code modulo the do not wait for
>>>>> previous cmd to complete).
>>>> Our thinking was to allow explicit sync from a single process, but
>>>> implicitly sync between processes.
>>> This is a BIG NAK if you are using the same ioctl as it would mean you are
>>> changing userspace API, well at least userspace expectation. Adding a new
>>> cs flag might do the trick but it should not be about inter-process, or any
>>> thing special, it's just implicit sync or no synchronization. Converting
>>> userspace is not that much of a big deal either, it can be broken into
>>> several step. Like mesa use explicit synchronization all time but ddx use
>>> implicit.
>> The thinking here is that we need to be backward compatible for DRI2/3 and
>> support all kind of different use cases like old DDX and new Mesa, or old
>> Mesa and new DDX etc...
>>
>> So for my prototype if the kernel sees any access of a BO from two different
>> clients it falls back to the old behavior of implicit synchronization of
>> access to the same buffer object. That might not be the fastest approach,
>> but is as far as I can see conservative and so should work under all
>> conditions.
>>
>> Apart from that the planning so far was that we just hide this feature
>> behind a couple of command submission flags and new chunks.
> Just to reproduce IRC discussion, i think it's a lot simpler and not that
> complex. For explicit cs ioctl you do not wait for any previous fence of
> any of the buffer referenced in the cs ioctl, but you still associate a
> new fence with all the buffer object referenced in the cs ioctl. So if the
> next ioctl is an implicit sync ioctl it will wait properly and synchronize
> properly with previous explicit cs ioctl. Hence you can easily have a mix
> in userspace thing is you only get benefit once enough of your userspace
> is using explicit.

Yes, that's exactly what my patches currently implement.

The only difference is that in the current planning I implemented it as
a per-BO flag for the command submission, but that was just for testing.
Having a single flag to switch between implicit and explicit
synchronization for the whole CS IOCTL would do equally well.

> Note that you still need a way to have explicit cs ioctl to wait on a
> previos "explicit" fence so you need some api to expose fence per cs
> submission.

Exactly, that's what this mail thread is all about.

As Daniel correctly noted, we need some functionality to get a fence as
the result of a command submission, as well as to pass in a list of
fences to wait for before beginning a command submission.

At least it looks like we are all on the same general line here, it's
just that nobody has a good idea of what the details should look like.

Regards,
Christian.

>
> Cheers,
> Jérôme
>
>> Regards,
>> Christian.
>>
>>> Cheers,
>>> Jérôme
>>>
>>>> Alex
>>>>
>>>>> Also one thing that the Android sync point does not have, AFAICT, is a
>>>>> way to schedule synchronization as part of a cs ioctl so cpu never have
>>>>> to be involve for cmd stream that deal only one gpu (assuming the driver
>>>>> and hw can do such trick).
>>>>>
>>>>> Cheers,
>>>>> Jérôme
>>>>>
>>>>>> -Daniel
>>>>>> --
>>>>>> Daniel Vetter
>>>>>> Software Engineer, Intel Corporation
>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel@lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: Question on UAPI for fences
  2014-09-12 15:58               ` Christian König
@ 2014-09-12 16:03                 ` Jerome Glisse
  2014-09-12 16:08                   ` Christian König
  0 siblings, 1 reply; 19+ messages in thread
From: Jerome Glisse @ 2014-09-12 16:03 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, John Harrison

On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
> Am 12.09.2014 um 17:48 schrieb Jerome Glisse:
> >On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
> >>Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
> >>>On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
> >>>>On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >>>>>On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
> >>>>>>On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>>>>On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
> >>>>>>>>Hello everyone,
> >>>>>>>>
> >>>>>>>>to allow concurrent buffer access by different engines beyond the multiple
> >>>>>>>>readers/single writer model that we currently use in radeon and other
> >>>>>>>>drivers we need some kind of synchonization object exposed to userspace.
> >>>>>>>>
> >>>>>>>>My initial patch set for this used (or rather abused) zero sized GEM buffers
> >>>>>>>>as fence handles. This is obviously isn't the best way of doing this (to
> >>>>>>>>much overhead, rather ugly etc...), Jerome commented on this accordingly.
> >>>>>>>>
> >>>>>>>>So what should a driver expose instead? Android sync points? Something else?
> >>>>>>>I think actually exposing the struct fence objects as a fd, using android
> >>>>>>>syncpts (or at least something compatible to it) is the way to go. Problem
> >>>>>>>is that it's super-hard to get the android guys out of hiding for this :(
> >>>>>>>
> >>>>>>>Adding a bunch of people in the hopes that something sticks.
> >>>>>>More people.
> >>>>>Just to re-iterate, exposing such thing while still using command stream
> >>>>>ioctl that use implicit synchronization is a waste and you can only get
> >>>>>the lowest common denominator which is implicit synchronization. So i do
> >>>>>not see the point of such api if you are not also adding a new cs ioctl
> >>>>>with explicit contract that it does not do any kind of synchronization
> >>>>>(it could be almost the exact same code modulo the do not wait for
> >>>>>previous cmd to complete).
> >>>>Our thinking was to allow explicit sync from a single process, but
> >>>>implicitly sync between processes.
> >>>This is a BIG NAK if you are using the same ioctl as it would mean you are
> >>>changing userspace API, well at least userspace expectation. Adding a new
> >>>cs flag might do the trick but it should not be about inter-process, or any
> >>>thing special, it's just implicit sync or no synchronization. Converting
> >>>userspace is not that much of a big deal either, it can be broken into
> >>>several step. Like mesa use explicit synchronization all time but ddx use
> >>>implicit.
> >>The thinking here is that we need to be backward compatible for DRI2/3 and
> >>support all kind of different use cases like old DDX and new Mesa, or old
> >>Mesa and new DDX etc...
> >>
> >>So for my prototype if the kernel sees any access of a BO from two different
> >>clients it falls back to the old behavior of implicit synchronization of
> >>access to the same buffer object. That might not be the fastest approach,
> >>but is as far as I can see conservative and so should work under all
> >>conditions.
> >>
> >>Apart from that the planning so far was that we just hide this feature
> >>behind a couple of command submission flags and new chunks.
> >Just to reproduce IRC discussion, i think it's a lot simpler and not that
> >complex. For explicit cs ioctl you do not wait for any previous fence of
> >any of the buffer referenced in the cs ioctl, but you still associate a
> >new fence with all the buffer object referenced in the cs ioctl. So if the
> >next ioctl is an implicit sync ioctl it will wait properly and synchronize
> >properly with previous explicit cs ioctl. Hence you can easily have a mix
> >in userspace thing is you only get benefit once enough of your userspace
> >is using explicit.
> 
> Yes, that's exactly what my patches currently implement.
> 
> The only difference is that by current planning I implemented it as a per BO
> flag for the command submission, but that was just for testing. Having a
> single flag to switch between implicit and explicit synchronization for
> whole CS IOCTL would do equally well.

Doing it per BO sounds bogus to me, but otherwise yes, we are in agreement.
As Daniel said, using an fd is most likely the way we want to do it, but
this remains vague.

> 
> >Note that you still need a way to have explicit cs ioctl to wait on a
> >previos "explicit" fence so you need some api to expose fence per cs
> >submission.
> 
> Exactly, that's what this mail thread is all about.
> 
> As Daniel correctly noted you need something like a functionality to get a
> fence as the result of a command submission as well as pass in a list of
> fences to wait for before beginning a command submission.
> 
> At least it looks like we are all on the same general line here, its just
> nobody has a good idea how the details should look like.
> 
> Regards,
> Christian.
> 
> >
> >Cheers,
> >Jérôme
> >
> >>Regards,
> >>Christian.
> >>
> >>>Cheers,
> >>>Jérôme
> >>>
> >>>>Alex
> >>>>
> >>>>>Also one thing that the Android sync point does not have, AFAICT, is a
> >>>>>way to schedule synchronization as part of a cs ioctl so cpu never have
> >>>>>to be involve for cmd stream that deal only one gpu (assuming the driver
> >>>>>and hw can do such trick).
> >>>>>
> >>>>>Cheers,
> >>>>>Jérôme
> >>>>>
> >>>>>>-Daniel
> >>>>>>--
> >>>>>>Daniel Vetter
> >>>>>>Software Engineer, Intel Corporation
> >>>>>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
> >>>>>_______________________________________________
> >>>>>dri-devel mailing list
> >>>>>dri-devel@lists.freedesktop.org
> >>>>>http://lists.freedesktop.org/mailman/listinfo/dri-devel
> 


* Re: Question on UAPI for fences
  2014-09-12 16:03                 ` Jerome Glisse
@ 2014-09-12 16:08                   ` Christian König
  2014-09-12 16:38                     ` John Harrison
  2014-09-12 16:45                     ` Jesse Barnes
  0 siblings, 2 replies; 19+ messages in thread
From: Christian König @ 2014-09-12 16:08 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, John Harrison

> As Daniel said using fd is most likely the way we want to do it but this
> remains vague.
Separating out the discussion of whether it should be an fd or not:
using an fd sounds fine to me in general, but I have some concerns as
well.

For example, what was the maximum number of open FDs per process again?
Could that become a problem? Etc.
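
For reference, that limit is the per-process RLIMIT_NOFILE, which a
client can query (and, within the hard limit, raise) itself; a quick
check looks like this (plain POSIX, nothing driver specific):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl;

        /* RLIMIT_NOFILE caps the number of open fds per process;
         * the soft limit is often 1024 unless it has been raised. */
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
                printf("fd limit: soft %llu, hard %llu\n",
                       (unsigned long long)rl.rlim_cur,
                       (unsigned long long)rl.rlim_max);
        return 0;
}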

Please comment,
Christian.

On 12.09.2014 at 18:03, Jerome Glisse wrote:
> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
>> Am 12.09.2014 um 17:48 schrieb Jerome Glisse:
>>> On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
>>>> Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
>>>>> On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
>>>>>> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>>>>>> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>>>>>>>> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>>>>>> On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>>>>>>>>>> Hello everyone,
>>>>>>>>>>
>>>>>>>>>> to allow concurrent buffer access by different engines beyond the multiple
>>>>>>>>>> readers/single writer model that we currently use in radeon and other
>>>>>>>>>> drivers we need some kind of synchonization object exposed to userspace.
>>>>>>>>>>
>>>>>>>>>> My initial patch set for this used (or rather abused) zero sized GEM buffers
>>>>>>>>>> as fence handles. This is obviously isn't the best way of doing this (to
>>>>>>>>>> much overhead, rather ugly etc...), Jerome commented on this accordingly.
>>>>>>>>>>
>>>>>>>>>> So what should a driver expose instead? Android sync points? Something else?
>>>>>>>>> I think actually exposing the struct fence objects as a fd, using android
>>>>>>>>> syncpts (or at least something compatible to it) is the way to go. Problem
>>>>>>>>> is that it's super-hard to get the android guys out of hiding for this :(
>>>>>>>>>
>>>>>>>>> Adding a bunch of people in the hopes that something sticks.
>>>>>>>> More people.
>>>>>>> Just to re-iterate, exposing such thing while still using command stream
>>>>>>> ioctl that use implicit synchronization is a waste and you can only get
>>>>>>> the lowest common denominator which is implicit synchronization. So i do
>>>>>>> not see the point of such api if you are not also adding a new cs ioctl
>>>>>>> with explicit contract that it does not do any kind of synchronization
>>>>>>> (it could be almost the exact same code modulo the do not wait for
>>>>>>> previous cmd to complete).
>>>>>> Our thinking was to allow explicit sync from a single process, but
>>>>>> implicitly sync between processes.
>>>>> This is a BIG NAK if you are using the same ioctl as it would mean you are
>>>>> changing userspace API, well at least userspace expectation. Adding a new
>>>>> cs flag might do the trick but it should not be about inter-process, or any
>>>>> thing special, it's just implicit sync or no synchronization. Converting
>>>>> userspace is not that much of a big deal either, it can be broken into
>>>>> several step. Like mesa use explicit synchronization all time but ddx use
>>>>> implicit.
>>>> The thinking here is that we need to be backward compatible for DRI2/3 and
>>>> support all kind of different use cases like old DDX and new Mesa, or old
>>>> Mesa and new DDX etc...
>>>>
>>>> So for my prototype if the kernel sees any access of a BO from two different
>>>> clients it falls back to the old behavior of implicit synchronization of
>>>> access to the same buffer object. That might not be the fastest approach,
>>>> but is as far as I can see conservative and so should work under all
>>>> conditions.
>>>>
>>>> Apart from that the planning so far was that we just hide this feature
>>>> behind a couple of command submission flags and new chunks.
>>> Just to reproduce IRC discussion, i think it's a lot simpler and not that
>>> complex. For explicit cs ioctl you do not wait for any previous fence of
>>> any of the buffer referenced in the cs ioctl, but you still associate a
>>> new fence with all the buffer object referenced in the cs ioctl. So if the
>>> next ioctl is an implicit sync ioctl it will wait properly and synchronize
>>> properly with previous explicit cs ioctl. Hence you can easily have a mix
>>> in userspace thing is you only get benefit once enough of your userspace
>>> is using explicit.
>> Yes, that's exactly what my patches currently implement.
>>
>> The only difference is that by current planning I implemented it as a per BO
>> flag for the command submission, but that was just for testing. Having a
>> single flag to switch between implicit and explicit synchronization for
>> whole CS IOCTL would do equally well.
> Doing it per BO sounds bogus to me. But otherwise yes we are in agreement.
> As Daniel said using fd is most likely the way we want to do it but this
> remains vague.
>
>>> Note that you still need a way to have explicit cs ioctl to wait on a
>>> previos "explicit" fence so you need some api to expose fence per cs
>>> submission.
>> Exactly, that's what this mail thread is all about.
>>
>> As Daniel correctly noted you need something like a functionality to get a
>> fence as the result of a command submission as well as pass in a list of
>> fences to wait for before beginning a command submission.
>>
>> At least it looks like we are all on the same general line here, its just
>> nobody has a good idea how the details should look like.
>>
>> Regards,
>> Christian.
>>
>>> Cheers,
>>> Jérôme
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Cheers,
>>>>> Jérôme
>>>>>
>>>>>> Alex
>>>>>>
>>>>>>> Also one thing that the Android sync point does not have, AFAICT, is a
>>>>>>> way to schedule synchronization as part of a cs ioctl so cpu never have
>>>>>>> to be involve for cmd stream that deal only one gpu (assuming the driver
>>>>>>> and hw can do such trick).
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jérôme
>>>>>>>
>>>>>>>> -Daniel
>>>>>>>> --
>>>>>>>> Daniel Vetter
>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>>>> _______________________________________________
>>>>>>> dri-devel mailing list
>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: Question on UAPI for fences
  2014-09-12 16:08                   ` Christian König
@ 2014-09-12 16:38                     ` John Harrison
  2014-09-13 12:25                       ` Christian König
  2014-09-12 16:45                     ` Jesse Barnes
  1 sibling, 1 reply; 19+ messages in thread
From: John Harrison @ 2014-09-12 16:38 UTC (permalink / raw)
  To: Christian König, Jerome Glisse
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport

On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
> pass in a list of fences to wait for before beginning a command
> submission.

The Android implementation has a mechanism for combining multiple sync
points into a brand new single sync point. Thus APIs only ever need to
take in a single fd, not a list of them. If the user wants an operation
to wait for multiple events to occur, then it is up to them to request
the combined version first. They can then happily close the individual
fds that have been combined and only keep the one big one around.
Indeed, even that fd can be closed once it has been passed on to some
other API.

Doing such combining and cleaning up fds as soon as they have been
passed on should keep each application's fd usage fairly small.
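
With Android's libsync, as far as I remember it, that pattern is just a
merge call; treat the exact names and signatures here as approximate
rather than authoritative:

/* Rough usage sketch against Android's libsync (sync_merge/sync_wait);
 * prototypes reproduced from memory, to be double-checked. */
#include <unistd.h>

int sync_merge(const char *name, int fd1, int fd2);  /* returns new fd */
int sync_wait(int fd, int timeout);                  /* timeout in ms */

int wait_for_both(int fence_a, int fence_b)
{
        int merged = sync_merge("both", fence_a, fence_b);
        if (merged < 0)
                return merged;

        /* The individual fds can be closed right away... */
        close(fence_a);
        close(fence_b);

        /* ...and only the combined fence is kept around or passed on. */
        sync_wait(merged, -1);
        close(merged);
        return 0;
}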


On 12/09/2014 17:08, Christian König wrote:
>> As Daniel said using fd is most likely the way we want to do it but this
>> remains vague.
> Separating the discussion if it should be an fd or not. Using an fd 
> sounds fine to me in general, but I have some concerns as well.
>
> For example what was the maximum number of opened FDs per process 
> again? Could that become a problem? etc...
>
> Please comment,
> Christian.
>
> Am 12.09.2014 um 18:03 schrieb Jerome Glisse:
>> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
>>> Am 12.09.2014 um 17:48 schrieb Jerome Glisse:
>>>> On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
>>>>> Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
>>>>>> On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
>>>>>>> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse 
>>>>>>> <j.glisse@gmail.com> wrote:
>>>>>>>> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>>>>>>>>> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter 
>>>>>>>>> <daniel@ffwll.ch> wrote:
>>>>>>>>>> On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>
>>>>>>>>>>> to allow concurrent buffer access by different engines 
>>>>>>>>>>> beyond the multiple
>>>>>>>>>>> readers/single writer model that we currently use in radeon 
>>>>>>>>>>> and other
>>>>>>>>>>> drivers we need some kind of synchonization object exposed 
>>>>>>>>>>> to userspace.
>>>>>>>>>>>
>>>>>>>>>>> My initial patch set for this used (or rather abused) zero 
>>>>>>>>>>> sized GEM buffers
>>>>>>>>>>> as fence handles. This is obviously isn't the best way of 
>>>>>>>>>>> doing this (to
>>>>>>>>>>> much overhead, rather ugly etc...), Jerome commented on this 
>>>>>>>>>>> accordingly.
>>>>>>>>>>>
>>>>>>>>>>> So what should a driver expose instead? Android sync points? 
>>>>>>>>>>> Something else?
>>>>>>>>>> I think actually exposing the struct fence objects as a fd, 
>>>>>>>>>> using android
>>>>>>>>>> syncpts (or at least something compatible to it) is the way 
>>>>>>>>>> to go. Problem
>>>>>>>>>> is that it's super-hard to get the android guys out of hiding 
>>>>>>>>>> for this :(
>>>>>>>>>>
>>>>>>>>>> Adding a bunch of people in the hopes that something sticks.
>>>>>>>>> More people.
>>>>>>>> Just to re-iterate, exposing such thing while still using 
>>>>>>>> command stream
>>>>>>>> ioctl that use implicit synchronization is a waste and you can 
>>>>>>>> only get
>>>>>>>> the lowest common denominator which is implicit 
>>>>>>>> synchronization. So i do
>>>>>>>> not see the point of such api if you are not also adding a new 
>>>>>>>> cs ioctl
>>>>>>>> with explicit contract that it does not do any kind of 
>>>>>>>> synchronization
>>>>>>>> (it could be almost the exact same code modulo the do not wait for
>>>>>>>> previous cmd to complete).
>>>>>>> Our thinking was to allow explicit sync from a single process, but
>>>>>>> implicitly sync between processes.
>>>>>> This is a BIG NAK if you are using the same ioctl as it would 
>>>>>> mean you are
>>>>>> changing userspace API, well at least userspace expectation. 
>>>>>> Adding a new
>>>>>> cs flag might do the trick but it should not be about 
>>>>>> inter-process, or any
>>>>>> thing special, it's just implicit sync or no synchronization. 
>>>>>> Converting
>>>>>> userspace is not that much of a big deal either, it can be broken 
>>>>>> into
>>>>>> several step. Like mesa use explicit synchronization all time but 
>>>>>> ddx use
>>>>>> implicit.
>>>>> The thinking here is that we need to be backward compatible for 
>>>>> DRI2/3 and
>>>>> support all kind of different use cases like old DDX and new Mesa, 
>>>>> or old
>>>>> Mesa and new DDX etc...
>>>>>
>>>>> So for my prototype if the kernel sees any access of a BO from two 
>>>>> different
>>>>> clients it falls back to the old behavior of implicit 
>>>>> synchronization of
>>>>> access to the same buffer object. That might not be the fastest 
>>>>> approach,
>>>>> but is as far as I can see conservative and so should work under all
>>>>> conditions.
>>>>>
>>>>> Apart from that the planning so far was that we just hide this 
>>>>> feature
>>>>> behind a couple of command submission flags and new chunks.
>>>> Just to reproduce IRC discussion, i think it's a lot simpler and 
>>>> not that
>>>> complex. For explicit cs ioctl you do not wait for any previous 
>>>> fence of
>>>> any of the buffer referenced in the cs ioctl, but you still 
>>>> associate a
>>>> new fence with all the buffer object referenced in the cs ioctl. So 
>>>> if the
>>>> next ioctl is an implicit sync ioctl it will wait properly and 
>>>> synchronize
>>>> properly with previous explicit cs ioctl. Hence you can easily have 
>>>> a mix
>>>> in userspace thing is you only get benefit once enough of your 
>>>> userspace
>>>> is using explicit.
>>> Yes, that's exactly what my patches currently implement.
>>>
>>> The only difference is that by current planning I implemented it as 
>>> a per BO
>>> flag for the command submission, but that was just for testing. 
>>> Having a
>>> single flag to switch between implicit and explicit synchronization for
>>> whole CS IOCTL would do equally well.
>> Doing it per BO sounds bogus to me. But otherwise yes we are in 
>> agreement.
>> As Daniel said using fd is most likely the way we want to do it but this
>> remains vague.
>>
>>>> Note that you still need a way to have explicit cs ioctl to wait on a
>>>> previos "explicit" fence so you need some api to expose fence per cs
>>>> submission.
>>> Exactly, that's what this mail thread is all about.
>>>
>>> As Daniel correctly noted you need something like a functionality to 
>>> get a
>>> fence as the result of a command submission as well as pass in a 
>>> list of
>>> fences to wait for before beginning a command submission.
>>>
>>> At least it looks like we are all on the same general line here, its 
>>> just
>>> nobody has a good idea how the details should look like.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Cheers,
>>>> Jérôme
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Cheers,
>>>>>> Jérôme
>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>> Also one thing that the Android sync point does not have, 
>>>>>>>> AFAICT, is a
>>>>>>>> way to schedule synchronization as part of a cs ioctl so cpu 
>>>>>>>> never have
>>>>>>>> to be involve for cmd stream that deal only one gpu (assuming 
>>>>>>>> the driver
>>>>>>>> and hw can do such trick).
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jérôme
>>>>>>>>
>>>>>>>>> -Daniel
>>>>>>>>> -- 
>>>>>>>>> Daniel Vetter
>>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>>>>> _______________________________________________
>>>>>>>> dri-devel mailing list
>>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Question on UAPI for fences
  2014-09-12 16:08                   ` Christian König
  2014-09-12 16:38                     ` John Harrison
@ 2014-09-12 16:45                     ` Jesse Barnes
  1 sibling, 0 replies; 19+ messages in thread
From: Jesse Barnes @ 2014-09-12 16:45 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	John Harrison, gpudriverdevsupport

On Fri, 12 Sep 2014 18:08:23 +0200
Christian König <christian.koenig@amd.com> wrote:

> > As Daniel said using fd is most likely the way we want to do it but this
> > remains vague.
> Separating the discussion if it should be an fd or not. Using an fd 
> sounds fine to me in general, but I have some concerns as well.
> 
> For example what was the maximum number of opened FDs per process again? 
> Could that become a problem? etc...

You can check out the i915 patches I posted if you want to see
examples.  The per-process fd limit may become an issue if userspace
doesn't clean up its fences.  The implementation is pretty easy with the
stuff Maarten has done recently.
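
For reference, the per-process limit is easy to query from userspace; a
trivial, self-contained example (not tied to any driver API):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	/* RLIMIT_NOFILE: maximum number of open file descriptors */
	if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
		printf("fd limit: soft %llu, hard %llu\n",
		       (unsigned long long)rl.rlim_cur,
		       (unsigned long long)rl.rlim_max);
	return 0;
}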

The changes I still need to make to mine:
  - sit on top of Chris's request/seqno changes (driver internals
    really)
  - switch over to execbuf as the main API on the render side (like
    you're doing)
  - add support for display and other timelines

As far as compatibility goes, I don't think it should be too hard.  Even
with GPU scheduling, a given context's buffers should all be in order with
respect to one another, so we ought to be able to mix & match clients
using explicit fencing and implicit fencing.  Though in Mesa I still
haven't looked at how to handle server- vs. client-side ARB_sync with the
scheduler and explicit fencing in place; that might need some extra work
there...
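
To make that concrete, a minimal sketch of what a CS/execbuf interface
with optional explicit fences could look like — every name and flag below
is hypothetical, this is not the real i915 or radeon UAPI:

#include <linux/types.h>

/* Hypothetical sketch only; no existing UAPI is being described here. */
#define CS_FLAG_FENCE_IN   (1u << 0)  /* wait on fence_in_fd before running */
#define CS_FLAG_FENCE_OUT  (1u << 1)  /* return a fence fd for this submission */

struct cs_submit {
	__u64 chunks;        /* pointer to command stream chunks */
	__u32 num_chunks;
	__u32 flags;         /* neither flag set: implicit sync as today */
	__s32 fence_in_fd;   /* consumed when CS_FLAG_FENCE_IN is set */
	__s32 fence_out_fd;  /* written by the kernel when CS_FLAG_FENCE_OUT is set */
};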

-- 
Jesse Barnes, Intel Open Source Technology Center
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Question on UAPI for fences
  2014-09-12 16:38                     ` John Harrison
@ 2014-09-13 12:25                       ` Christian König
  2014-09-14  0:32                         ` Marek Olšák
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2014-09-13 12:25 UTC (permalink / raw)
  To: John Harrison, Jerome Glisse
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport

> Doing such combining and cleaning up fds as soon as they have been 
> passed on should keep each application's fd usage fairly small.
Yeah, but this is exactly what we wanted to avoid internally because of 
the IOCTL overhead.

And thinking more about it, for our driver-internal use we will 
definitely hit limitations with the number of FDs in use and with the 
overhead of creating and closing them. With the execution model we 
target for the long term we will need something like 10k fences per 
second or more.

How about this: we use a per-client identifier for the fence internally 
and only export it as a sync point fd when we need to expose it to 
somebody else. That is very similar to how we currently have GEM handles 
internally and only export a DMA-buf fd when we need to share them.
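
A rough sketch of that idea, modelled on the existing
GEM-handle-to-DMA-buf-fd export; the struct and ioctl number here are
made up purely for illustration:

#include <linux/types.h>
#include <linux/ioctl.h>

/* Hypothetical: turn a driver-internal, per-client fence id into a sync
 * point fd only when it actually has to cross a process/driver boundary. */
struct drv_fence_handle_to_fd {
	__u32 handle;  /* driver-internal fence identifier */
	__u32 flags;
	__s32 fd;      /* returned sync point fd */
	__u32 pad;
};

#define DRV_IOCTL_FENCE_HANDLE_TO_FD \
	_IOWR('d', 0x40, struct drv_fence_handle_to_fd)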

Regards,
Christian.

On 12.09.2014 at 18:38, John Harrison wrote:
> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
> > pass in a list of fences to wait for before beginning a command 
> submission.
>
> The Android implementation has a mechanism for combining multiple sync 
> points into a brand new single sync pt. Thus APIs only ever need to 
> take in a single fd not a list of them. If the user wants an operation 
> to wait for multiple events to occur then it is up to them to request 
> the combined version first. They can then happily close the individual 
> fds that have been combined and only keep the one big one around. 
> Indeed, even that fd can be closed once it has been passed on to some 
> other API.
>
> Doing such combining and cleaning up fds as soon as they have been 
> passed on should keep each application's fd usage fairly small.
>
>
> On 12/09/2014 17:08, Christian König wrote:
>>> As Daniel said using fd is most likely the way we want to do it but 
>>> this
>>> remains vague.
>> Separating the discussion if it should be an fd or not. Using an fd 
>> sounds fine to me in general, but I have some concerns as well.
>>
>> For example what was the maximum number of opened FDs per process 
>> again? Could that become a problem? etc...
>>
>> Please comment,
>> Christian.
>>
>> Am 12.09.2014 um 18:03 schrieb Jerome Glisse:
>>> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
>>>> Am 12.09.2014 um 17:48 schrieb Jerome Glisse:
>>>>> On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
>>>>>> Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
>>>>>>> On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
>>>>>>>> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse 
>>>>>>>> <j.glisse@gmail.com> wrote:
>>>>>>>>> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>>>>>>>>>> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter 
>>>>>>>>>> <daniel@ffwll.ch> wrote:
>>>>>>>>>>> On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> to allow concurrent buffer access by different engines 
>>>>>>>>>>>> beyond the multiple
>>>>>>>>>>>> readers/single writer model that we currently use in radeon 
>>>>>>>>>>>> and other
>>>>>>>>>>>> drivers we need some kind of synchonization object exposed 
>>>>>>>>>>>> to userspace.
>>>>>>>>>>>>
>>>>>>>>>>>> My initial patch set for this used (or rather abused) zero 
>>>>>>>>>>>> sized GEM buffers
>>>>>>>>>>>> as fence handles. This is obviously isn't the best way of 
>>>>>>>>>>>> doing this (to
>>>>>>>>>>>> much overhead, rather ugly etc...), Jerome commented on 
>>>>>>>>>>>> this accordingly.
>>>>>>>>>>>>
>>>>>>>>>>>> So what should a driver expose instead? Android sync 
>>>>>>>>>>>> points? Something else?
>>>>>>>>>>> I think actually exposing the struct fence objects as a fd, 
>>>>>>>>>>> using android
>>>>>>>>>>> syncpts (or at least something compatible to it) is the way 
>>>>>>>>>>> to go. Problem
>>>>>>>>>>> is that it's super-hard to get the android guys out of 
>>>>>>>>>>> hiding for this :(
>>>>>>>>>>>
>>>>>>>>>>> Adding a bunch of people in the hopes that something sticks.
>>>>>>>>>> More people.
>>>>>>>>> Just to re-iterate, exposing such thing while still using 
>>>>>>>>> command stream
>>>>>>>>> ioctl that use implicit synchronization is a waste and you can 
>>>>>>>>> only get
>>>>>>>>> the lowest common denominator which is implicit 
>>>>>>>>> synchronization. So i do
>>>>>>>>> not see the point of such api if you are not also adding a new 
>>>>>>>>> cs ioctl
>>>>>>>>> with explicit contract that it does not do any kind of 
>>>>>>>>> synchronization
>>>>>>>>> (it could be almost the exact same code modulo the do not wait 
>>>>>>>>> for
>>>>>>>>> previous cmd to complete).
>>>>>>>> Our thinking was to allow explicit sync from a single process, but
>>>>>>>> implicitly sync between processes.
>>>>>>> This is a BIG NAK if you are using the same ioctl as it would 
>>>>>>> mean you are
>>>>>>> changing userspace API, well at least userspace expectation. 
>>>>>>> Adding a new
>>>>>>> cs flag might do the trick but it should not be about 
>>>>>>> inter-process, or any
>>>>>>> thing special, it's just implicit sync or no synchronization. 
>>>>>>> Converting
>>>>>>> userspace is not that much of a big deal either, it can be 
>>>>>>> broken into
>>>>>>> several step. Like mesa use explicit synchronization all time 
>>>>>>> but ddx use
>>>>>>> implicit.
>>>>>> The thinking here is that we need to be backward compatible for 
>>>>>> DRI2/3 and
>>>>>> support all kind of different use cases like old DDX and new 
>>>>>> Mesa, or old
>>>>>> Mesa and new DDX etc...
>>>>>>
>>>>>> So for my prototype if the kernel sees any access of a BO from 
>>>>>> two different
>>>>>> clients it falls back to the old behavior of implicit 
>>>>>> synchronization of
>>>>>> access to the same buffer object. That might not be the fastest 
>>>>>> approach,
>>>>>> but is as far as I can see conservative and so should work under all
>>>>>> conditions.
>>>>>>
>>>>>> Apart from that the planning so far was that we just hide this 
>>>>>> feature
>>>>>> behind a couple of command submission flags and new chunks.
>>>>> Just to reproduce IRC discussion, i think it's a lot simpler and 
>>>>> not that
>>>>> complex. For explicit cs ioctl you do not wait for any previous 
>>>>> fence of
>>>>> any of the buffer referenced in the cs ioctl, but you still 
>>>>> associate a
>>>>> new fence with all the buffer object referenced in the cs ioctl. 
>>>>> So if the
>>>>> next ioctl is an implicit sync ioctl it will wait properly and 
>>>>> synchronize
>>>>> properly with previous explicit cs ioctl. Hence you can easily 
>>>>> have a mix
>>>>> in userspace thing is you only get benefit once enough of your 
>>>>> userspace
>>>>> is using explicit.
>>>> Yes, that's exactly what my patches currently implement.
>>>>
>>>> The only difference is that by current planning I implemented it as 
>>>> a per BO
>>>> flag for the command submission, but that was just for testing. 
>>>> Having a
>>>> single flag to switch between implicit and explicit synchronization 
>>>> for
>>>> whole CS IOCTL would do equally well.
>>> Doing it per BO sounds bogus to me. But otherwise yes we are in 
>>> agreement.
>>> As Daniel said using fd is most likely the way we want to do it but 
>>> this
>>> remains vague.
>>>
>>>>> Note that you still need a way to have explicit cs ioctl to wait on a
>>>>> previos "explicit" fence so you need some api to expose fence per cs
>>>>> submission.
>>>> Exactly, that's what this mail thread is all about.
>>>>
>>>> As Daniel correctly noted you need something like a functionality 
>>>> to get a
>>>> fence as the result of a command submission as well as pass in a 
>>>> list of
>>>> fences to wait for before beginning a command submission.
>>>>
>>>> At least it looks like we are all on the same general line here, 
>>>> its just
>>>> nobody has a good idea how the details should look like.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Cheers,
>>>>> Jérôme
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Cheers,
>>>>>>> Jérôme
>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>>> Also one thing that the Android sync point does not have, 
>>>>>>>>> AFAICT, is a
>>>>>>>>> way to schedule synchronization as part of a cs ioctl so cpu 
>>>>>>>>> never have
>>>>>>>>> to be involve for cmd stream that deal only one gpu (assuming 
>>>>>>>>> the driver
>>>>>>>>> and hw can do such trick).
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Jérôme
>>>>>>>>>
>>>>>>>>>> -Daniel
>>>>>>>>>> -- 
>>>>>>>>>> Daniel Vetter
>>>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>>>>>> _______________________________________________
>>>>>>>>> dri-devel mailing list
>>>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Question on UAPI for fences
  2014-09-13 12:25                       ` Christian König
@ 2014-09-14  0:32                         ` Marek Olšák
  2014-09-14 10:36                           ` Christian König
  0 siblings, 1 reply; 19+ messages in thread
From: Marek Olšák @ 2014-09-14  0:32 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, John Harrison

BTW, we can recycle fences in userspace just like we recycle buffers.
That should make the create/close overhead non-existent.
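
A minimal userspace sketch of such recycling, assuming only that a fence
fd can be handed back and reused for a later submission (that reuse path
itself is hypothetical):

#include <unistd.h>

/* Keep a small freelist of fence fds instead of creating/closing one per
 * submission; -1 means "no cached fd, ask the kernel for a fresh one". */
#define FENCE_CACHE_SIZE 64

struct fence_cache {
	int fds[FENCE_CACHE_SIZE];
	int count;
};

static int fence_cache_get(struct fence_cache *c)
{
	return c->count ? c->fds[--c->count] : -1;
}

static void fence_cache_put(struct fence_cache *c, int fd)
{
	if (c->count < FENCE_CACHE_SIZE)
		c->fds[c->count++] = fd;
	else
		close(fd);
}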

Marek

On Sat, Sep 13, 2014 at 2:25 PM, Christian König
<christian.koenig@amd.com> wrote:
>> Doing such combining and cleaning up fds as soon as they have been passed
>> on should keep each application's fd usage fairly small.
>
> Yeah, but this is exactly what we wanted to avoid internally because of the
> IOCTL overhead.
>
> And thinking more about it for our driver internal use we will definitely
> hit some limitations with the number of FDs in use and the overhead for
> creating and closing them. With the execution model we target for the long
> term we will need something like 10k fences per second or more.
>
> How about this: We use an identifier per client for the fence internally and
> when we need to somehow expose it to somebody else export it as sync point
> fd. Very similar to how we currently have GEM handles internally and when we
> need to expose them export a DMA_buf fd.
>
> Regards,
> Christian.
>
> Am 12.09.2014 um 18:38 schrieb John Harrison:
>
>> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
>> > pass in a list of fences to wait for before beginning a command
>> > submission.
>>
>> The Android implementation has a mechanism for combining multiple sync
>> points into a brand new single sync pt. Thus APIs only ever need to take in
>> a single fd not a list of them. If the user wants an operation to wait for
>> multiple events to occur then it is up to them to request the combined
>> version first. They can then happily close the individual fds that have been
>> combined and only keep the one big one around. Indeed, even that fd can be
>> closed once it has been passed on to some other API.
>>
>> Doing such combining and cleaning up fds as soon as they have been passed
>> on should keep each application's fd usage fairly small.
>>
>>
>> On 12/09/2014 17:08, Christian König wrote:
>>>>
>>>> As Daniel said using fd is most likely the way we want to do it but this
>>>> remains vague.
>>>
>>> Separating the discussion if it should be an fd or not. Using an fd
>>> sounds fine to me in general, but I have some concerns as well.
>>>
>>> For example what was the maximum number of opened FDs per process again?
>>> Could that become a problem? etc...
>>>
>>> Please comment,
>>> Christian.
>>>
>>> Am 12.09.2014 um 18:03 schrieb Jerome Glisse:
>>>>
>>>> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
>>>>>
>>>>> Am 12.09.2014 um 17:48 schrieb Jerome Glisse:
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
>>>>>>>
>>>>>>> Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
>>>>>>>>
>>>>>>>> On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse
>>>>>>>>> <j.glisse@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> to allow concurrent buffer access by different engines beyond
>>>>>>>>>>>>> the multiple
>>>>>>>>>>>>> readers/single writer model that we currently use in radeon and
>>>>>>>>>>>>> other
>>>>>>>>>>>>> drivers we need some kind of synchonization object exposed to
>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My initial patch set for this used (or rather abused) zero
>>>>>>>>>>>>> sized GEM buffers
>>>>>>>>>>>>> as fence handles. This is obviously isn't the best way of doing
>>>>>>>>>>>>> this (to
>>>>>>>>>>>>> much overhead, rather ugly etc...), Jerome commented on this
>>>>>>>>>>>>> accordingly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So what should a driver expose instead? Android sync points?
>>>>>>>>>>>>> Something else?
>>>>>>>>>>>>
>>>>>>>>>>>> I think actually exposing the struct fence objects as a fd,
>>>>>>>>>>>> using android
>>>>>>>>>>>> syncpts (or at least something compatible to it) is the way to
>>>>>>>>>>>> go. Problem
>>>>>>>>>>>> is that it's super-hard to get the android guys out of hiding
>>>>>>>>>>>> for this :(
>>>>>>>>>>>>
>>>>>>>>>>>> Adding a bunch of people in the hopes that something sticks.
>>>>>>>>>>>
>>>>>>>>>>> More people.
>>>>>>>>>>
>>>>>>>>>> Just to re-iterate, exposing such thing while still using command
>>>>>>>>>> stream
>>>>>>>>>> ioctl that use implicit synchronization is a waste and you can
>>>>>>>>>> only get
>>>>>>>>>> the lowest common denominator which is implicit synchronization.
>>>>>>>>>> So i do
>>>>>>>>>> not see the point of such api if you are not also adding a new cs
>>>>>>>>>> ioctl
>>>>>>>>>> with explicit contract that it does not do any kind of
>>>>>>>>>> synchronization
>>>>>>>>>> (it could be almost the exact same code modulo the do not wait for
>>>>>>>>>> previous cmd to complete).
>>>>>>>>>
>>>>>>>>> Our thinking was to allow explicit sync from a single process, but
>>>>>>>>> implicitly sync between processes.
>>>>>>>>
>>>>>>>> This is a BIG NAK if you are using the same ioctl as it would mean
>>>>>>>> you are
>>>>>>>> changing userspace API, well at least userspace expectation. Adding
>>>>>>>> a new
>>>>>>>> cs flag might do the trick but it should not be about inter-process,
>>>>>>>> or any
>>>>>>>> thing special, it's just implicit sync or no synchronization.
>>>>>>>> Converting
>>>>>>>> userspace is not that much of a big deal either, it can be broken
>>>>>>>> into
>>>>>>>> several step. Like mesa use explicit synchronization all time but
>>>>>>>> ddx use
>>>>>>>> implicit.
>>>>>>>
>>>>>>> The thinking here is that we need to be backward compatible for
>>>>>>> DRI2/3 and
>>>>>>> support all kind of different use cases like old DDX and new Mesa, or
>>>>>>> old
>>>>>>> Mesa and new DDX etc...
>>>>>>>
>>>>>>> So for my prototype if the kernel sees any access of a BO from two
>>>>>>> different
>>>>>>> clients it falls back to the old behavior of implicit synchronization
>>>>>>> of
>>>>>>> access to the same buffer object. That might not be the fastest
>>>>>>> approach,
>>>>>>> but is as far as I can see conservative and so should work under all
>>>>>>> conditions.
>>>>>>>
>>>>>>> Apart from that the planning so far was that we just hide this
>>>>>>> feature
>>>>>>> behind a couple of command submission flags and new chunks.
>>>>>>
>>>>>> Just to reproduce IRC discussion, i think it's a lot simpler and not
>>>>>> that
>>>>>> complex. For explicit cs ioctl you do not wait for any previous fence
>>>>>> of
>>>>>> any of the buffer referenced in the cs ioctl, but you still associate
>>>>>> a
>>>>>> new fence with all the buffer object referenced in the cs ioctl. So if
>>>>>> the
>>>>>> next ioctl is an implicit sync ioctl it will wait properly and
>>>>>> synchronize
>>>>>> properly with previous explicit cs ioctl. Hence you can easily have a
>>>>>> mix
>>>>>> in userspace thing is you only get benefit once enough of your
>>>>>> userspace
>>>>>> is using explicit.
>>>>>
>>>>> Yes, that's exactly what my patches currently implement.
>>>>>
>>>>> The only difference is that by current planning I implemented it as a
>>>>> per BO
>>>>> flag for the command submission, but that was just for testing. Having
>>>>> a
>>>>> single flag to switch between implicit and explicit synchronization for
>>>>> whole CS IOCTL would do equally well.
>>>>
>>>> Doing it per BO sounds bogus to me. But otherwise yes we are in
>>>> agreement.
>>>> As Daniel said using fd is most likely the way we want to do it but this
>>>> remains vague.
>>>>
>>>>>> Note that you still need a way to have explicit cs ioctl to wait on a
>>>>>> previos "explicit" fence so you need some api to expose fence per cs
>>>>>> submission.
>>>>>
>>>>> Exactly, that's what this mail thread is all about.
>>>>>
>>>>> As Daniel correctly noted you need something like a functionality to
>>>>> get a
>>>>> fence as the result of a command submission as well as pass in a list
>>>>> of
>>>>> fences to wait for before beginning a command submission.
>>>>>
>>>>> At least it looks like we are all on the same general line here, its
>>>>> just
>>>>> nobody has a good idea how the details should look like.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Cheers,
>>>>>> Jérôme
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jérôme
>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>> Also one thing that the Android sync point does not have, AFAICT,
>>>>>>>>>> is a
>>>>>>>>>> way to schedule synchronization as part of a cs ioctl so cpu never
>>>>>>>>>> have
>>>>>>>>>> to be involve for cmd stream that deal only one gpu (assuming the
>>>>>>>>>> driver
>>>>>>>>>> and hw can do such trick).
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Jérôme
>>>>>>>>>>
>>>>>>>>>>> -Daniel
>>>>>>>>>>> --
>>>>>>>>>>> Daniel Vetter
>>>>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> dri-devel mailing list
>>>>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>
>>>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Question on UAPI for fences
  2014-09-14  0:32                         ` Marek Olšák
@ 2014-09-14 10:36                           ` Christian König
  2014-09-15  8:46                             ` Daniel Vetter
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2014-09-14 10:36 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	gpudriverdevsupport, John Harrison

Yeah, right. Providing an existing fd to be reassigned to a new fence 
would indeed reduce the create/close overhead.

But it would still be more overhead than, for example, a simple on-demand 
growing ring buffer that uses 64-bit sequence numbers in userspace to 
refer to a fence in the kernel.
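
For illustration, the userspace side of such a scheme could be as small
as this — the shared layout is entirely made up, and a real version would
need proper memory ordering:

#include <linux/types.h>
#include <stdbool.h>

/* Hypothetical: the kernel bumps last_signalled in a page shared with the
 * client; userspace refers to a fence simply by its 64-bit sequence number. */
struct fence_ring {
	volatile __u64 last_signalled;  /* written by the kernel */
	__u32 size;                     /* entries, grown on demand */
	__u32 pad;
};

static inline bool fence_is_signalled(const struct fence_ring *r, __u64 seqno)
{
	return r->last_signalled >= seqno;
}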

Apart from that, I'm pretty sure that when we do the syncing completely 
in userspace we will need more fences open at the same time than fds are 
available by default.

As long as our internal handle- or sequence-based fences are easily 
convertible to a fence fd, I don't really see a problem with that. I'm 
going to hack that approach into my prototype and then we can see how 
bad the code looks after all.

Christian.

On 14.09.2014 at 02:32, Marek Olšák wrote:
> BTW, we can recycle fences in userspace just like we recycle buffers.
> That should make the create/close overhead non-existent.
>
> Marek
>
> On Sat, Sep 13, 2014 at 2:25 PM, Christian König
> <christian.koenig@amd.com> wrote:
>>> Doing such combining and cleaning up fds as soon as they have been passed
>>> on should keep each application's fd usage fairly small.
>> Yeah, but this is exactly what we wanted to avoid internally because of the
>> IOCTL overhead.
>>
>> And thinking more about it for our driver internal use we will definitely
>> hit some limitations with the number of FDs in use and the overhead for
>> creating and closing them. With the execution model we target for the long
>> term we will need something like 10k fences per second or more.
>>
>> How about this: We use an identifier per client for the fence internally and
>> when we need to somehow expose it to somebody else export it as sync point
>> fd. Very similar to how we currently have GEM handles internally and when we
>> need to expose them export a DMA_buf fd.
>>
>> Regards,
>> Christian.
>>
>> Am 12.09.2014 um 18:38 schrieb John Harrison:
>>
>>> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
>>>> pass in a list of fences to wait for before beginning a command
>>>> submission.
>>> The Android implementation has a mechanism for combining multiple sync
>>> points into a brand new single sync pt. Thus APIs only ever need to take in
>>> a single fd not a list of them. If the user wants an operation to wait for
>>> multiple events to occur then it is up to them to request the combined
>>> version first. They can then happily close the individual fds that have been
>>> combined and only keep the one big one around. Indeed, even that fd can be
>>> closed once it has been passed on to some other API.
>>>
>>> Doing such combining and cleaning up fds as soon as they have been passed
>>> on should keep each application's fd usage fairly small.
>>>
>>>
>>> On 12/09/2014 17:08, Christian König wrote:
>>>>> As Daniel said using fd is most likely the way we want to do it but this
>>>>> remains vague.
>>>> Separating the discussion if it should be an fd or not. Using an fd
>>>> sounds fine to me in general, but I have some concerns as well.
>>>>
>>>> For example what was the maximum number of opened FDs per process again?
>>>> Could that become a problem? etc...
>>>>
>>>> Please comment,
>>>> Christian.
>>>>
>>>> Am 12.09.2014 um 18:03 schrieb Jerome Glisse:
>>>>> On Fri, Sep 12, 2014 at 05:58:09PM +0200, Christian König wrote:
>>>>>> Am 12.09.2014 um 17:48 schrieb Jerome Glisse:
>>>>>>> On Fri, Sep 12, 2014 at 05:42:57PM +0200, Christian König wrote:
>>>>>>>> Am 12.09.2014 um 17:33 schrieb Jerome Glisse:
>>>>>>>>> On Fri, Sep 12, 2014 at 11:25:12AM -0400, Alex Deucher wrote:
>>>>>>>>>> On Fri, Sep 12, 2014 at 10:50 AM, Jerome Glisse
>>>>>>>>>> <j.glisse@gmail.com> wrote:
>>>>>>>>>>> On Fri, Sep 12, 2014 at 04:43:44PM +0200, Daniel Vetter wrote:
>>>>>>>>>>>> On Fri, Sep 12, 2014 at 4:09 PM, Daniel Vetter <daniel@ffwll.ch>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> On Fri, Sep 12, 2014 at 03:23:22PM +0200, Christian König wrote:
>>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> to allow concurrent buffer access by different engines beyond
>>>>>>>>>>>>>> the multiple
>>>>>>>>>>>>>> readers/single writer model that we currently use in radeon and
>>>>>>>>>>>>>> other
>>>>>>>>>>>>>> drivers we need some kind of synchonization object exposed to
>>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My initial patch set for this used (or rather abused) zero
>>>>>>>>>>>>>> sized GEM buffers
>>>>>>>>>>>>>> as fence handles. This is obviously isn't the best way of doing
>>>>>>>>>>>>>> this (to
>>>>>>>>>>>>>> much overhead, rather ugly etc...), Jerome commented on this
>>>>>>>>>>>>>> accordingly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So what should a driver expose instead? Android sync points?
>>>>>>>>>>>>>> Something else?
>>>>>>>>>>>>> I think actually exposing the struct fence objects as a fd,
>>>>>>>>>>>>> using android
>>>>>>>>>>>>> syncpts (or at least something compatible to it) is the way to
>>>>>>>>>>>>> go. Problem
>>>>>>>>>>>>> is that it's super-hard to get the android guys out of hiding
>>>>>>>>>>>>> for this :(
>>>>>>>>>>>>>
>>>>>>>>>>>>> Adding a bunch of people in the hopes that something sticks.
>>>>>>>>>>>> More people.
>>>>>>>>>>> Just to re-iterate, exposing such thing while still using command
>>>>>>>>>>> stream
>>>>>>>>>>> ioctl that use implicit synchronization is a waste and you can
>>>>>>>>>>> only get
>>>>>>>>>>> the lowest common denominator which is implicit synchronization.
>>>>>>>>>>> So i do
>>>>>>>>>>> not see the point of such api if you are not also adding a new cs
>>>>>>>>>>> ioctl
>>>>>>>>>>> with explicit contract that it does not do any kind of
>>>>>>>>>>> synchronization
>>>>>>>>>>> (it could be almost the exact same code modulo the do not wait for
>>>>>>>>>>> previous cmd to complete).
>>>>>>>>>> Our thinking was to allow explicit sync from a single process, but
>>>>>>>>>> implicitly sync between processes.
>>>>>>>>> This is a BIG NAK if you are using the same ioctl as it would mean
>>>>>>>>> you are
>>>>>>>>> changing userspace API, well at least userspace expectation. Adding
>>>>>>>>> a new
>>>>>>>>> cs flag might do the trick but it should not be about inter-process,
>>>>>>>>> or any
>>>>>>>>> thing special, it's just implicit sync or no synchronization.
>>>>>>>>> Converting
>>>>>>>>> userspace is not that much of a big deal either, it can be broken
>>>>>>>>> into
>>>>>>>>> several step. Like mesa use explicit synchronization all time but
>>>>>>>>> ddx use
>>>>>>>>> implicit.
>>>>>>>> The thinking here is that we need to be backward compatible for
>>>>>>>> DRI2/3 and
>>>>>>>> support all kind of different use cases like old DDX and new Mesa, or
>>>>>>>> old
>>>>>>>> Mesa and new DDX etc...
>>>>>>>>
>>>>>>>> So for my prototype if the kernel sees any access of a BO from two
>>>>>>>> different
>>>>>>>> clients it falls back to the old behavior of implicit synchronization
>>>>>>>> of
>>>>>>>> access to the same buffer object. That might not be the fastest
>>>>>>>> approach,
>>>>>>>> but is as far as I can see conservative and so should work under all
>>>>>>>> conditions.
>>>>>>>>
>>>>>>>> Apart from that the planning so far was that we just hide this
>>>>>>>> feature
>>>>>>>> behind a couple of command submission flags and new chunks.
>>>>>>> Just to reproduce IRC discussion, i think it's a lot simpler and not
>>>>>>> that
>>>>>>> complex. For explicit cs ioctl you do not wait for any previous fence
>>>>>>> of
>>>>>>> any of the buffer referenced in the cs ioctl, but you still associate
>>>>>>> a
>>>>>>> new fence with all the buffer object referenced in the cs ioctl. So if
>>>>>>> the
>>>>>>> next ioctl is an implicit sync ioctl it will wait properly and
>>>>>>> synchronize
>>>>>>> properly with previous explicit cs ioctl. Hence you can easily have a
>>>>>>> mix
>>>>>>> in userspace thing is you only get benefit once enough of your
>>>>>>> userspace
>>>>>>> is using explicit.
>>>>>> Yes, that's exactly what my patches currently implement.
>>>>>>
>>>>>> The only difference is that by current planning I implemented it as a
>>>>>> per BO
>>>>>> flag for the command submission, but that was just for testing. Having
>>>>>> a
>>>>>> single flag to switch between implicit and explicit synchronization for
>>>>>> whole CS IOCTL would do equally well.
>>>>> Doing it per BO sounds bogus to me. But otherwise yes we are in
>>>>> agreement.
>>>>> As Daniel said using fd is most likely the way we want to do it but this
>>>>> remains vague.
>>>>>
>>>>>>> Note that you still need a way to have explicit cs ioctl to wait on a
>>>>>>> previos "explicit" fence so you need some api to expose fence per cs
>>>>>>> submission.
>>>>>> Exactly, that's what this mail thread is all about.
>>>>>>
>>>>>> As Daniel correctly noted you need something like a functionality to
>>>>>> get a
>>>>>> fence as the result of a command submission as well as pass in a list
>>>>>> of
>>>>>> fences to wait for before beginning a command submission.
>>>>>>
>>>>>> At least it looks like we are all on the same general line here, its
>>>>>> just
>>>>>> nobody has a good idea how the details should look like.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Cheers,
>>>>>>> Jérôme
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Jérôme
>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>>> Also one thing that the Android sync point does not have, AFAICT,
>>>>>>>>>>> is a
>>>>>>>>>>> way to schedule synchronization as part of a cs ioctl so cpu never
>>>>>>>>>>> have
>>>>>>>>>>> to be involve for cmd stream that deal only one gpu (assuming the
>>>>>>>>>>> driver
>>>>>>>>>>> and hw can do such trick).
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Jérôme
>>>>>>>>>>>
>>>>>>>>>>>> -Daniel
>>>>>>>>>>>> --
>>>>>>>>>>>> Daniel Vetter
>>>>>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>>>>>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> dri-devel mailing list
>>>>>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Question on UAPI for fences
  2014-09-14 10:36                           ` Christian König
@ 2014-09-15  8:46                             ` Daniel Vetter
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Vetter @ 2014-09-15  8:46 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Zach Pfeffer, dri-devel, linaro-mm-sig,
	John Harrison, gpudriverdevsupport

On Sun, Sep 14, 2014 at 12:36:43PM +0200, Christian König wrote:
> Yeah, right. Providing the fd to reassign to a fence would indeed reduce the
> create/close overhead.
> 
> But it would still be more overhead than for example a simple on demand
> growing ring buffer which then uses 64bit sequence numbers in userspace to
> refer to a fence in the kernel.
> 
> Apart from that I'm pretty sure that when we do the syncing completely in
> userspace we need more fences open at the same time than fds are available
> by default.

If you do the syncing completely in userspace you don't need kernel fences
at all. Kernel fences are only required if you sync with a different
process (where the pure userspace syncing might not work out) or with
different devices.
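
For the cross-process case the nice property of an fd is that the other
side can just poll() it once it has been passed over (assuming the fd
behaves like an Android sync point, i.e. it becomes readable when the
fence signals):

#include <poll.h>

/* Wait for a fence fd received from another process (e.g. via SCM_RIGHTS). */
static int fence_fd_wait(int fd, int timeout_ms)
{
	struct pollfd p = { .fd = fd, .events = POLLIN };
	int ret = poll(&p, 1, timeout_ms);

	if (ret < 0)
		return -1;     /* poll error */
	return ret ? 0 : 1;    /* 0: fence signalled, 1: timed out */
}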

tbh I don't see any use-case at all where you'd need 10k such fences. That
means your driver gets to deal with two kinds of fences, but so be it.
Not using fds for cross-device or cross-process syncing just doesn't make
sense imo, so that one pretty much will have to stick.

> As long as our internal handle or sequence based fence are easily
> convertible to a fence fd I actually don't really see a problem with that.
> Going to hack that approach into my prototype and then we can see how bad
> the code looks after all.

My plan for i915 is to start out with fd fences only, and once we have
some clarity on the exact requirements, probably add some pure
userspace-controlled fences for tightly coupled stuff. Those might be
fully internal to the OpenCL userspace driver, though, and never get out
of there, ever.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-09-15  8:52 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-12 13:23 Question on UAPI for fences Christian König
2014-09-12 14:09 ` Daniel Vetter
2014-09-12 14:43   ` Daniel Vetter
2014-09-12 14:50     ` Jerome Glisse
2014-09-12 15:13       ` Daniel Vetter
2014-09-12 15:25       ` Alex Deucher
2014-09-12 15:33         ` Jerome Glisse
2014-09-12 15:38           ` Alex Deucher
2014-09-12 15:42           ` Christian König
2014-09-12 15:48             ` Jerome Glisse
2014-09-12 15:58               ` Christian König
2014-09-12 16:03                 ` Jerome Glisse
2014-09-12 16:08                   ` Christian König
2014-09-12 16:38                     ` John Harrison
2014-09-13 12:25                       ` Christian König
2014-09-14  0:32                         ` Marek Olšák
2014-09-14 10:36                           ` Christian König
2014-09-15  8:46                             ` Daniel Vetter
2014-09-12 16:45                     ` Jesse Barnes
