* Re: about mmap dma-buf and sync
       [not found] <55D50442.5080102@intel.com>
@ 2015-08-20  6:48 ` Thomas Hellstrom
  2015-08-20 14:33   ` Rob Clark
                     ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-20  6:48 UTC (permalink / raw)
  To: Tiago Vignatti; +Cc: Thomas Hellstrom, dri-devel

Hi, Tiago!

On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>
> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html

Hmm, for some reason it doesn't show up in my mail app, but I found it
in the archives. An attempt to explain the situation from the vmwgfx
perspective.

The fact that the interface is generic means that people will start
using it for the zero-copy case. There have been a couple of more or
less hackish attempts to do this before. If it's a _driver_ interface
we don't need to be that careful, but if it is a _generic_ interface we
need to be very careful to make it fit *all* the hardware out there,
and to make sure all potential users use the interface in a way that
conforms to the interface specification.

What will happen otherwise is that apps written for coherent, fast
hardware might, for example, ignore calling the SYNC api, just because
the app writer only cared about his own hardware, on which the app
works fine. That would fail miserably if the same app were run on
incoherent hardware, or the incoherent-hardware driver maintainers
would be forced to base an implementation on page faults, which would
be very slow.

So assume the following use case: an app updates a 10x10 area using
the CPU on a 1600x1200 dma-buf, and will then use the dma-buf for
texturing. On some hardware the dma-buf might be tiled in a very
specific way; on vmwgfx the dma-buf is a GPU buffer on the host, only
accessible using DMA. On vmwgfx the SYNC operation must carry out a
10x10 DMA from the host GPU buffer to a guest CPU buffer before the
CPU write, and a DMA back again after the write, before GPU usage. On
the tiled architecture the SYNC operation must untile before CPU
access and probably tile again before GPU access.

If we now have a one-dimensional SYNC api, in this particular case we'd
either need to sync a far too large area (1600x10) or call SYNC 10 times
before writing, and then again after writing. If the app forgot to call
SYNC we must error.

So to summarize, the fact that the interface is generic IMO means:

1) Any user must be able to make valid assumptions about the internal
format of the dma-buf (untiled, color format, stride, etc.).
2) Any user *must* call SYNC before and after CPU access. On coherent
architectures the SYNC is a null operation anyway, and that should be
documented somewhere so that maintainers of drivers for incoherent
architectures have somewhere to point their fingers.
3) Driver-specific implementations must be allowed to error (segfault)
if SYNC has not been used.
4) The use case stated above clearly shows the benefit of a
two-dimensional sync interface (we want to sync the 10x10 region), but
what if someone in the future wants to use this interface for a 3D
texture? Will a 2D sync suffice? Can we make the SYNC interface
extendable in such a way that an enum sync_type member defines the
layout of the argument, and initially implement only 1D and 2D sync,
leaving 3D for the future?

Also, I agree there is probably no good way to generically implement an
error if SYNC has not been called. That needs to be left as an option to
drivers.

Thanks,
Thomas


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: about mmap dma-buf and sync
  2015-08-20  6:48 ` about mmap dma-buf and sync Thomas Hellstrom
@ 2015-08-20 14:33   ` Rob Clark
  2015-08-20 19:27     ` Thomas Hellstrom
  2015-08-20 14:53   ` Jerome Glisse
  2015-08-21 23:06   ` Tiago Vignatti
  2 siblings, 1 reply; 42+ messages in thread
From: Rob Clark @ 2015-08-20 14:33 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Thu, Aug 20, 2015 at 2:48 AM, Thomas Hellstrom <thellstrom@vmware.com> wrote:
> Hi, Tiago!
>
> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>
>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>
> Hmm, for some reason it doesn't show up in my mail app, but I found it
> in the archives. An attempt to explain the situation from the vmwgfx
> perspective.
>
> The fact that the interface is generic means that people will start
> using it for the zero-copy case. There has been a couple of more or less
> hackish attempts to do this before, and if it's a _driver_ interface we
> don't need to be that careful but if it is a _generic_ interface we need
> to be very careful to make it fit *all* the hardware out there and that
> we make all potential users use the interface in a way that conforms
> with the interface specification.
>
> What will happen otherwise is that apps written for coherent fast
> hardware might, for example, ignore calling the SYNC api, just because
> the app writer only cared about his own hardware on which the app works
> fine. That would fail miserably if the same app was run on incoherent
> hardware, or the incoherent hardware driver maintainers would be forced
> to base an implementation on page-faults which would be very slow.
>
> So assume the following use case: An app updates a 10x10 area using the
> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
> texturing. On some hardware the dma-buf might be tiled in a very
> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
> accessible using DMA. On vmwgfx the SYNC operation must carry out a
> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
> write and a DMA back again after the write, before GPU usage. On the
> tiled architecture the SYNC operation must untile before CPU access and
> probably tile again before GPU access.
>
> If we now have a one-dimensional SYNC api, in this particular case we'd
> either need to sync a far too large area (1600x10) or call SYNC 10 times
> before writing, and then again after writing. If the app forgot to call
> SYNC we must error.

just curious, but couldn't you batch up the ten 10x1 syncs?

BR,
-R

>
> So to summarize, the fact that the interface is generic IMO means:
>
> 1) Any user must be able to make valid assumptions about the internal
> format of the dma-buf. (untiled, color format, stride etc.)
> 2) Any user *must* call SYNC before and after CPU access. On coherent
> architectures, the SYNC is a NULL operation anyway, and that should be
> documented somewhere so that maintainers of drivers of uncoherent
> architectures have somewhere to point their fingers.
> 3) Driver-specific implementations must be allowed to error (segfault)
> if SYNC has not been used.
> 4) The use-case stated above clearly shows the benefit of a
> 2-dimensional sync interface (we want to sync the 10x10 region), but
> what if someone in the future wants to use this interface for a 3D
> texture? Will a 2D sync suffice? Can we make the SYNC interface
> extendable in a way that an enum sync_type member defines the layout of
> the argument, and initially we implement only 1d, 2d sync, leaving 3d
> for the future?
>
> Also, I agree there is probably no good way to generically implement an
> error if SYNC has not been called. That needs to be left as an option to
> drivers.
>
> Thanks,
> Thomas
>
>


* Re: about mmap dma-buf and sync
  2015-08-20  6:48 ` about mmap dma-buf and sync Thomas Hellstrom
  2015-08-20 14:33   ` Rob Clark
@ 2015-08-20 14:53   ` Jerome Glisse
  2015-08-20 19:39     ` Thomas Hellstrom
  2015-08-21 23:06   ` Tiago Vignatti
  2 siblings, 1 reply; 42+ messages in thread
From: Jerome Glisse @ 2015-08-20 14:53 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
> Hi, Tiago!
> 
> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
> > Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
> >
> > http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
> 
> Hmm, for some reason it doesn't show up in my mail app, but I found it
> in the archives. An attempt to explain the situation from the vmwgfx
> perspective.
> 
> The fact that the interface is generic means that people will start
> using it for the zero-copy case. There has been a couple of more or less
> hackish attempts to do this before, and if it's a _driver_ interface we
> don't need to be that careful but if it is a _generic_ interface we need
> to be very careful to make it fit *all* the hardware out there and that
> we make all potential users use the interface in a way that conforms
> with the interface specification.
> 
> What will happen otherwise is that apps written for coherent fast
> hardware might, for example, ignore calling the SYNC api, just because
> the app writer only cared about his own hardware on which the app works
> fine. That would fail miserably if the same app was run on incoherent
> hardware, or the incoherent hardware driver maintainers would be forced
> to base an implementation on page-faults which would be very slow.
> 
> So assume the following use case: An app updates a 10x10 area using the
> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
> texturing. On some hardware the dma-buf might be tiled in a very
> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
> accessible using DMA. On vmwgfx the SYNC operation must carry out a
> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
> write and a DMA back again after the write, before GPU usage. On the
> tiled architecture the SYNC operation must untile before CPU access and
> probably tile again before GPU access.
> 
> If we now have a one-dimensional SYNC api, in this particular case we'd
> either need to sync a far too large area (1600x10) or call SYNC 10 times
> before writing, and then again after writing. If the app forgot to call
> SYNC we must error.
> 
> So to summarize, the fact that the interface is generic IMO means:
> 
> 1) Any user must be able to make valid assumptions about the internal
> format of the dma-buf. (untiled, color format, stride etc.)
> 2) Any user *must* call SYNC before and after CPU access. On coherent
> architectures, the SYNC is a NULL operation anyway, and that should be
> documented somewhere so that maintainers of drivers of uncoherent
> architectures have somewhere to point their fingers.
> 3) Driver-specific implementations must be allowed to error (segfault)
> if SYNC has not been used.

I think here you are too lax; the common code must segfault or error
badly if SYNC has not been used, in all cases, even on cache-coherent
architectures. The device driver's sync callback can still be a no-op,
but I think we need to insist strongly on a proper sync call being
made (and we should forbid empty-area sync calls). This is the only
way to make sure userspace behaves properly, as otherwise we end up in
the situation you were describing above, where the app is designed on
a cache-coherent architecture and works fine there, but breaks in
subtle ways on a non-cache-coherent architecture, and the app
developer is clueless as to why.

I just do not trust userspace.

> 4) The use-case stated above clearly shows the benefit of a
> 2-dimensional sync interface (we want to sync the 10x10 region), but
> what if someone in the future wants to use this interface for a 3D
> texture? Will a 2D sync suffice? Can we make the SYNC interface
> extendable in a way that an enum sync_type member defines the layout of
> the argument, and initially we implement only 1d, 2d sync, leaving 3d
> for the future?
> 
> Also, I agree there is probably no good way to generically implement an
> error if SYNC has not been called. That needs to be left as an option to
> drivers.

I think there is: just forbid any further use of the dma-buf, mark it
as invalid, and printk a big error. The userspace app developer will
quickly see that something is wrong, and looking at the kernel log
should explain why.

Cheers,
Jérôme


* Re: about mmap dma-buf and sync
  2015-08-20 14:33   ` Rob Clark
@ 2015-08-20 19:27     ` Thomas Hellstrom
  2015-08-20 19:32       ` Thomas Hellstrom
  2015-08-21 16:42       ` Rob Clark
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-20 19:27 UTC (permalink / raw)
  To: dri-devel

On 08/20/2015 04:33 PM, Rob Clark wrote:
> On Thu, Aug 20, 2015 at 2:48 AM, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>> Hi, Tiago!
>>
>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>
>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>> in the archives. An attempt to explain the situation from the vmwgfx
>> perspective.
>>
>> The fact that the interface is generic means that people will start
>> using it for the zero-copy case. There has been a couple of more or less
>> hackish attempts to do this before, and if it's a _driver_ interface we
>> don't need to be that careful but if it is a _generic_ interface we need
>> to be very careful to make it fit *all* the hardware out there and that
>> we make all potential users use the interface in a way that conforms
>> with the interface specification.
>>
>> What will happen otherwise is that apps written for coherent fast
>> hardware might, for example, ignore calling the SYNC api, just because
>> the app writer only cared about his own hardware on which the app works
>> fine. That would fail miserably if the same app was run on incoherent
>> hardware, or the incoherent hardware driver maintainers would be forced
>> to base an implementation on page-faults which would be very slow.
>>
>> So assume the following use case: An app updates a 10x10 area using the
>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>> texturing. On some hardware the dma-buf might be tiled in a very
>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>> write and a DMA back again after the write, before GPU usage. On the
>> tiled architecture the SYNC operation must untile before CPU access and
>> probably tile again before GPU access.
>>
>> If we now have a one-dimensional SYNC api, in this particular case we'd
>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>> before writing, and then again after writing. If the app forgot to call
>> SYNC we must error.
> just curious, but couldn't you batch up the 10 10x1 sync's?

Yes, that would work up to the first CPU access. Subsequent syncs
would need to be carried out immediately, or all PTEs would need to be
unmapped to detect the next CPU access. Write-only syncs could
probably be batched unconditionally.

/Thomas




* Re: about mmap dma-buf and sync
  2015-08-20 19:27     ` Thomas Hellstrom
@ 2015-08-20 19:32       ` Thomas Hellstrom
  2015-08-21 16:42       ` Rob Clark
  1 sibling, 0 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-20 19:32 UTC (permalink / raw)
  To: dri-devel

On 08/20/2015 09:27 PM, Thomas Hellstrom wrote:
> On 08/20/2015 04:33 PM, Rob Clark wrote:
>> On Thu, Aug 20, 2015 at 2:48 AM, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>> Hi, Tiago!
>>>
>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>>
>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>>> in the archives. An attempt to explain the situation from the vmwgfx
>>> perspective.
>>>
>>> The fact that the interface is generic means that people will start
>>> using it for the zero-copy case. There has been a couple of more or less
>>> hackish attempts to do this before, and if it's a _driver_ interface we
>>> don't need to be that careful but if it is a _generic_ interface we need
>>> to be very careful to make it fit *all* the hardware out there and that
>>> we make all potential users use the interface in a way that conforms
>>> with the interface specification.
>>>
>>> What will happen otherwise is that apps written for coherent fast
>>> hardware might, for example, ignore calling the SYNC api, just because
>>> the app writer only cared about his own hardware on which the app works
>>> fine. That would fail miserably if the same app was run on incoherent
>>> hardware, or the incoherent hardware driver maintainers would be forced
>>> to base an implementation on page-faults which would be very slow.
>>>
>>> So assume the following use case: An app updates a 10x10 area using the
>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>>> texturing. On some hardware the dma-buf might be tiled in a very
>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>>> write and a DMA back again after the write, before GPU usage. On the
>>> tiled architecture the SYNC operation must untile before CPU access and
>>> probably tile again before GPU access.
>>>
>>> If we now have a one-dimensional SYNC api, in this particular case we'd
>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>>> before writing, and then again after writing. If the app forgot to call
>>> SYNC we must error.
>> just curious, but couldn't you batch up the 10 10x1 sync's?
> Yes that would work up to the first CPU access. Subsequent syncs would
> need to be carried out immediately or all ptes would need to be unmapped
> to detect the next CPU access. Write only syncs could probably be
> batched unconditionally.
>
> /Thomas

But aside from the problem of subsequent syncs after the first CPU
access, does user-space really want to call sync for each line?
Probably not, but that's a problem that can be postponed (2D sync
getting a separate IOCTL) until someone gets tired of calling 1D
syncs. My feeling, however, is that that will happen rather quickly,
and that at least 2D syncs will be a common use case.

/Thomas


* Re: about mmap dma-buf and sync
  2015-08-20 14:53   ` Jerome Glisse
@ 2015-08-20 19:39     ` Thomas Hellstrom
  2015-08-20 20:34       ` Jerome Glisse
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-20 19:39 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Thomas Hellstrom, dri-devel

On 08/20/2015 04:53 PM, Jerome Glisse wrote:
> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
>> Hi, Tiago!
>>
>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>
>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>> in the archives. An attempt to explain the situation from the vmwgfx
>> perspective.
>>
>> The fact that the interface is generic means that people will start
>> using it for the zero-copy case. There has been a couple of more or less
>> hackish attempts to do this before, and if it's a _driver_ interface we
>> don't need to be that careful but if it is a _generic_ interface we need
>> to be very careful to make it fit *all* the hardware out there and that
>> we make all potential users use the interface in a way that conforms
>> with the interface specification.
>>
>> What will happen otherwise is that apps written for coherent fast
>> hardware might, for example, ignore calling the SYNC api, just because
>> the app writer only cared about his own hardware on which the app works
>> fine. That would fail miserably if the same app was run on incoherent
>> hardware, or the incoherent hardware driver maintainers would be forced
>> to base an implementation on page-faults which would be very slow.
>>
>> So assume the following use case: An app updates a 10x10 area using the
>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>> texturing. On some hardware the dma-buf might be tiled in a very
>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>> write and a DMA back again after the write, before GPU usage. On the
>> tiled architecture the SYNC operation must untile before CPU access and
>> probably tile again before GPU access.
>>
>> If we now have a one-dimensional SYNC api, in this particular case we'd
>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>> before writing, and then again after writing. If the app forgot to call
>> SYNC we must error.
>>
>> So to summarize, the fact that the interface is generic IMO means:
>>
>> 1) Any user must be able to make valid assumptions about the internal
>> format of the dma-buf. (untiled, color format, stride etc.)
>> 2) Any user *must* call SYNC before and after CPU access. On coherent
>> architectures, the SYNC is a NULL operation anyway, and that should be
>> documented somewhere so that maintainers of drivers of uncoherent
>> architectures have somewhere to point their fingers.
>> 3) Driver-specific implementations must be allowed to error (segfault)
>> if SYNC has not been used.
> I think here you are too lax, the common code must segfault or
> error badly if SYNC has not been use in all cases even on cache
> coherent arch. The device driver sync callback can still be a
> no operation. But i think that we need to insist strongly on a
> proper sync call being made (and we should forbid empty area
> sync call). This would be the only way to make sure userspace
> behave properly as otherwise we endup in the situation you were
> describing above, where the app is design on a cache coherent
> arch and works fine there but broke in subtle way on non cache
> coherent arch and app developer is clueless of why.
>
> I just do not trust userspace.
I agree, and ideally I'd want it this way as well. The question is:
is it possible? Can this be done without a fault() handler in the
generic kernel code?

>
>> 4) The use-case stated above clearly shows the benefit of a
>> 2-dimensional sync interface (we want to sync the 10x10 region), but
>> what if someone in the future wants to use this interface for a 3D
>> texture? Will a 2D sync suffice? Can we make the SYNC interface
>> extendable in a way that an enum sync_type member defines the layout of
>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
>> for the future?
>>
>> Also, I agree there is probably no good way to generically implement an
>> error if SYNC has not been called. That needs to be left as an option to
>> drivers.
> I think there is, just forbid any further use of the dma buffer, mark
> it as invalid and printk a big error. Userspace app developer will
> quickly see that something is wrong and looking at kernel log should
> explain why.

The problem is if someone calls mmap() and then decides not to access
the buffer before munmap() or GPU usage. How do we recognize that case
and separate it from a CPU access that occurred outside a sync? We
could, as mentioned above, have a fault() handler in the generic
kernel code and make sure drivers don't populate PTEs until the first
fault(). Another option would be to scan all PTEs for an "accessed"
flag, but I'm not even sure all CPU architectures have such a flag.

But there might be a simpler solution that I have overlooked?

/Thomas

>
> Cheers,
> Jérôme


* Re: about mmap dma-buf and sync
  2015-08-20 19:39     ` Thomas Hellstrom
@ 2015-08-20 20:34       ` Jerome Glisse
  2015-08-21  5:25         ` Thomas Hellstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Jerome Glisse @ 2015-08-20 20:34 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Thomas Hellstrom, dri-devel

On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
> > On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
> >> Hi, Tiago!
> >>
> >> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
> >>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
> >>>
> >>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
> >> Hmm, for some reason it doesn't show up in my mail app, but I found it
> >> in the archives. An attempt to explain the situation from the vmwgfx
> >> perspective.
> >>
> >> The fact that the interface is generic means that people will start
> >> using it for the zero-copy case. There has been a couple of more or less
> >> hackish attempts to do this before, and if it's a _driver_ interface we
> >> don't need to be that careful but if it is a _generic_ interface we need
> >> to be very careful to make it fit *all* the hardware out there and that
> >> we make all potential users use the interface in a way that conforms
> >> with the interface specification.
> >>
> >> What will happen otherwise is that apps written for coherent fast
> >> hardware might, for example, ignore calling the SYNC api, just because
> >> the app writer only cared about his own hardware on which the app works
> >> fine. That would fail miserably if the same app was run on incoherent
> >> hardware, or the incoherent hardware driver maintainers would be forced
> >> to base an implementation on page-faults which would be very slow.
> >>
> >> So assume the following use case: An app updates a 10x10 area using the
> >> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
> >> texturing. On some hardware the dma-buf might be tiled in a very
> >> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
> >> accessible using DMA. On vmwgfx the SYNC operation must carry out a
> >> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
> >> write and a DMA back again after the write, before GPU usage. On the
> >> tiled architecture the SYNC operation must untile before CPU access and
> >> probably tile again before GPU access.
> >>
> >> If we now have a one-dimensional SYNC api, in this particular case we'd
> >> either need to sync a far too large area (1600x10) or call SYNC 10 times
> >> before writing, and then again after writing. If the app forgot to call
> >> SYNC we must error.
> >>
> >> So to summarize, the fact that the interface is generic IMO means:
> >>
> >> 1) Any user must be able to make valid assumptions about the internal
> >> format of the dma-buf. (untiled, color format, stride etc.)
> >> 2) Any user *must* call SYNC before and after CPU access. On coherent
> >> architectures, the SYNC is a NULL operation anyway, and that should be
> >> documented somewhere so that maintainers of drivers of uncoherent
> >> architectures have somewhere to point their fingers.
> >> 3) Driver-specific implementations must be allowed to error (segfault)
> >> if SYNC has not been used.
> > I think here you are too lax, the common code must segfault or
> > error badly if SYNC has not been use in all cases even on cache
> > coherent arch. The device driver sync callback can still be a
> > no operation. But i think that we need to insist strongly on a
> > proper sync call being made (and we should forbid empty area
> > sync call). This would be the only way to make sure userspace
> > behave properly as otherwise we endup in the situation you were
> > describing above, where the app is design on a cache coherent
> > arch and works fine there but broke in subtle way on non cache
> > coherent arch and app developer is clueless of why.
> >
> > I just do not trust userspace.
> I agree and ideally i'd want it this way as well. The question is, is it
> possible? Can this be done without a fault() handler in the generic
> kernel code?
> 
> >
> >> 4) The use-case stated above clearly shows the benefit of a
> >> 2-dimensional sync interface (we want to sync the 10x10 region), but
> >> what if someone in the future wants to use this interface for a 3D
> >> texture? Will a 2D sync suffice? Can we make the SYNC interface
> >> extendable in a way that an enum sync_type member defines the layout of
> >> the argument, and initially we implement only 1d, 2d sync, leaving 3d
> >> for the future?
> >>
> >> Also, I agree there is probably no good way to generically implement an
> >> error if SYNC has not been called. That needs to be left as an option to
> >> drivers.
> > I think there is, just forbid any further use of the dma buffer, mark
> > it as invalid and printk a big error. Userspace app developer will
> > quickly see that something is wrong and looking at kernel log should
> > explain why.
> 
> The problem is if someone calls mmap() and then decides to not access
> the buffer before munmap() or GPU usage. How do we recognize that case
> and separate it from a CPU access occured outside a sync? We could, as
> mentioned above have a fault() handler in the generic kernel code and
> make sure drivers don't populate PTEs until the first fault(). Another
> option would be to scan all PTEs for an "accessed" flag, but I'm not
> even sure all CPU architectures have such a flag?
> 
> But there might be a simpler solution that I have overlooked?

All architectures have a dirty flag, so you could check that, as all
we really care about is CPU write access. So all you need is to clear
the dirty bit after each successful GPU command-stream ioctl and to
check that no dirty bit is set in a region not covered by a flush().
Note that the clear and the check can happen in the same function, as
part of buffer validation for the GPU command stream.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: about mmap dma-buf and sync
  2015-08-20 20:34       ` Jerome Glisse
@ 2015-08-21  5:25         ` Thomas Hellstrom
  2015-08-21 10:28           ` Lucas Stach
  2015-08-21 13:32           ` Jerome Glisse
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-21  5:25 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Thomas Hellstrom, dri-devel

On 08/20/2015 10:34 PM, Jerome Glisse wrote:
> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
>>>> Hi, Tiago!
>>>>
>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>>>
>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>>>> in the archives. An attempt to explain the situation from the vmwgfx
>>>> perspective.
>>>>
>>>> The fact that the interface is generic means that people will start
>>>> using it for the zero-copy case. There has been a couple of more or less
>>>> hackish attempts to do this before, and if it's a _driver_ interface we
>>>> don't need to be that careful but if it is a _generic_ interface we need
>>>> to be very careful to make it fit *all* the hardware out there and that
>>>> we make all potential users use the interface in a way that conforms
>>>> with the interface specification.
>>>>
>>>> What will happen otherwise is that apps written for coherent fast
>>>> hardware might, for example, ignore calling the SYNC api, just because
>>>> the app writer only cared about his own hardware on which the app works
>>>> fine. That would fail miserably if the same app was run on incoherent
>>>> hardware, or the incoherent hardware driver maintainers would be forced
>>>> to base an implementation on page-faults which would be very slow.
>>>>
>>>> So assume the following use case: An app updates a 10x10 area using the
>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>>>> texturing. On some hardware the dma-buf might be tiled in a very
>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>>>> write and a DMA back again after the write, before GPU usage. On the
>>>> tiled architecture the SYNC operation must untile before CPU access and
>>>> probably tile again before GPU access.
>>>>
>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>>>> before writing, and then again after writing. If the app forgot to call
>>>> SYNC we must error.
>>>>
>>>> So to summarize, the fact that the interface is generic IMO means:
>>>>
>>>> 1) Any user must be able to make valid assumptions about the internal
>>>> format of the dma-buf. (untiled, color format, stride etc.)
>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
>>>> architectures, the SYNC is a NULL operation anyway, and that should be
>>>> documented somewhere so that maintainers of drivers of uncoherent
>>>> architectures have somewhere to point their fingers.
>>>> 3) Driver-specific implementations must be allowed to error (segfault)
>>>> if SYNC has not been used.
>>> I think here you are too lax, the common code must segfault or
>>> error badly if SYNC has not been use in all cases even on cache
>>> coherent arch. The device driver sync callback can still be a
>>> no operation. But i think that we need to insist strongly on a
>>> proper sync call being made (and we should forbid empty area
>>> sync call). This would be the only way to make sure userspace
>>> behave properly as otherwise we endup in the situation you were
>>> describing above, where the app is design on a cache coherent
>>> arch and works fine there but broke in subtle way on non cache
>>> coherent arch and app developer is clueless of why.
>>>
>>> I just do not trust userspace.
>> I agree and ideally i'd want it this way as well. The question is, is it
>> possible? Can this be done without a fault() handler in the generic
>> kernel code?
>>
>>>> 4) The use-case stated above clearly shows the benefit of a
>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
>>>> what if someone in the future wants to use this interface for a 3D
>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
>>>> extendable in a way that an enum sync_type member defines the layout of
>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
>>>> for the future?
>>>>
>>>> Also, I agree there is probably no good way to generically implement an
>>>> error if SYNC has not been called. That needs to be left as an option to
>>>> drivers.
>>> I think there is, just forbid any further use of the dma buffer, mark
>>> it as invalid and printk a big error. Userspace app developer will
>>> quickly see that something is wrong and looking at kernel log should
>>> explain why.
>> The problem is if someone calls mmap() and then decides to not access
>> the buffer before munmap() or GPU usage. How do we recognize that case
>> and separate it from a CPU access occured outside a sync? We could, as
>> mentioned above have a fault() handler in the generic kernel code and
>> make sure drivers don't populate PTEs until the first fault(). Another
>> option would be to scan all PTEs for an "accessed" flag, but I'm not
>> even sure all CPU architectures have such a flag?
>>
>> But there might be a simpler solution that I have overlooked?
> All arch have a dirty flag, so you could check that, as all we
> really care about is CPU write access. So all you need is clearing
> the dirty bit after each successfull GPU command stream ioctl
> and checking that no dirty bit is set in a region not cover by
> a flush(). Note that the clear and check can happen in the same
> function as part of buffer validation for the GPU command stream.

Actually, I do think we care about reading as well, since reading
without flushing anything written by the GPU will
yield garbage on a non-coherent architecture. Instead we might check for
the PAGE_ACCESSED bit.

So the full version of this would keep track of all synced regions
and, at buffer validation or unmap time, error if there is a page
completely outside the union of all synced regions in any VMA
belonging to the dma-buf address space. It would then unmark all
accessed PTEs and invalidate TLBs accordingly, typically causing a
global TLB flush. Yeah, we could probably put together a helper
function that does this, but it will be expensive to run, and the
whole point of this API is to improve performance.
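A userspace model of that check (hypothetical names; the kernel
version would test and clear the accessed/young bits in the PTEs and
flush TLBs, which is where the cost comes from):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 16

struct region { int first, last; };  /* inclusive page range of one SYNC */

static bool accessed[NPAGES];        /* model of the PTE accessed bit */

/* Any CPU access, read or write, marks the page accessed. */
static void cpu_touch(int page) { accessed[page] = true; }

static bool in_union(int page, const struct region *r, int n)
{
    for (int i = 0; i < n; i++)
        if (page >= r[i].first && page <= r[i].last)
            return true;
    return false;
}

/*
 * At validation (or unmap) time: error if any accessed page lies
 * outside the union of all synced regions, then clear the accessed
 * bits - in the kernel this would require a (typically global) TLB
 * flush so the hardware sets the bits again on the next access.
 */
static bool check_access(const struct region *synced, int n)
{
    bool ok = true;

    for (int p = 0; p < NPAGES; p++)
        if (accessed[p] && !in_union(p, synced, n))
            ok = false;

    memset(accessed, 0, sizeof(accessed));
    return ok;
}
```

Unlike a dirty-bit check, this also catches un-synced reads of data
the GPU may have written.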

IMO we should probably do this and, after relevant performance
testing, decide whether to expose a way to turn it off, either as a
compile-time option or at run-time, like a flag in the SYNC ioctl.

/Thomas

>
> Cheers,
> Jérôme


* Re: about mmap dma-buf and sync
  2015-08-21  5:25         ` Thomas Hellstrom
@ 2015-08-21 10:28           ` Lucas Stach
  2015-08-21 10:51             ` Thomas Hellstrom
  2015-08-21 13:32           ` Jerome Glisse
  1 sibling, 1 reply; 42+ messages in thread
From: Lucas Stach @ 2015-08-21 10:28 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

Am Freitag, den 21.08.2015, 07:25 +0200 schrieb Thomas Hellstrom:
> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
> > On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
> >> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
> >>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
> >>>> Hi, Tiago!
> >>>>
> >>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
> >>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
> >>>>>
> >>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
> >>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
> >>>> in the archives. An attempt to explain the situation from the vmwgfx
> >>>> perspective.
> >>>>
> >>>> The fact that the interface is generic means that people will start
> >>>> using it for the zero-copy case. There has been a couple of more or less
> >>>> hackish attempts to do this before, and if it's a _driver_ interface we
> >>>> don't need to be that careful but if it is a _generic_ interface we need
> >>>> to be very careful to make it fit *all* the hardware out there and that
> >>>> we make all potential users use the interface in a way that conforms
> >>>> with the interface specification.
> >>>>
> >>>> What will happen otherwise is that apps written for coherent fast
> >>>> hardware might, for example, ignore calling the SYNC api, just because
> >>>> the app writer only cared about his own hardware on which the app works
> >>>> fine. That would fail miserably if the same app was run on incoherent
> >>>> hardware, or the incoherent hardware driver maintainers would be forced
> >>>> to base an implementation on page-faults which would be very slow.
> >>>>
> >>>> So assume the following use case: An app updates a 10x10 area using the
> >>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
> >>>> texturing. On some hardware the dma-buf might be tiled in a very
> >>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
> >>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
> >>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
> >>>> write and a DMA back again after the write, before GPU usage. On the
> >>>> tiled architecture the SYNC operation must untile before CPU access and
> >>>> probably tile again before GPU access.
> >>>>
> >>>> If we now have a one-dimensional SYNC api, in this particular case we'd
> >>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
> >>>> before writing, and then again after writing. If the app forgot to call
> >>>> SYNC we must error.
> >>>>
> >>>> So to summarize, the fact that the interface is generic IMO means:
> >>>>
> >>>> 1) Any user must be able to make valid assumptions about the internal
> >>>> format of the dma-buf. (untiled, color format, stride etc.)
> >>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
> >>>> architectures, the SYNC is a NULL operation anyway, and that should be
> >>>> documented somewhere so that maintainers of drivers of uncoherent
> >>>> architectures have somewhere to point their fingers.
> >>>> 3) Driver-specific implementations must be allowed to error (segfault)
> >>>> if SYNC has not been used.
> >>> I think here you are too lax, the common code must segfault or
> >>> error badly if SYNC has not been use in all cases even on cache
> >>> coherent arch. The device driver sync callback can still be a
> >>> no operation. But i think that we need to insist strongly on a
> >>> proper sync call being made (and we should forbid empty area
> >>> sync call). This would be the only way to make sure userspace
> >>> behave properly as otherwise we endup in the situation you were
> >>> describing above, where the app is design on a cache coherent
> >>> arch and works fine there but broke in subtle way on non cache
> >>> coherent arch and app developer is clueless of why.
> >>>
> >>> I just do not trust userspace.
> >> I agree and ideally i'd want it this way as well. The question is, is it
> >> possible? Can this be done without a fault() handler in the generic
> >> kernel code?
> >>
> >>>> 4) The use-case stated above clearly shows the benefit of a
> >>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
> >>>> what if someone in the future wants to use this interface for a 3D
> >>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
> >>>> extendable in a way that an enum sync_type member defines the layout of
> >>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
> >>>> for the future?
> >>>>
> >>>> Also, I agree there is probably no good way to generically implement an
> >>>> error if SYNC has not been called. That needs to be left as an option to
> >>>> drivers.
> >>> I think there is, just forbid any further use of the dma buffer, mark
> >>> it as invalid and printk a big error. Userspace app developer will
> >>> quickly see that something is wrong and looking at kernel log should
> >>> explain why.
> >> The problem is if someone calls mmap() and then decides to not access
> >> the buffer before munmap() or GPU usage. How do we recognize that case
> >> and separate it from a CPU access occured outside a sync? We could, as
> >> mentioned above have a fault() handler in the generic kernel code and
> >> make sure drivers don't populate PTEs until the first fault(). Another
> >> option would be to scan all PTEs for an "accessed" flag, but I'm not
> >> even sure all CPU architectures have such a flag?
> >>
> >> But there might be a simpler solution that I have overlooked?
> > All arch have a dirty flag, so you could check that, as all we
> > really care about is CPU write access. So all you need is clearing
> > the dirty bit after each successfull GPU command stream ioctl
> > and checking that no dirty bit is set in a region not cover by
> > a flush(). Note that the clear and check can happen in the same
> > function as part of buffer validation for the GPU command stream.
> 
> Actually, I do think we care about reading as well, since reading
> without flushing anything written by the GPU will
> yield garbage on a non-coherent architecture. Instead we might check for
> the PAGE_ACCESSED bit.
> 
> So the full version of this would keep track of all synced regions and
> at buffer validation time, or unmap time error if there
> is a page completely outside the union of all synced regions in any VMA
> belonging to the dma-buf address space, then unmark all accessed PTEs
> and invalidate TLBs accordingly, typically causing a global TLB flush.
> Yeah, we could probably put together a helper function that does this,
> but it will be expensive to run and the whole point of this API is to be
> able to improve performance.
> 
> IMO we should probably do this and after relevant performance testing
> decide whether to either expose a way to turn it off, either using a
> compile-time option or run-time, like a flag in the SYNC ioctl.
> 
Can't we just specify that any access without a SYNC may yield undefined
results? A segfault is one of the nicer reactions in that case and
totally valid.

I don't think we want to slow down the common path by PTE fiddling just
to validate userspace input. But it may be possible to have a debug
option where we just expose an explicit non-coherent copy of the dma-buf
through mmap and only copy data to the real buffer on SYNC. Then specify
that an application must work with this option enabled, or it isn't
conforming to the spec.
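Such a debug option could behave like this minimal bounce-buffer model
(hypothetical names; the real thing would sit behind the dma-buf mmap
path): CPU writes land in a shadow copy and only reach the buffer the
GPU sees when the range is explicitly SYNCed.

```c
#include <assert.h>
#include <string.h>

#define BUF_SIZE 64

static unsigned char real_buf[BUF_SIZE];    /* what the GPU sees */
static unsigned char shadow_buf[BUF_SIZE];  /* what mmap() hands the CPU */

/* CPU writes land in the shadow copy only. */
static void cpu_write(int off, unsigned char val)
{
    shadow_buf[off] = val;
}

/* SYNC copies exactly the flushed range into the real buffer. */
static void sync_flush(int off, int len)
{
    memcpy(real_buf + off, shadow_buf + off, len);
}
```

An application that forgets a SYNC sees stale data on the GPU side
even on coherent hardware, so the bug surfaces everywhere instead of
only on non-coherent machines.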

Regards,
Lucas

-- 
Pengutronix e.K.             | Lucas Stach                 |
Industrial Linux Solutions   | http://www.pengutronix.de/  |


* Re: about mmap dma-buf and sync
  2015-08-21 10:28           ` Lucas Stach
@ 2015-08-21 10:51             ` Thomas Hellstrom
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-21 10:51 UTC (permalink / raw)
  To: Lucas Stach; +Cc: dri-devel

On 08/21/2015 12:28 PM, Lucas Stach wrote:
> Am Freitag, den 21.08.2015, 07:25 +0200 schrieb Thomas Hellstrom:
>> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
>>> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
>>>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
>>>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
>>>>>> Hi, Tiago!
>>>>>>
>>>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>>>>>
>>>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>>>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>>>>>> in the archives. An attempt to explain the situation from the vmwgfx
>>>>>> perspective.
>>>>>>
>>>>>> The fact that the interface is generic means that people will start
>>>>>> using it for the zero-copy case. There has been a couple of more or less
>>>>>> hackish attempts to do this before, and if it's a _driver_ interface we
>>>>>> don't need to be that careful but if it is a _generic_ interface we need
>>>>>> to be very careful to make it fit *all* the hardware out there and that
>>>>>> we make all potential users use the interface in a way that conforms
>>>>>> with the interface specification.
>>>>>>
>>>>>> What will happen otherwise is that apps written for coherent fast
>>>>>> hardware might, for example, ignore calling the SYNC api, just because
>>>>>> the app writer only cared about his own hardware on which the app works
>>>>>> fine. That would fail miserably if the same app was run on incoherent
>>>>>> hardware, or the incoherent hardware driver maintainers would be forced
>>>>>> to base an implementation on page-faults which would be very slow.
>>>>>>
>>>>>> So assume the following use case: An app updates a 10x10 area using the
>>>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>>>>>> texturing. On some hardware the dma-buf might be tiled in a very
>>>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>>>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>>>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>>>>>> write and a DMA back again after the write, before GPU usage. On the
>>>>>> tiled architecture the SYNC operation must untile before CPU access and
>>>>>> probably tile again before GPU access.
>>>>>>
>>>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
>>>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>>>>>> before writing, and then again after writing. If the app forgot to call
>>>>>> SYNC we must error.
>>>>>>
>>>>>> So to summarize, the fact that the interface is generic IMO means:
>>>>>>
>>>>>> 1) Any user must be able to make valid assumptions about the internal
>>>>>> format of the dma-buf. (untiled, color format, stride etc.)
>>>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
>>>>>> architectures, the SYNC is a NULL operation anyway, and that should be
>>>>>> documented somewhere so that maintainers of drivers of uncoherent
>>>>>> architectures have somewhere to point their fingers.
>>>>>> 3) Driver-specific implementations must be allowed to error (segfault)
>>>>>> if SYNC has not been used.
>>>>> I think here you are too lax, the common code must segfault or
>>>>> error badly if SYNC has not been use in all cases even on cache
>>>>> coherent arch. The device driver sync callback can still be a
>>>>> no operation. But i think that we need to insist strongly on a
>>>>> proper sync call being made (and we should forbid empty area
>>>>> sync call). This would be the only way to make sure userspace
>>>>> behave properly as otherwise we endup in the situation you were
>>>>> describing above, where the app is design on a cache coherent
>>>>> arch and works fine there but broke in subtle way on non cache
>>>>> coherent arch and app developer is clueless of why.
>>>>>
>>>>> I just do not trust userspace.
>>>> I agree and ideally i'd want it this way as well. The question is, is it
>>>> possible? Can this be done without a fault() handler in the generic
>>>> kernel code?
>>>>
>>>>>> 4) The use-case stated above clearly shows the benefit of a
>>>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
>>>>>> what if someone in the future wants to use this interface for a 3D
>>>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
>>>>>> extendable in a way that an enum sync_type member defines the layout of
>>>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
>>>>>> for the future?
>>>>>>
>>>>>> Also, I agree there is probably no good way to generically implement an
>>>>>> error if SYNC has not been called. That needs to be left as an option to
>>>>>> drivers.
>>>>> I think there is, just forbid any further use of the dma buffer, mark
>>>>> it as invalid and printk a big error. Userspace app developer will
>>>>> quickly see that something is wrong and looking at kernel log should
>>>>> explain why.
>>>> The problem is if someone calls mmap() and then decides to not access
>>>> the buffer before munmap() or GPU usage. How do we recognize that case
>>>> and separate it from a CPU access occured outside a sync? We could, as
>>>> mentioned above have a fault() handler in the generic kernel code and
>>>> make sure drivers don't populate PTEs until the first fault(). Another
>>>> option would be to scan all PTEs for an "accessed" flag, but I'm not
>>>> even sure all CPU architectures have such a flag?
>>>>
>>>> But there might be a simpler solution that I have overlooked?
>>> All arch have a dirty flag, so you could check that, as all we
>>> really care about is CPU write access. So all you need is clearing
>>> the dirty bit after each successfull GPU command stream ioctl
>>> and checking that no dirty bit is set in a region not cover by
>>> a flush(). Note that the clear and check can happen in the same
>>> function as part of buffer validation for the GPU command stream.
>> Actually, I do think we care about reading as well, since reading
>> without flushing anything written by the GPU will
>> yield garbage on a non-coherent architecture. Instead we might check for
>> the PAGE_ACCESSED bit.
>>
>> So the full version of this would keep track of all synced regions and
>> at buffer validation time, or unmap time error if there
>> is a page completely outside the union of all synced regions in any VMA
>> belonging to the dma-buf address space, then unmark all accessed PTEs
>> and invalidate TLBs accordingly, typically causing a global TLB flush.
>> Yeah, we could probably put together a helper function that does this,
>> but it will be expensive to run and the whole point of this API is to be
>> able to improve performance.
>>
>> IMO we should probably do this and after relevant performance testing
>> decide whether to either expose a way to turn it off, either using a
>> compile-time option or run-time, like a flag in the SYNC ioctl.
>>
> Can't we just specify that any access without a SYNC may yield undefined
> results? A segfault is one of the nicer reactions in that case and
> totally valid.

But that will, as Jerome points out, probably still yield valid
results on common coherent hardware, and the lazy app writer would be
clueless about, or wouldn't care, why it doesn't work on other
hardware.

>
> I don't think we want to slow down the common path by PTE fiddling just
> to validate userspace input. But it may be possible to have a debug
> option where we just expose an explicit non-coherent copy of the dma-buf
> through mmap and only copy data to the real buffer on SYNC. Then specify
> that an application must work with this option enabled, or it isn't
> conforming to the spec.

Another option that wouldn't degrade performance much would be to
require drivers not to populate PTEs until first access, and then
check for sync in that first access (we wrap the fault() handler).
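A userspace model of that idea (hypothetical names; in the kernel the
driver would leave the PTEs unpopulated and a wrapped fault() handler
would do the check): only the first access to each page is checked,
since after that the PTE is populated and no further faults occur.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

static bool populated[NPAGES];  /* PTE present? left clear by the driver */
static bool in_sync[NPAGES];    /* page currently inside a SYNC section  */

static void sync_begin(int first, int last)
{
    for (int p = first; p <= last; p++)
        in_sync[p] = true;
}

/*
 * Wrapped fault(): runs only on the first access to a page, because
 * the PTE is not populated before that. A first access outside a
 * SYNC section fails (the real handler would deliver SIGBUS).
 */
static bool fault(int page)
{
    if (!in_sync[page])
        return false;
    populated[page] = true;
    return true;
}

static bool cpu_access(int page)
{
    return populated[page] || fault(page);
}
```

The trade-off is visible in the model: once a page is populated, later
un-synced accesses go unchecked, so only the first misuse is caught.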

>
> Regards,
> Lucas
>
Thomas


* Re: about mmap dma-buf and sync
  2015-08-21  5:25         ` Thomas Hellstrom
  2015-08-21 10:28           ` Lucas Stach
@ 2015-08-21 13:32           ` Jerome Glisse
  2015-08-21 14:15             ` Thomas Hellstrom
  1 sibling, 1 reply; 42+ messages in thread
From: Jerome Glisse @ 2015-08-21 13:32 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Fri, Aug 21, 2015 at 07:25:07AM +0200, Thomas Hellstrom wrote:
> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
> > On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
> >> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
> >>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
> >>>> Hi, Tiago!
> >>>>
> >>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
> >>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
> >>>>>
> >>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
> >>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
> >>>> in the archives. An attempt to explain the situation from the vmwgfx
> >>>> perspective.
> >>>>
> >>>> The fact that the interface is generic means that people will start
> >>>> using it for the zero-copy case. There has been a couple of more or less
> >>>> hackish attempts to do this before, and if it's a _driver_ interface we
> >>>> don't need to be that careful but if it is a _generic_ interface we need
> >>>> to be very careful to make it fit *all* the hardware out there and that
> >>>> we make all potential users use the interface in a way that conforms
> >>>> with the interface specification.
> >>>>
> >>>> What will happen otherwise is that apps written for coherent fast
> >>>> hardware might, for example, ignore calling the SYNC api, just because
> >>>> the app writer only cared about his own hardware on which the app works
> >>>> fine. That would fail miserably if the same app was run on incoherent
> >>>> hardware, or the incoherent hardware driver maintainers would be forced
> >>>> to base an implementation on page-faults which would be very slow.
> >>>>
> >>>> So assume the following use case: An app updates a 10x10 area using the
> >>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
> >>>> texturing. On some hardware the dma-buf might be tiled in a very
> >>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
> >>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
> >>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
> >>>> write and a DMA back again after the write, before GPU usage. On the
> >>>> tiled architecture the SYNC operation must untile before CPU access and
> >>>> probably tile again before GPU access.
> >>>>
> >>>> If we now have a one-dimensional SYNC api, in this particular case we'd
> >>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
> >>>> before writing, and then again after writing. If the app forgot to call
> >>>> SYNC we must error.
> >>>>
> >>>> So to summarize, the fact that the interface is generic IMO means:
> >>>>
> >>>> 1) Any user must be able to make valid assumptions about the internal
> >>>> format of the dma-buf. (untiled, color format, stride etc.)
> >>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
> >>>> architectures, the SYNC is a NULL operation anyway, and that should be
> >>>> documented somewhere so that maintainers of drivers of uncoherent
> >>>> architectures have somewhere to point their fingers.
> >>>> 3) Driver-specific implementations must be allowed to error (segfault)
> >>>> if SYNC has not been used.
> >>> I think here you are too lax, the common code must segfault or
> >>> error badly if SYNC has not been use in all cases even on cache
> >>> coherent arch. The device driver sync callback can still be a
> >>> no operation. But i think that we need to insist strongly on a
> >>> proper sync call being made (and we should forbid empty area
> >>> sync call). This would be the only way to make sure userspace
> >>> behave properly as otherwise we endup in the situation you were
> >>> describing above, where the app is design on a cache coherent
> >>> arch and works fine there but broke in subtle way on non cache
> >>> coherent arch and app developer is clueless of why.
> >>>
> >>> I just do not trust userspace.
> >> I agree and ideally i'd want it this way as well. The question is, is it
> >> possible? Can this be done without a fault() handler in the generic
> >> kernel code?
> >>
> >>>> 4) The use-case stated above clearly shows the benefit of a
> >>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
> >>>> what if someone in the future wants to use this interface for a 3D
> >>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
> >>>> extendable in a way that an enum sync_type member defines the layout of
> >>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
> >>>> for the future?
> >>>>
> >>>> Also, I agree there is probably no good way to generically implement an
> >>>> error if SYNC has not been called. That needs to be left as an option to
> >>>> drivers.
> >>> I think there is, just forbid any further use of the dma buffer, mark
> >>> it as invalid and printk a big error. Userspace app developer will
> >>> quickly see that something is wrong and looking at kernel log should
> >>> explain why.
> >> The problem is if someone calls mmap() and then decides to not access
> >> the buffer before munmap() or GPU usage. How do we recognize that case
> >> and separate it from a CPU access occured outside a sync? We could, as
> >> mentioned above have a fault() handler in the generic kernel code and
> >> make sure drivers don't populate PTEs until the first fault(). Another
> >> option would be to scan all PTEs for an "accessed" flag, but I'm not
> >> even sure all CPU architectures have such a flag?
> >>
> >> But there might be a simpler solution that I have overlooked?
> > All arch have a dirty flag, so you could check that, as all we
> > really care about is CPU write access. So all you need is clearing
> > the dirty bit after each successfull GPU command stream ioctl
> > and checking that no dirty bit is set in a region not cover by
> > a flush(). Note that the clear and check can happen in the same
> > function as part of buffer validation for the GPU command stream.
> 
> Actually, I do think we care about reading as well, since reading
> without flushing anything written by the GPU will
> yield garbage on a non-coherent architecture. Instead we might check for
> the PAGE_ACCESSED bit.

I do not think read access after GPU use matters. If people think it
does, then it is simple: at buffer validation we invalidate the
buffer's CPU page table mappings, and the prepare-access ioctl
populates them again after checking that the GPU is flushed.

> 
> So the full version of this would keep track of all synced regions and
> at buffer validation time, or unmap time error if there
> is a page completely outside the union of all synced regions in any VMA
> belonging to the dma-buf address space, then unmark all accessed PTEs
> and invalidate TLBs accordingly, typically causing a global TLB flush.
> Yeah, we could probably put together a helper function that does this,
> but it will be expensive to run and the whole point of this API is to be
> able to improve performance.

No need for such complexity: the sync ioctl would go over the page
table and clear the dirty bit for every page in the range being
synchronized. So once you get to the command ioctl you know that all
dirty bits must be clear. Note this is only a page table walk in all
cases; I do not consider that to be that expensive.

> 
> IMO we should probably do this and after relevant performance testing
> decide whether to either expose a way to turn it off, either using a
> compile-time option or run-time, like a flag in the SYNC ioctl.

A flag in the sync ioctl is bad IMHO, as I am sure userspace will abuse it.
Given the original description, where the buffer object is allocated by a
privileged userspace process, I would rather add a device-specific ioctl,
so that privileged userspace which knows about the hardware can call it to
disable sync, allowing cache-coherent platforms to avoid paying the price.
But this would leave the door open for applications to rely on
platform-specific behavior.

I would rather first see a benchmark comparing the cost of doing the sync
check against not doing it. I would like to think that the page table check
would be buried in the noise.

Cheers,
Jérôme
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: about mmap dma-buf and sync
  2015-08-21 13:32           ` Jerome Glisse
@ 2015-08-21 14:15             ` Thomas Hellstrom
  2015-08-21 16:00               ` Jerome Glisse
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-21 14:15 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel

On 08/21/2015 03:32 PM, Jerome Glisse wrote:
> On Fri, Aug 21, 2015 at 07:25:07AM +0200, Thomas Hellstrom wrote:
>> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
>>> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
>>>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
>>>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
>>>>>> Hi, Tiago!
>>>>>>
>>>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>>>>>
>>>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>>>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>>>>>> in the archives. An attempt to explain the situation from the vmwgfx
>>>>>> perspective.
>>>>>>
>>>>>> The fact that the interface is generic means that people will start
>>>>>> using it for the zero-copy case. There has been a couple of more or less
>>>>>> hackish attempts to do this before, and if it's a _driver_ interface we
>>>>>> don't need to be that careful but if it is a _generic_ interface we need
>>>>>> to be very careful to make it fit *all* the hardware out there and that
>>>>>> we make all potential users use the interface in a way that conforms
>>>>>> with the interface specification.
>>>>>>
>>>>>> What will happen otherwise is that apps written for coherent fast
>>>>>> hardware might, for example, ignore calling the SYNC api, just because
>>>>>> the app writer only cared about his own hardware on which the app works
>>>>>> fine. That would fail miserably if the same app was run on incoherent
>>>>>> hardware, or the incoherent hardware driver maintainers would be forced
>>>>>> to base an implementation on page-faults which would be very slow.
>>>>>>
>>>>>> So assume the following use case: An app updates a 10x10 area using the
>>>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>>>>>> texturing. On some hardware the dma-buf might be tiled in a very
>>>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>>>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>>>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>>>>>> write and a DMA back again after the write, before GPU usage. On the
>>>>>> tiled architecture the SYNC operation must untile before CPU access and
>>>>>> probably tile again before GPU access.
>>>>>>
>>>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
>>>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>>>>>> before writing, and then again after writing. If the app forgot to call
>>>>>> SYNC we must error.
>>>>>>
>>>>>> So to summarize, the fact that the interface is generic IMO means:
>>>>>>
>>>>>> 1) Any user must be able to make valid assumptions about the internal
>>>>>> format of the dma-buf. (untiled, color format, stride etc.)
>>>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
>>>>>> architectures, the SYNC is a NULL operation anyway, and that should be
>>>>>> documented somewhere so that maintainers of drivers of uncoherent
>>>>>> architectures have somewhere to point their fingers.
>>>>>> 3) Driver-specific implementations must be allowed to error (segfault)
>>>>>> if SYNC has not been used.
>>>>> I think here you are too lax, the common code must segfault or
>>>>> error badly if SYNC has not been use in all cases even on cache
>>>>> coherent arch. The device driver sync callback can still be a
>>>>> no operation. But i think that we need to insist strongly on a
>>>>> proper sync call being made (and we should forbid empty area
>>>>> sync call). This would be the only way to make sure userspace
>>>>> behave properly as otherwise we endup in the situation you were
>>>>> describing above, where the app is design on a cache coherent
>>>>> arch and works fine there but broke in subtle way on non cache
>>>>> coherent arch and app developer is clueless of why.
>>>>>
>>>>> I just do not trust userspace.
>>>> I agree and ideally i'd want it this way as well. The question is, is it
>>>> possible? Can this be done without a fault() handler in the generic
>>>> kernel code?
>>>>
>>>>>> 4) The use-case stated above clearly shows the benefit of a
>>>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
>>>>>> what if someone in the future wants to use this interface for a 3D
>>>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
>>>>>> extendable in a way that an enum sync_type member defines the layout of
>>>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
>>>>>> for the future?
>>>>>>
>>>>>> Also, I agree there is probably no good way to generically implement an
>>>>>> error if SYNC has not been called. That needs to be left as an option to
>>>>>> drivers.
>>>>> I think there is, just forbid any further use of the dma buffer, mark
>>>>> it as invalid and printk a big error. Userspace app developer will
>>>>> quickly see that something is wrong and looking at kernel log should
>>>>> explain why.
>>>> The problem is if someone calls mmap() and then decides to not access
>>>> the buffer before munmap() or GPU usage. How do we recognize that case
>>>> and separate it from a CPU access occured outside a sync? We could, as
>>>> mentioned above have a fault() handler in the generic kernel code and
>>>> make sure drivers don't populate PTEs until the first fault(). Another
>>>> option would be to scan all PTEs for an "accessed" flag, but I'm not
>>>> even sure all CPU architectures have such a flag?
>>>>
>>>> But there might be a simpler solution that I have overlooked?
>>> All arch have a dirty flag, so you could check that, as all we
>>> really care about is CPU write access. So all you need is clearing
>>> the dirty bit after each successfull GPU command stream ioctl
>>> and checking that no dirty bit is set in a region not cover by
>>> a flush(). Note that the clear and check can happen in the same
>>> function as part of buffer validation for the GPU command stream.
>> Actually, I do think we care about reading as well, since reading
>> without flushing anything written by the GPU will
>> yield garbage on a non-coherent architecture. Instead we might check for
>> the PAGE_ACCESSED bit.
> I do not think the read access after GPU matter. If people think it does
> then it is simple, at buffer validation in buffer mapping we invalidate
> CPU page table mapping and the prepare access ioctl populate them after
> checking that GPU is flush.
>
>> So the full version of this would keep track of all synced regions and
>> at buffer validation time, or unmap time error if there
>> is a page completely outside the union of all synced regions in any VMA
>> belonging to the dma-buf address space, then unmark all accessed PTEs
>> and invalidate TLBs accordingly, typically causing a global TLB flush.
>> Yeah, we could probably put together a helper function that does this,
>> but it will be expensive to run and the whole point of this API is to be
>> able to improve performance.
> No need for such complexity, the sync ioctl would go over the page table
> and clear dirty bit for all page in range that is synchronized. So once
> you get to the command ioctl you know that all dirty bit must be clear.
> Note this is only a page table walk in all case. I do not consider this
> to be that expensive.
>
>
The expensive part is the TLB flushing after the PTE update, which
typically needs to be performed on all cores. Although without a
benchmark I'm not sure how expensive that is. It might be that it doesn't
matter. After all, the vmwgfx and, I believe, the qxl driver both use a
similar approach for fbdev user-space access, but that has never been
considered performance-critical...

/Thomas


* Re: about mmap dma-buf and sync
  2015-08-21 14:15             ` Thomas Hellstrom
@ 2015-08-21 16:00               ` Jerome Glisse
  2015-08-21 22:00                 ` Thomas Hellstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Jerome Glisse @ 2015-08-21 16:00 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Fri, Aug 21, 2015 at 04:15:53PM +0200, Thomas Hellstrom wrote:
> On 08/21/2015 03:32 PM, Jerome Glisse wrote:
> > On Fri, Aug 21, 2015 at 07:25:07AM +0200, Thomas Hellstrom wrote:
> >> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
> >>> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
> >>>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
> >>>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
> >>>>>> Hi, Tiago!
> >>>>>>
> >>>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
> >>>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
> >>>>>>>
> >>>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
> >>>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
> >>>>>> in the archives. An attempt to explain the situation from the vmwgfx
> >>>>>> perspective.
> >>>>>>
> >>>>>> The fact that the interface is generic means that people will start
> >>>>>> using it for the zero-copy case. There has been a couple of more or less
> >>>>>> hackish attempts to do this before, and if it's a _driver_ interface we
> >>>>>> don't need to be that careful but if it is a _generic_ interface we need
> >>>>>> to be very careful to make it fit *all* the hardware out there and that
> >>>>>> we make all potential users use the interface in a way that conforms
> >>>>>> with the interface specification.
> >>>>>>
> >>>>>> What will happen otherwise is that apps written for coherent fast
> >>>>>> hardware might, for example, ignore calling the SYNC api, just because
> >>>>>> the app writer only cared about his own hardware on which the app works
> >>>>>> fine. That would fail miserably if the same app was run on incoherent
> >>>>>> hardware, or the incoherent hardware driver maintainers would be forced
> >>>>>> to base an implementation on page-faults which would be very slow.
> >>>>>>
> >>>>>> So assume the following use case: An app updates a 10x10 area using the
> >>>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
> >>>>>> texturing. On some hardware the dma-buf might be tiled in a very
> >>>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
> >>>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
> >>>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
> >>>>>> write and a DMA back again after the write, before GPU usage. On the
> >>>>>> tiled architecture the SYNC operation must untile before CPU access and
> >>>>>> probably tile again before GPU access.
> >>>>>>
> >>>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
> >>>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
> >>>>>> before writing, and then again after writing. If the app forgot to call
> >>>>>> SYNC we must error.
> >>>>>>
> >>>>>> So to summarize, the fact that the interface is generic IMO means:
> >>>>>>
> >>>>>> 1) Any user must be able to make valid assumptions about the internal
> >>>>>> format of the dma-buf. (untiled, color format, stride etc.)
> >>>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
> >>>>>> architectures, the SYNC is a NULL operation anyway, and that should be
> >>>>>> documented somewhere so that maintainers of drivers of uncoherent
> >>>>>> architectures have somewhere to point their fingers.
> >>>>>> 3) Driver-specific implementations must be allowed to error (segfault)
> >>>>>> if SYNC has not been used.
> >>>>> I think here you are too lax, the common code must segfault or
> >>>>> error badly if SYNC has not been use in all cases even on cache
> >>>>> coherent arch. The device driver sync callback can still be a
> >>>>> no operation. But i think that we need to insist strongly on a
> >>>>> proper sync call being made (and we should forbid empty area
> >>>>> sync call). This would be the only way to make sure userspace
> >>>>> behave properly as otherwise we endup in the situation you were
> >>>>> describing above, where the app is design on a cache coherent
> >>>>> arch and works fine there but broke in subtle way on non cache
> >>>>> coherent arch and app developer is clueless of why.
> >>>>>
> >>>>> I just do not trust userspace.
> >>>> I agree and ideally i'd want it this way as well. The question is, is it
> >>>> possible? Can this be done without a fault() handler in the generic
> >>>> kernel code?
> >>>>
> >>>>>> 4) The use-case stated above clearly shows the benefit of a
> >>>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
> >>>>>> what if someone in the future wants to use this interface for a 3D
> >>>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
> >>>>>> extendable in a way that an enum sync_type member defines the layout of
> >>>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
> >>>>>> for the future?
> >>>>>>
> >>>>>> Also, I agree there is probably no good way to generically implement an
> >>>>>> error if SYNC has not been called. That needs to be left as an option to
> >>>>>> drivers.
> >>>>> I think there is, just forbid any further use of the dma buffer, mark
> >>>>> it as invalid and printk a big error. Userspace app developer will
> >>>>> quickly see that something is wrong and looking at kernel log should
> >>>>> explain why.
> >>>> The problem is if someone calls mmap() and then decides to not access
> >>>> the buffer before munmap() or GPU usage. How do we recognize that case
> >>>> and separate it from a CPU access occured outside a sync? We could, as
> >>>> mentioned above have a fault() handler in the generic kernel code and
> >>>> make sure drivers don't populate PTEs until the first fault(). Another
> >>>> option would be to scan all PTEs for an "accessed" flag, but I'm not
> >>>> even sure all CPU architectures have such a flag?
> >>>>
> >>>> But there might be a simpler solution that I have overlooked?
> >>> All arch have a dirty flag, so you could check that, as all we
> >>> really care about is CPU write access. So all you need is clearing
> >>> the dirty bit after each successfull GPU command stream ioctl
> >>> and checking that no dirty bit is set in a region not cover by
> >>> a flush(). Note that the clear and check can happen in the same
> >>> function as part of buffer validation for the GPU command stream.
> >> Actually, I do think we care about reading as well, since reading
> >> without flushing anything written by the GPU will
> >> yield garbage on a non-coherent architecture. Instead we might check for
> >> the PAGE_ACCESSED bit.
> > I do not think the read access after GPU matter. If people think it does
> > then it is simple, at buffer validation in buffer mapping we invalidate
> > CPU page table mapping and the prepare access ioctl populate them after
> > checking that GPU is flush.
> >
> >> So the full version of this would keep track of all synced regions and
> >> at buffer validation time, or unmap time error if there
> >> is a page completely outside the union of all synced regions in any VMA
> >> belonging to the dma-buf address space, then unmark all accessed PTEs
> >> and invalidate TLBs accordingly, typically causing a global TLB flush.
> >> Yeah, we could probably put together a helper function that does this,
> >> but it will be expensive to run and the whole point of this API is to be
> >> able to improve performance.
> > No need for such complexity, the sync ioctl would go over the page table
> > and clear dirty bit for all page in range that is synchronized. So once
> > you get to the command ioctl you know that all dirty bit must be clear.
> > Note this is only a page table walk in all case. I do not consider this
> > to be that expensive.
> >
> >
> The expensive part is the TLB flushing after the PTE update, which
> typically needs to be performed on all cores. Although without a
> benchmark I'm not sure how expensive. It might be that it doesn't
> matter. After all, the vmwgfx- and I believe the qxl driver both use a
> similar approach for fbdev user-space access, but that has never been
> cosidered performance-critical...

There is no need to flush the TLB! At least not for the write case.
Clearing the dirty bit is cache coherent; TLBs are about caching the page
table directory tree. To convince yourself, look at page_mkclean_one()
in mm/rmap.c which, by the way, is almost what we want (remove the mmu
notifier part and the write-protection part, and force a real CPU cache
flush if needed based on a device flag).

Though this has been the topic of numerous discussions about how exactly
hardware implements the dirty bit. So far, for x86, the impression is that
it behaves like an atomic bit operation, so atomically testing and clearing
the bit would be good enough. I am not sure about ARM but would assume it
does the same thing.

Note that no matter what, there is a race window between the time the GPU
ioctl validates a buffer (checking and clearing dirty bits) and the time
the buffer is effectively used by the GPU. During that window a userspace
program can keep writing to the buffer and bad things will happen. But bad
things will happen on all architectures, because even on a cache-coherent
arch the GPU might see data from the middle of a write.

Now I agree that if we care about the read case then we need a TLB
flush. But this is optimized by the kernel in some ways. For instance,
if the buffer is only CPU-mapped inside the unprivileged process, and
the unprivileged process is not running at the time of the GPU ioctl,
then the TLB flush is a no-op.

Cheers,
Jérôme

* Re: about mmap dma-buf and sync
  2015-08-20 19:27     ` Thomas Hellstrom
  2015-08-20 19:32       ` Thomas Hellstrom
@ 2015-08-21 16:42       ` Rob Clark
  1 sibling, 0 replies; 42+ messages in thread
From: Rob Clark @ 2015-08-21 16:42 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Thu, Aug 20, 2015 at 3:27 PM, Thomas Hellstrom <thomas@shipmail.org> wrote:
> On 08/20/2015 04:33 PM, Rob Clark wrote:
>> On Thu, Aug 20, 2015 at 2:48 AM, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>> Hi, Tiago!
>>>
>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>>
>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>>> in the archives. An attempt to explain the situation from the vmwgfx
>>> perspective.
>>>
>>> The fact that the interface is generic means that people will start
>>> using it for the zero-copy case. There has been a couple of more or less
>>> hackish attempts to do this before, and if it's a _driver_ interface we
>>> don't need to be that careful but if it is a _generic_ interface we need
>>> to be very careful to make it fit *all* the hardware out there and that
>>> we make all potential users use the interface in a way that conforms
>>> with the interface specification.
>>>
>>> What will happen otherwise is that apps written for coherent fast
>>> hardware might, for example, ignore calling the SYNC api, just because
>>> the app writer only cared about his own hardware on which the app works
>>> fine. That would fail miserably if the same app was run on incoherent
>>> hardware, or the incoherent hardware driver maintainers would be forced
>>> to base an implementation on page-faults which would be very slow.
>>>
>>> So assume the following use case: An app updates a 10x10 area using the
>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>>> texturing. On some hardware the dma-buf might be tiled in a very
>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>>> write and a DMA back again after the write, before GPU usage. On the
>>> tiled architecture the SYNC operation must untile before CPU access and
>>> probably tile again before GPU access.
>>>
>>> If we now have a one-dimensional SYNC api, in this particular case we'd
>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>>> before writing, and then again after writing. If the app forgot to call
>>> SYNC we must error.
>> just curious, but couldn't you batch up the 10 10x1 sync's?
>
> Yes that would work up to the first CPU access. Subsequent syncs would
> need to be carried out immediately or all ptes would need to be unmapped
> to detect the next CPU access. Write only syncs could probably be
> batched unconditionally.

hmm, maybe another cpu barrier ioctl?

I mean, if we had a 2d sync API and then needed to update layers in a 3d
or cubemap texture, we'd need to do multiple 2d updates..

but what about instead having something like:

 ioctl(SYNC)
 ioctl(SYNC)
 ioctl(SYNC)
 ioctl(PREP)

  ... cpu access

 ioctl(FINI)

(or something roughly like that)

BR,
-R

> /Thomas

* Re: about mmap dma-buf and sync
  2015-08-21 16:00               ` Jerome Glisse
@ 2015-08-21 22:00                 ` Thomas Hellstrom
  2015-08-24  1:41                   ` Jerome Glisse
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-21 22:00 UTC (permalink / raw)
  To: dri-devel

On 08/21/2015 06:00 PM, Jerome Glisse wrote:
> On Fri, Aug 21, 2015 at 04:15:53PM +0200, Thomas Hellstrom wrote:
>> On 08/21/2015 03:32 PM, Jerome Glisse wrote:
>>> On Fri, Aug 21, 2015 at 07:25:07AM +0200, Thomas Hellstrom wrote:
>>>> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
>>>>> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
>>>>>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
>>>>>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
>>>>>>>> Hi, Tiago!
>>>>>>>>
>>>>>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>>>>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>>>>>>>
>>>>>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>>>>>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>>>>>>>> in the archives. An attempt to explain the situation from the vmwgfx
>>>>>>>> perspective.
>>>>>>>>
>>>>>>>> The fact that the interface is generic means that people will start
>>>>>>>> using it for the zero-copy case. There has been a couple of more or less
>>>>>>>> hackish attempts to do this before, and if it's a _driver_ interface we
>>>>>>>> don't need to be that careful but if it is a _generic_ interface we need
>>>>>>>> to be very careful to make it fit *all* the hardware out there and that
>>>>>>>> we make all potential users use the interface in a way that conforms
>>>>>>>> with the interface specification.
>>>>>>>>
>>>>>>>> What will happen otherwise is that apps written for coherent fast
>>>>>>>> hardware might, for example, ignore calling the SYNC api, just because
>>>>>>>> the app writer only cared about his own hardware on which the app works
>>>>>>>> fine. That would fail miserably if the same app was run on incoherent
>>>>>>>> hardware, or the incoherent hardware driver maintainers would be forced
>>>>>>>> to base an implementation on page-faults which would be very slow.
>>>>>>>>
>>>>>>>> So assume the following use case: An app updates a 10x10 area using the
>>>>>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>>>>>>>> texturing. On some hardware the dma-buf might be tiled in a very
>>>>>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>>>>>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>>>>>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>>>>>>>> write and a DMA back again after the write, before GPU usage. On the
>>>>>>>> tiled architecture the SYNC operation must untile before CPU access and
>>>>>>>> probably tile again before GPU access.
>>>>>>>>
>>>>>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
>>>>>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>>>>>>>> before writing, and then again after writing. If the app forgot to call
>>>>>>>> SYNC we must error.
>>>>>>>>
>>>>>>>> So to summarize, the fact that the interface is generic IMO means:
>>>>>>>>
>>>>>>>> 1) Any user must be able to make valid assumptions about the internal
>>>>>>>> format of the dma-buf. (untiled, color format, stride etc.)
>>>>>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
>>>>>>>> architectures, the SYNC is a NULL operation anyway, and that should be
>>>>>>>> documented somewhere so that maintainers of drivers of uncoherent
>>>>>>>> architectures have somewhere to point their fingers.
>>>>>>>> 3) Driver-specific implementations must be allowed to error (segfault)
>>>>>>>> if SYNC has not been used.
>>>>>>> I think here you are too lax, the common code must segfault or
>>>>>>> error badly if SYNC has not been use in all cases even on cache
>>>>>>> coherent arch. The device driver sync callback can still be a
>>>>>>> no operation. But i think that we need to insist strongly on a
>>>>>>> proper sync call being made (and we should forbid empty area
>>>>>>> sync call). This would be the only way to make sure userspace
>>>>>>> behave properly as otherwise we endup in the situation you were
>>>>>>> describing above, where the app is design on a cache coherent
>>>>>>> arch and works fine there but broke in subtle way on non cache
>>>>>>> coherent arch and app developer is clueless of why.
>>>>>>>
>>>>>>> I just do not trust userspace.
>>>>>> I agree and ideally i'd want it this way as well. The question is, is it
>>>>>> possible? Can this be done without a fault() handler in the generic
>>>>>> kernel code?
>>>>>>
>>>>>>>> 4) The use-case stated above clearly shows the benefit of a
>>>>>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
>>>>>>>> what if someone in the future wants to use this interface for a 3D
>>>>>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
>>>>>>>> extendable in a way that an enum sync_type member defines the layout of
>>>>>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
>>>>>>>> for the future?
>>>>>>>>
>>>>>>>> Also, I agree there is probably no good way to generically implement an
>>>>>>>> error if SYNC has not been called. That needs to be left as an option to
>>>>>>>> drivers.
>>>>>>> I think there is, just forbid any further use of the dma buffer, mark
>>>>>>> it as invalid and printk a big error. Userspace app developer will
>>>>>>> quickly see that something is wrong and looking at kernel log should
>>>>>>> explain why.
>>>>>> The problem is if someone calls mmap() and then decides to not access
>>>>>> the buffer before munmap() or GPU usage. How do we recognize that case
>>>>>> and separate it from a CPU access occured outside a sync? We could, as
>>>>>> mentioned above have a fault() handler in the generic kernel code and
>>>>>> make sure drivers don't populate PTEs until the first fault(). Another
>>>>>> option would be to scan all PTEs for an "accessed" flag, but I'm not
>>>>>> even sure all CPU architectures have such a flag?
>>>>>>
>>>>>> But there might be a simpler solution that I have overlooked?
>>>>> All arch have a dirty flag, so you could check that, as all we
>>>>> really care about is CPU write access. So all you need is clearing
>>>>> the dirty bit after each successfull GPU command stream ioctl
>>>>> and checking that no dirty bit is set in a region not cover by
>>>>> a flush(). Note that the clear and check can happen in the same
>>>>> function as part of buffer validation for the GPU command stream.
>>>> Actually, I do think we care about reading as well, since reading
>>>> without flushing anything written by the GPU will
>>>> yield garbage on a non-coherent architecture. Instead we might check for
>>>> the PAGE_ACCESSED bit.
>>> I do not think the read access after GPU matter. If people think it does
>>> then it is simple, at buffer validation in buffer mapping we invalidate
>>> CPU page table mapping and the prepare access ioctl populate them after
>>> checking that GPU is flush.
>>>
>>>> So the full version of this would keep track of all synced regions and
>>>> at buffer validation time, or unmap time error if there
>>>> is a page completely outside the union of all synced regions in any VMA
>>>> belonging to the dma-buf address space, then unmark all accessed PTEs
>>>> and invalidate TLBs accordingly, typically causing a global TLB flush.
>>>> Yeah, we could probably put together a helper function that does this,
>>>> but it will be expensive to run and the whole point of this API is to be
>>>> able to improve performance.
>>> No need for such complexity, the sync ioctl would go over the page table
>>> and clear dirty bit for all page in range that is synchronized. So once
>>> you get to the command ioctl you know that all dirty bit must be clear.
>>> Note this is only a page table walk in all case. I do not consider this
>>> to be that expensive.
>>>
>>>
>> The expensive part is the TLB flushing after the PTE update, which
>> typically needs to be performed on all cores. Although without a
>> benchmark I'm not sure how expensive. It might be that it doesn't
>> matter. After all, the vmwgfx- and I believe the qxl driver both use a
>> similar approach for fbdev user-space access, but that has never been
>> cosidered performance-critical...
> There is no need to flush the TLB ! At least not for the write case.
> Clearing the dirty bit is cache coherent. TLB are about caching page
> table directory tree. To convince yourself look at page_mkclean_one()
> in mm/rmap.c which, by the way, is almost what we want (remove the mmu
> notifier part and the write protection part and force real CPU cache
> flush if needed base on device flag).

Hmm? Last time I visited this code, page_mkclean_one() was eventually
calling flush_tlb_page() through ptep_clear_flush_notify(). I was under
the impression that you always have to flush the TLB when you clear a
PTE status bit, but typically not when you set it. Are you saying that
that's not the case?
The question as I see it is rather whether you want to flush TLB on a
per-PTE basis or do it as a global flush after a batched PTE status bit
clearing.

/Thomas

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: about mmap dma-buf and sync
  2015-08-20  6:48 ` about mmap dma-buf and sync Thomas Hellstrom
  2015-08-20 14:33   ` Rob Clark
  2015-08-20 14:53   ` Jerome Glisse
@ 2015-08-21 23:06   ` Tiago Vignatti
  2015-08-24  9:50     ` Thomas Hellstrom
  2 siblings, 1 reply; 42+ messages in thread
From: Tiago Vignatti @ 2015-08-21 23:06 UTC (permalink / raw)
  To: Thomas Hellstrom, Daniel Vetter; +Cc: dri-devel

Hi back!

On 08/20/2015 03:48 AM, Thomas Hellstrom wrote:
> Hi, Tiago!

Something the Chrome OS folks asked me today is whether we could 
change the sync API to use a syscall instead. The reason is to 
eventually fit this nicely into their sandbox architecture 
requirements, so that yet another ioctl wouldn't need to be 
white-listed for the unprivileged process.

How does that change the game, for example regarding the details we've 
been discussing here about making the sync mandatory? Could we reuse, 
say, msync(2) for that?

Best Regards,

Tiago

* Re: about mmap dma-buf and sync
  2015-08-21 22:00                 ` Thomas Hellstrom
@ 2015-08-24  1:41                   ` Jerome Glisse
  2015-08-24  9:59                     ` Thomas Hellstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Jerome Glisse @ 2015-08-24  1:41 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel

On Sat, Aug 22, 2015 at 12:00:21AM +0200, Thomas Hellstrom wrote:
> On 08/21/2015 06:00 PM, Jerome Glisse wrote:
> > On Fri, Aug 21, 2015 at 04:15:53PM +0200, Thomas Hellstrom wrote:
> >> On 08/21/2015 03:32 PM, Jerome Glisse wrote:
> >>> On Fri, Aug 21, 2015 at 07:25:07AM +0200, Thomas Hellstrom wrote:
> >>>> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
> >>>>> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
> >>>>>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
> >>>>>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
> >>>>>>>> Hi, Tiago!
> >>>>>>>>
> >>>>>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
> >>>>>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
> >>>>>>>>>
> >>>>>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
> >>>>>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
> >>>>>>>> in the archives. An attempt to explain the situation from the vmwgfx
> >>>>>>>> perspective.
> >>>>>>>>
> >>>>>>>> The fact that the interface is generic means that people will start
> >>>>>>>> using it for the zero-copy case. There has been a couple of more or less
> >>>>>>>> hackish attempts to do this before, and if it's a _driver_ interface we
> >>>>>>>> don't need to be that careful but if it is a _generic_ interface we need
> >>>>>>>> to be very careful to make it fit *all* the hardware out there and that
> >>>>>>>> we make all potential users use the interface in a way that conforms
> >>>>>>>> with the interface specification.
> >>>>>>>>
> >>>>>>>> What will happen otherwise is that apps written for coherent fast
> >>>>>>>> hardware might, for example, ignore calling the SYNC api, just because
> >>>>>>>> the app writer only cared about his own hardware on which the app works
> >>>>>>>> fine. That would fail miserably if the same app was run on incoherent
> >>>>>>>> hardware, or the incoherent hardware driver maintainers would be forced
> >>>>>>>> to base an implementation on page-faults which would be very slow.
> >>>>>>>>
> >>>>>>>> So assume the following use case: An app updates a 10x10 area using the
> >>>>>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
> >>>>>>>> texturing. On some hardware the dma-buf might be tiled in a very
> >>>>>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
> >>>>>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
> >>>>>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
> >>>>>>>> write and a DMA back again after the write, before GPU usage. On the
> >>>>>>>> tiled architecture the SYNC operation must untile before CPU access and
> >>>>>>>> probably tile again before GPU access.
> >>>>>>>>
> >>>>>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
> >>>>>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
> >>>>>>>> before writing, and then again after writing. If the app forgot to call
> >>>>>>>> SYNC we must error.
> >>>>>>>>
> >>>>>>>> So to summarize, the fact that the interface is generic IMO means:
> >>>>>>>>
> >>>>>>>> 1) Any user must be able to make valid assumptions about the internal
> >>>>>>>> format of the dma-buf. (untiled, color format, stride etc.)
> >>>>>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
> >>>>>>>> architectures, the SYNC is a NULL operation anyway, and that should be
> >>>>>>>> documented somewhere so that maintainers of drivers of uncoherent
> >>>>>>>> architectures have somewhere to point their fingers.
> >>>>>>>> 3) Driver-specific implementations must be allowed to error (segfault)
> >>>>>>>> if SYNC has not been used.
> >>>>>>> I think here you are too lax, the common code must segfault or
> >>>>>>> error badly if SYNC has not been use in all cases even on cache
> >>>>>>> coherent arch. The device driver sync callback can still be a
> >>>>>>> no operation. But i think that we need to insist strongly on a
> >>>>>>> proper sync call being made (and we should forbid empty area
> >>>>>>> sync call). This would be the only way to make sure userspace
> >>>>>>> behave properly as otherwise we endup in the situation you were
> >>>>>>> describing above, where the app is design on a cache coherent
> >>>>>>> arch and works fine there but broke in subtle way on non cache
> >>>>>>> coherent arch and app developer is clueless of why.
> >>>>>>>
> >>>>>>> I just do not trust userspace.
> >>>>>> I agree and ideally i'd want it this way as well. The question is, is it
> >>>>>> possible? Can this be done without a fault() handler in the generic
> >>>>>> kernel code?
> >>>>>>
> >>>>>>>> 4) The use-case stated above clearly shows the benefit of a
> >>>>>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
> >>>>>>>> what if someone in the future wants to use this interface for a 3D
> >>>>>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
> >>>>>>>> extendable in a way that an enum sync_type member defines the layout of
> >>>>>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
> >>>>>>>> for the future?
> >>>>>>>>
> >>>>>>>> Also, I agree there is probably no good way to generically implement an
> >>>>>>>> error if SYNC has not been called. That needs to be left as an option to
> >>>>>>>> drivers.
> >>>>>>> I think there is, just forbid any further use of the dma buffer, mark
> >>>>>>> it as invalid and printk a big error. Userspace app developer will
> >>>>>>> quickly see that something is wrong and looking at kernel log should
> >>>>>>> explain why.
> >>>>>> The problem is if someone calls mmap() and then decides to not access
> >>>>>> the buffer before munmap() or GPU usage. How do we recognize that case
> >>>>>> and separate it from a CPU access occured outside a sync? We could, as
> >>>>>> mentioned above have a fault() handler in the generic kernel code and
> >>>>>> make sure drivers don't populate PTEs until the first fault(). Another
> >>>>>> option would be to scan all PTEs for an "accessed" flag, but I'm not
> >>>>>> even sure all CPU architectures have such a flag?
> >>>>>>
> >>>>>> But there might be a simpler solution that I have overlooked?
> >>>>> All arch have a dirty flag, so you could check that, as all we
> >>>>> really care about is CPU write access. So all you need is clearing
> >>>>> the dirty bit after each successfull GPU command stream ioctl
> >>>>> and checking that no dirty bit is set in a region not cover by
> >>>>> a flush(). Note that the clear and check can happen in the same
> >>>>> function as part of buffer validation for the GPU command stream.
> >>>> Actually, I do think we care about reading as well, since reading
> >>>> without flushing anything written by the GPU will
> >>>> yield garbage on a non-coherent architecture. Instead we might check for
> >>>> the PAGE_ACCESSED bit.
> >>> I do not think the read access after GPU matter. If people think it does
> >>> then it is simple, at buffer validation in buffer mapping we invalidate
> >>> CPU page table mapping and the prepare access ioctl populate them after
> >>> checking that GPU is flush.
> >>>
> >>>> So the full version of this would keep track of all synced regions and
> >>>> at buffer validation time, or unmap time error if there
> >>>> is a page completely outside the union of all synced regions in any VMA
> >>>> belonging to the dma-buf address space, then unmark all accessed PTEs
> >>>> and invalidate TLBs accordingly, typically causing a global TLB flush.
> >>>> Yeah, we could probably put together a helper function that does this,
> >>>> but it will be expensive to run and the whole point of this API is to be
> >>>> able to improve performance.
> >>> No need for such complexity, the sync ioctl would go over the page table
> >>> and clear dirty bit for all page in range that is synchronized. So once
> >>> you get to the command ioctl you know that all dirty bit must be clear.
> >>> Note this is only a page table walk in all case. I do not consider this
> >>> to be that expensive.
> >>>
> >>>
> >> The expensive part is the TLB flushing after the PTE update, which
> >> typically needs to be performed on all cores. Although without a
> >> benchmark I'm not sure how expensive. It might be that it doesn't
> >> matter. After all, the vmwgfx- and I believe the qxl driver both use a
> >> similar approach for fbdev user-space access, but that has never been
> >> cosidered performance-critical...
> > There is no need to flush the TLB ! At least not for the write case.
> > Clearing the dirty bit is cache coherent. TLB are about caching page
> > table directory tree. To convince yourself look at page_mkclean_one()
> > in mm/rmap.c which, by the way, is almost what we want (remove the mmu
> > notifier part and the write protection part and force real CPU cache
> > flush if needed base on device flag).
> 
> Hmm? Last time I visited this code, page_mkclean_one() was eventually
> calling flush_tlb_page() through ptep_clear_flush_notify(). I was under
> the impression that you always have to flush the TLB when you clear a
> PTE status bit, but typically not when you set it. Are you saying that
> that's not the case?

Well, this comes down to hardware implementation details. My
recollection is that the AMD/Intel docs say that setting the dirty bit
is a locked read-modify-write operation, i.e. it is atomic. Now there
is a bit of unknown about how exactly this happens (does the hardware
re-traverse the full directory hierarchy? does it keep a pointer to
the page table entry? ...).

So basically this means that at validation time, if any CPU wrote to
any page inside the range, we would see the dirty bit set; we do not
need to flush anything to see that happening (unless things are mapped
write-combined).

There is a race window if a CPU is writing to the buffer while we are
doing the validation on another CPU. But there is no way around that
race except to unmap and flush the TLB, which is not what we want.

Now, there is something I am not sure about: does the CPU keep a TLB
entry around with the dirty bit set and not redo the locked RMW
sequence, or does it do the locked RMW sequence every time it writes
to a cacheline? I would guess the former, if only for performance
reasons. In that case we would need a TLB flush, but the Linux kernel
already offers helpers to optimize this away.

All of this makes me think that there is no sane way around this. We
are trying to work around potentially broken userspace, mostly because
most of the GPU driver lives in userspace and we can't trust it. I
think this is a lost battle, and it would be better to just provide
the API, say that you must sync, and let userspace that does not bear
the consequences sooner or later.

Cheers,
Jérôme

* Re: about mmap dma-buf and sync
  2015-08-21 23:06   ` Tiago Vignatti
@ 2015-08-24  9:50     ` Thomas Hellstrom
  2015-08-24 15:52       ` Daniel Stone
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-24  9:50 UTC (permalink / raw)
  To: Tiago Vignatti; +Cc: Daniel Vetter, Thomas Hellstrom, dri-devel

Hi!

On 08/22/2015 01:06 AM, Tiago Vignatti wrote:
> Hi back!
>
> On 08/20/2015 03:48 AM, Thomas Hellstrom wrote:
>> Hi, Tiago!
>
> Something that the Chrome OS folks told me today is whether we could
> change the sync API to use a syscall for that instead. Reason for that
> is to eventually fit this in nicely in their sandbox architecture
> requirements, so yet another ioctl wouldn't need to be white-listed
> for the unpriviledged process.
>
> How does that change the game, for example regarding the details we
> been talking here about make the sync mandatory? Can we reuse, say
> msync(2) for that for example?

To me, the mechanism itself is the most important, but I think it should
be much easier to have an IOCTL accepted since the functionality is
limited to dma-bufs.

In any case, ideally I'd want the struct dma_buf_sync to look something like

enum dma_buf_sync_flags {
	DMA_BUF_SYNC_READ = (1 << 0),
	DMA_BUF_SYNC_WRITE = (2 << 0),
	DMA_BUF_SYNC_RW = (3 << 0),
	DMA_BUF_SYNC_START = (0 << 2),
	DMA_BUF_SYNC_END = (1 << 2),

	DMA_BUF_SYNC_VALID_FLAGS_MASK = DMA_BUF_SYNC_RW |
		DMA_BUF_SYNC_END
};


/* begin/end dma-buf functions used for userspace mmap. */
struct dma_buf_sync {
	enum dma_buf_sync_flags flags;
	__u64 x;
	__u64 width;

	/* 2D Sync parameters */
	__u64 stride;
	__u64 y;
	__u64 height;

};


So again, why use a 2D sync in the first place when it's probably
possible to batch 1D syncs in the driver? It all boils down to the
number of syscalls required, and also to the fact that it's probably
not trivial to batch sync calls in the read-write case. One can, again,
argue that there might be 3D cases or other obscure sync cases, but in
that case, let's implement them in the future if the overhead becomes
unbearable. If the sync mechanism gets established as mandatory, the
problem will solve itself. And at least now we cover the most common
case, and we don't need to add a new multiplexing mechanism in the sync
IOCTL since we already have one in the IOCTL mechanism itself. We can
simply add a new sync IOCTL for new obscure sync cases if desired.

Also, I believe this gives us a hint whether to use syscalls or IOCTLs:
if we want to maintain the possibility of extending the synchronization
mechanism, I strongly believe we need to use an IOCTL.

/Thomas


>
> Best Regards,
>
> Tiago



* Re: about mmap dma-buf and sync
  2015-08-24  1:41                   ` Jerome Glisse
@ 2015-08-24  9:59                     ` Thomas Hellstrom
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-24  9:59 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel

On 08/24/2015 03:41 AM, Jerome Glisse wrote:
> On Sat, Aug 22, 2015 at 12:00:21AM +0200, Thomas Hellstrom wrote:
>> On 08/21/2015 06:00 PM, Jerome Glisse wrote:
>>> On Fri, Aug 21, 2015 at 04:15:53PM +0200, Thomas Hellstrom wrote:
>>>> On 08/21/2015 03:32 PM, Jerome Glisse wrote:
>>>>> On Fri, Aug 21, 2015 at 07:25:07AM +0200, Thomas Hellstrom wrote:
>>>>>> On 08/20/2015 10:34 PM, Jerome Glisse wrote:
>>>>>>> On Thu, Aug 20, 2015 at 09:39:12PM +0200, Thomas Hellstrom wrote:
>>>>>>>> On 08/20/2015 04:53 PM, Jerome Glisse wrote:
>>>>>>>>> On Thu, Aug 20, 2015 at 08:48:23AM +0200, Thomas Hellstrom wrote:
>>>>>>>>>> Hi, Tiago!
>>>>>>>>>>
>>>>>>>>>> On 08/20/2015 12:33 AM, Tiago Vignatti wrote:
>>>>>>>>>>> Hey Thomas, you haven't answered my email about making SYNC_* mandatory:
>>>>>>>>>>>
> >>>>>>>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/088376.html
>>>>>>>>>> Hmm, for some reason it doesn't show up in my mail app, but I found it
>>>>>>>>>> in the archives. An attempt to explain the situation from the vmwgfx
>>>>>>>>>> perspective.
>>>>>>>>>>
>>>>>>>>>> The fact that the interface is generic means that people will start
>>>>>>>>>> using it for the zero-copy case. There has been a couple of more or less
>>>>>>>>>> hackish attempts to do this before, and if it's a _driver_ interface we
>>>>>>>>>> don't need to be that careful but if it is a _generic_ interface we need
>>>>>>>>>> to be very careful to make it fit *all* the hardware out there and that
>>>>>>>>>> we make all potential users use the interface in a way that conforms
>>>>>>>>>> with the interface specification.
>>>>>>>>>>
>>>>>>>>>> What will happen otherwise is that apps written for coherent fast
>>>>>>>>>> hardware might, for example, ignore calling the SYNC api, just because
>>>>>>>>>> the app writer only cared about his own hardware on which the app works
>>>>>>>>>> fine. That would fail miserably if the same app was run on incoherent
>>>>>>>>>> hardware, or the incoherent hardware driver maintainers would be forced
>>>>>>>>>> to base an implementation on page-faults which would be very slow.
>>>>>>>>>>
>>>>>>>>>> So assume the following use case: An app updates a 10x10 area using the
>>>>>>>>>> CPU on a 1600x1200 dma-buf, and it will then use the dma-buf for
>>>>>>>>>> texturing. On some hardware the dma-buf might be tiled in a very
>>>>>>>>>> specific way, on vmwgfx the dma-buf is a GPU buffer on the host, only
>>>>>>>>>> accessible using DMA. On vmwgfx the SYNC operation must carry out a
>>>>>>>>>> 10x10 DMA from the host GPU buffer to a guest CPU buffer before the CPU
>>>>>>>>>> write and a DMA back again after the write, before GPU usage. On the
>>>>>>>>>> tiled architecture the SYNC operation must untile before CPU access and
>>>>>>>>>> probably tile again before GPU access.
>>>>>>>>>>
>>>>>>>>>> If we now have a one-dimensional SYNC api, in this particular case we'd
>>>>>>>>>> either need to sync a far too large area (1600x10) or call SYNC 10 times
>>>>>>>>>> before writing, and then again after writing. If the app forgot to call
>>>>>>>>>> SYNC we must error.
>>>>>>>>>>
>>>>>>>>>> So to summarize, the fact that the interface is generic IMO means:
>>>>>>>>>>
>>>>>>>>>> 1) Any user must be able to make valid assumptions about the internal
>>>>>>>>>> format of the dma-buf. (untiled, color format, stride etc.)
>>>>>>>>>> 2) Any user *must* call SYNC before and after CPU access. On coherent
>>>>>>>>>> architectures, the SYNC is a NULL operation anyway, and that should be
>>>>>>>>>> documented somewhere so that maintainers of drivers of uncoherent
>>>>>>>>>> architectures have somewhere to point their fingers.
>>>>>>>>>> 3) Driver-specific implementations must be allowed to error (segfault)
>>>>>>>>>> if SYNC has not been used.
>>>>>>>>> I think here you are too lax, the common code must segfault or
>>>>>>>>> error badly if SYNC has not been use in all cases even on cache
>>>>>>>>> coherent arch. The device driver sync callback can still be a
>>>>>>>>> no operation. But i think that we need to insist strongly on a
>>>>>>>>> proper sync call being made (and we should forbid empty area
>>>>>>>>> sync call). This would be the only way to make sure userspace
>>>>>>>>> behave properly as otherwise we endup in the situation you were
>>>>>>>>> describing above, where the app is design on a cache coherent
>>>>>>>>> arch and works fine there but broke in subtle way on non cache
>>>>>>>>> coherent arch and app developer is clueless of why.
>>>>>>>>>
>>>>>>>>> I just do not trust userspace.
>>>>>>>> I agree and ideally i'd want it this way as well. The question is, is it
>>>>>>>> possible? Can this be done without a fault() handler in the generic
>>>>>>>> kernel code?
>>>>>>>>
>>>>>>>>>> 4) The use-case stated above clearly shows the benefit of a
>>>>>>>>>> 2-dimensional sync interface (we want to sync the 10x10 region), but
>>>>>>>>>> what if someone in the future wants to use this interface for a 3D
>>>>>>>>>> texture? Will a 2D sync suffice? Can we make the SYNC interface
>>>>>>>>>> extendable in a way that an enum sync_type member defines the layout of
>>>>>>>>>> the argument, and initially we implement only 1d, 2d sync, leaving 3d
>>>>>>>>>> for the future?
>>>>>>>>>>
>>>>>>>>>> Also, I agree there is probably no good way to generically implement an
>>>>>>>>>> error if SYNC has not been called. That needs to be left as an option to
>>>>>>>>>> drivers.
>>>>>>>>> I think there is, just forbid any further use of the dma buffer, mark
>>>>>>>>> it as invalid and printk a big error. Userspace app developer will
>>>>>>>>> quickly see that something is wrong and looking at kernel log should
>>>>>>>>> explain why.
>>>>>>>> The problem is if someone calls mmap() and then decides to not access
>>>>>>>> the buffer before munmap() or GPU usage. How do we recognize that case
>>>>>>>> and separate it from a CPU access occured outside a sync? We could, as
>>>>>>>> mentioned above have a fault() handler in the generic kernel code and
>>>>>>>> make sure drivers don't populate PTEs until the first fault(). Another
>>>>>>>> option would be to scan all PTEs for an "accessed" flag, but I'm not
>>>>>>>> even sure all CPU architectures have such a flag?
>>>>>>>>
>>>>>>>> But there might be a simpler solution that I have overlooked?
>>>>>>> All arch have a dirty flag, so you could check that, as all we
>>>>>>> really care about is CPU write access. So all you need is clearing
>>>>>>> the dirty bit after each successfull GPU command stream ioctl
>>>>>>> and checking that no dirty bit is set in a region not cover by
>>>>>>> a flush(). Note that the clear and check can happen in the same
>>>>>>> function as part of buffer validation for the GPU command stream.
>>>>>> Actually, I do think we care about reading as well, since reading
>>>>>> without flushing anything written by the GPU will
>>>>>> yield garbage on a non-coherent architecture. Instead we might check for
>>>>>> the PAGE_ACCESSED bit.
>>>>> I do not think the read access after GPU matter. If people think it does
>>>>> then it is simple, at buffer validation in buffer mapping we invalidate
>>>>> CPU page table mapping and the prepare access ioctl populate them after
>>>>> checking that GPU is flush.
>>>>>
>>>>>> So the full version of this would keep track of all synced regions and
>>>>>> at buffer validation time, or unmap time error if there
>>>>>> is a page completely outside the union of all synced regions in any VMA
>>>>>> belonging to the dma-buf address space, then unmark all accessed PTEs
>>>>>> and invalidate TLBs accordingly, typically causing a global TLB flush.
>>>>>> Yeah, we could probably put together a helper function that does this,
>>>>>> but it will be expensive to run and the whole point of this API is to be
>>>>>> able to improve performance.
>>>>> No need for such complexity, the sync ioctl would go over the page table
>>>>> and clear dirty bit for all page in range that is synchronized. So once
>>>>> you get to the command ioctl you know that all dirty bit must be clear.
>>>>> Note this is only a page table walk in all case. I do not consider this
>>>>> to be that expensive.
>>>>>
>>>>>
>>>> The expensive part is the TLB flushing after the PTE update, which
>>>> typically needs to be performed on all cores. Although without a
>>>> benchmark I'm not sure how expensive. It might be that it doesn't
>>>> matter. After all, the vmwgfx- and I believe the qxl driver both use a
>>>> similar approach for fbdev user-space access, but that has never been
>>>> cosidered performance-critical...
>>> There is no need to flush the TLB ! At least not for the write case.
>>> Clearing the dirty bit is cache coherent. TLB are about caching page
>>> table directory tree. To convince yourself look at page_mkclean_one()
>>> in mm/rmap.c which, by the way, is almost what we want (remove the mmu
>>> notifier part and the write protection part and force real CPU cache
>>> flush if needed base on device flag).
>> Hmm? Last time I visited this code, page_mkclean_one() was eventually
>> calling flush_tlb_page() through ptep_clear_flush_notify(). I was under
>> the impression that you always have to flush the TLB when you clear a
>> PTE status bit, but typically not when you set it. Are you saying that
>> that's not the case?
> Well this is very down to the implementation details of hw, consideration
> and my memory is that the AMD/Intel doc says that write is a locked read
> modify write operation ie it is atomic. Now there is a bit of unknown on
> how exactly this happens (does the hardware retraverse the full directory
> hierarchy ? does it keeps a pointer to the page table entry ? ...).
>
> So basicly this means that at time of validation if any CPU did write to
> any page inside the range we would see the dirty set, we do not need to
> flush anything to see that happening (unless things are mapped write
> combine).
>
> Well there is a race window if CPU is writing to the buffer while we are
> doing the validation on another CPU. But there is no way around that race
> except to unmap and flush TLB. Which is not what we want.
>
> Now, there is something i am not sure, does CPU keep TLB entry around
> with dirty bit set and do not redo the RMW locked sequence, or does it
> do the RMW locked sequence everytime it write to a cacheline. I would
> guess the former if only for perf reasons. In this case we would need
> TLB flush but linux kernel already offer helpers to optimize this away.
>
>
> All this kind of make me think that there is no sane way around this.
> I mean we are trying to work around potentialy broken userspace mostly
> because most of the GPU driver is in userspace and we can't trust it.
> I think this is just a lost battle and it would be better to just
> provide the API and says that you must sync and userspace that do not
> will get to bear the consequence sooner or latter.
>
> Cheers,
> Jérôme

It would still be possible to track down those cases where SYNC is
completely ignored, by traversing the page tables looking for
PAGE_ACCESSED without any SYNC whatsoever. Sure, userspace could find
ways around that, but it would still force people to read up on the
SYNC functionality.

/Thomas



* Re: about mmap dma-buf and sync
  2015-08-24  9:50     ` Thomas Hellstrom
@ 2015-08-24 15:52       ` Daniel Stone
  2015-08-24 16:56         ` Thomas Hellstrom
  2015-08-25  9:02         ` Daniel Vetter
  0 siblings, 2 replies; 42+ messages in thread
From: Daniel Stone @ 2015-08-24 15:52 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Thomas Hellstrom, dri-devel, Daniel Vetter

Hi Thomas,

On 24 August 2015 at 10:50, Thomas Hellstrom <thomas@shipmail.org> wrote:
> In any case, Ideally I'd want the struct dma_buf_sync look something like
>
> enum dma_buf_sync_flags {
>         DMA_BUF_SYNC_READ = (1 << 0),
>         DMA_BUF_SYNC_WRITE = (2 << 0),
>         DMA_BUF_SYNC_RW = (3 << 0),
>         DMA_BUF_SYNC_START = (0 << 2),
>         DMA_BUF_SYNC_END = (1 << 2),
>
>         DMA_BUF_SYNC_VALID_FLAGS_MASK = DMA_BUF_SYNC_RW |
>                 DMA_BUF_SYNC_END
> };
>
>
> /* begin/end dma-buf functions used for userspace mmap. */
> struct dma_buf_sync {
>         enum dma_buf_sync_flags flags;
>         __u64 x;
>         __u64 width;
>
>         /* 2D Sync parameters */
>         __u64 stride;
>         __u64 y;
>         __u64 height;
>
> }
>
>
> So again, why use a 2D sync in the first place when it's probably possible to batch 1D syncs in the driver?
> It all boils down to the number of syscalls required, and also the fact that it's probably not trivial to batch sync calls in the read-write case. One can, again, argue that there might be 3D cases or other obscure sync cases,
> but in that case, let's implement them in the future if the overhead becomes unbearable. If the sync mechanism gets established as
> mandatory, the problem will solve itself. And at least now we cover the most common case and we don't need to add a new multiplexing
> mechanism in the sync IOCTL since we already have one in the IOCTL mechanism. We can simply add a new sync IOCTL for new obscure sync cases if desired.

I still don't think this removes the need for batching: consider the
case where you update two disjoint screen regions and want them both
flushed. Either you issue two separate sync calls (which can be
disadvantageous performance-wise on some hardware/setups), or you
accumulate the regions and only flush later. So either two ioctls (one
in the style of dirtyfb and one to perform the sync/flush; you can
shortcut by assuming the full buffer was damaged if the former is
missing), or one like this:

struct dma_buf_sync_2d {
        enum dma_buf_sync_flags flags;

        __u64 stride_bytes;
        __u32 bytes_per_pixel;
        __u32 num_regions;

        struct {
                __u64 x;
                __u64 y;
                __u64 width;
                __u64 height;
        } regions[];
};

Cheers,
Daniel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: about mmap dma-buf and sync
  2015-08-24 15:52       ` Daniel Stone
@ 2015-08-24 16:56         ` Thomas Hellstrom
  2015-08-24 17:04           ` Daniel Stone
  2015-08-25  9:02         ` Daniel Vetter
  1 sibling, 1 reply; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-24 16:56 UTC (permalink / raw)
  To: Daniel Stone; +Cc: dri-devel, Daniel Vetter

Hi, Daniel,

On 08/24/2015 05:52 PM, Daniel Stone wrote:
> Hi Thomas,
>
> On 24 August 2015 at 10:50, Thomas Hellstrom <thomas@shipmail.org> wrote:
>> In any case, Ideally I'd want the struct dma_buf_sync look something like
>>
>> enum dma_buf_sync_flags {
>>         DMA_BUF_SYNC_READ = (1 << 0),
>>         DMA_BUF_SYNC_WRITE = (2 << 0),
>>         DMA_BUF_SYNC_RW = (3 << 0),
>>         DMA_BUF_SYNC_START = (0 << 2),
>>         DMA_BUF_SYNC_END = (1 << 2),
>>
>>         DMA_BUF_SYNC_VALID_FLAGS_MASK = DMA_BUF_SYNC_RW |
>>                 DMA_BUF_SYNC_END
>> };
>>
>>
>> /* begin/end dma-buf functions used for userspace mmap. */
>> struct dma_buf_sync {
>>         enum dma_buf_sync_flags flags;
>>         __u64 x;
>>         __u64 width;
>>
>>         /* 2D Sync parameters */
>>         __u64 stride;
>>         __u64 y;
>>         __u64 height;
>>
>> }
>>
>>
>> So again, why use a 2D sync in the first place when it's probably possible to batch 1D syncs in the driver?
>> It all boils down to the number of syscalls required, and also the fact that it's probably not trivial to batch sync calls in the read-write case. One can, again, argue that there might be 3D cases or other obscure sync cases,
>> but in that case, let's implement them in the future if the overhead becomes unbearable. If the sync mechanism gets established as
>> mandatory, the problem will solve itself. And at least now we cover the most common case and we don't need to add a new multiplexing
>> mechanism in the sync IOCTL since we already have one in the IOCTL mechanism. We can simply add a new sync IOCTL for new obscure sync cases if desired.
> I still don't think this ameliorates the need for batching: consider
> the case where you update two disjoint screen regions and want them
> both flushed. Either you issue two separate sync calls (which can be
> disadvantageous performance-wise on some hardware / setups), or you
> accumulate the regions and only flush later. So either two ioctls (one
> in the style of dirtyfb and one to perform the sync/flush; you can
> shortcut to assume the full buffer was damaged if the former is
> missing), or one like this:
>
> struct dma_buf_sync_2d {
>         enum dma_buf_sync_flags flags;
>
>         __u64 stride_bytes;
>         __u32 bytes_per_pixel;
>         __u32 num_regions;
>
>         struct {
>                 __u64 x;
>                 __u64 y;
>                 __u64 width;
>                 __u64 height;
>         } regions[];
> };
>
> Cheers,
> Daniel

Fine with me, although perhaps bytes_per_pixel is a bit redundant?

/Thomas


* Re: about mmap dma-buf and sync
  2015-08-24 16:56         ` Thomas Hellstrom
@ 2015-08-24 17:04           ` Daniel Stone
  2015-08-24 17:10             ` Thomas Hellstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Stone @ 2015-08-24 17:04 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel, Daniel Vetter

Hi,

On 24 August 2015 at 17:56, Thomas Hellstrom <thellstrom@vmware.com> wrote:
> On 08/24/2015 05:52 PM, Daniel Stone wrote:
>> I still don't think this ameliorates the need for batching: consider
>> the case where you update two disjoint screen regions and want them
>> both flushed. Either you issue two separate sync calls (which can be
>> disadvantageous performance-wise on some hardware / setups), or you
>> accumulate the regions and only flush later. So either two ioctls (one
>> in the style of dirtyfb and one to perform the sync/flush; you can
>> shortcut to assume the full buffer was damaged if the former is
>> missing), or one like this:
>>
>> struct dma_buf_sync_2d {
>>         enum dma_buf_sync_flags flags;
>>
>>         __u64 stride_bytes;
>>         __u32 bytes_per_pixel;
>>         __u32 num_regions;
>>
>>         struct {
>>                 __u64 x;
>>                 __u64 y;
>>                 __u64 width;
>>                 __u64 height;
>>         } regions[];
>> };
>>
>> Cheers,
>> Daniel
>
> Fine with me, although perhaps bytes_per_pixel is a bit redundant?

Redundant how? It's not implicit in stride.

Cheers,
Daniel

* Re: about mmap dma-buf and sync
  2015-08-24 17:04           ` Daniel Stone
@ 2015-08-24 17:10             ` Thomas Hellstrom
  2015-08-24 17:12               ` Daniel Stone
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-24 17:10 UTC (permalink / raw)
  To: Daniel Stone; +Cc: dri-devel, Daniel Vetter

On 08/24/2015 07:04 PM, Daniel Stone wrote:
> Hi,
>
> On 24 August 2015 at 17:56, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>> On 08/24/2015 05:52 PM, Daniel Stone wrote:
>>> I still don't think this ameliorates the need for batching: consider
>>> the case where you update two disjoint screen regions and want them
>>> both flushed. Either you issue two separate sync calls (which can be
>>> disadvantageous performance-wise on some hardware / setups), or you
>>> accumulate the regions and only flush later. So either two ioctls (one
>>> in the style of dirtyfb and one to perform the sync/flush; you can
>>> shortcut to assume the full buffer was damaged if the former is
>>> missing), or one like this:
>>>
>>> struct dma_buf_sync_2d {
>>>         enum dma_buf_sync_flags flags;
>>>
>>>         __u64 stride_bytes;
>>>         __u32 bytes_per_pixel;
>>>         __u32 num_regions;
>>>
>>>         struct {
>>>                 __u64 x;
>>>                 __u64 y;
>>>                 __u64 width;
>>>                 __u64 height;
>>>         } regions[];
>>> };
>>>
>>> Cheers,
>>> Daniel
>> Fine with me, although perhaps bytes_per_pixel is a bit redundant?
> Redundant how? It's not implicit in stride.

For flushing purposes, isn't it possible to cover all cases by assuming
bytes_per_pixel=1? Not that it matters much.
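To make the bytes_per_pixel=1 point concrete: a 2D region reduces to one byte
range per row, and folding the pixel size into x (i.e. passing x as a byte
offset) yields exactly the same row offsets. A sketch (the helper name is
mine, not part of any proposal):

```c
#include <stdint.h>

/* Byte offset of row 'row' (counted inside the region) of a region
 * whose top-left pixel is at (x, y) in a buffer with the given stride. */
static inline uint64_t
region_row_offset(uint64_t stride_bytes, uint32_t bytes_per_pixel,
                  uint64_t x, uint64_t y, uint64_t row)
{
        return (y + row) * stride_bytes + (uint64_t)bytes_per_pixel * x;
}
```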

/Thomas


* Re: about mmap dma-buf and sync
  2015-08-24 17:10             ` Thomas Hellstrom
@ 2015-08-24 17:12               ` Daniel Stone
  2015-08-24 17:42                 ` Thomas Hellstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Stone @ 2015-08-24 17:12 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel, Daniel Vetter

Hi,

On 24 August 2015 at 18:10, Thomas Hellstrom <thellstrom@vmware.com> wrote:
> On 08/24/2015 07:04 PM, Daniel Stone wrote:
>> On 24 August 2015 at 17:56, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>> On 08/24/2015 05:52 PM, Daniel Stone wrote:
>>>> I still don't think this ameliorates the need for batching: consider
>>>> the case where you update two disjoint screen regions and want them
>>>> both flushed. Either you issue two separate sync calls (which can be
>>>> disadvantageous performance-wise on some hardware / setups), or you
>>>> accumulate the regions and only flush later. So either two ioctls (one
>>>> in the style of dirtyfb and one to perform the sync/flush; you can
>>>> shortcut to assume the full buffer was damaged if the former is
>>>> missing), or one like this:
>>>>
>>>> struct dma_buf_sync_2d {
>>>>         enum dma_buf_sync_flags flags;
>>>>
>>>>         __u64 stride_bytes;
>>>>         __u32 bytes_per_pixel;
>>>>         __u32 num_regions;
>>>>
>>>>         struct {
>>>>                 __u64 x;
>>>>                 __u64 y;
>>>>                 __u64 width;
>>>>                 __u64 height;
>>>>         } regions[];
>>>> };
>>> Fine with me, although perhaps bytes_per_pixel is a bit redundant?
>> Redundant how? It's not implicit in stride.
>
> For flushing purposes, isn't it possible to cover all cases by assuming
> bytes_per_pixel=1? Not that it matters much.

Sure, though in that case best to replace x with line_byte_offset or
something, because otherwise I guarantee you everyone will get it
wrong, and it'll be a pain to track down. Like how I managed to
misread it now. :)

Cheers,
Daniel

* Re: about mmap dma-buf and sync
  2015-08-24 17:12               ` Daniel Stone
@ 2015-08-24 17:42                 ` Thomas Hellstrom
  2015-08-24 17:56                   ` Michael Spang
  2015-08-24 18:01                   ` Tiago Vignatti
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-24 17:42 UTC (permalink / raw)
  To: Daniel Stone; +Cc: dri-devel, Daniel Vetter

On 08/24/2015 07:12 PM, Daniel Stone wrote:
> Hi,
>
> On 24 August 2015 at 18:10, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>> On 08/24/2015 07:04 PM, Daniel Stone wrote:
>>> On 24 August 2015 at 17:56, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>>> On 08/24/2015 05:52 PM, Daniel Stone wrote:
>>>>> I still don't think this ameliorates the need for batching: consider
>>>>> the case where you update two disjoint screen regions and want them
>>>>> both flushed. Either you issue two separate sync calls (which can be
>>>>> disadvantageous performance-wise on some hardware / setups), or you
>>>>> accumulate the regions and only flush later. So either two ioctls (one
>>>>> in the style of dirtyfb and one to perform the sync/flush; you can
>>>>> shortcut to assume the full buffer was damaged if the former is
>>>>> missing), or one like this:
>>>>>
>>>>> struct dma_buf_sync_2d {
>>>>>         enum dma_buf_sync_flags flags;
>>>>>
>>>>>         __u64 stride_bytes;
>>>>>         __u32 bytes_per_pixel;
>>>>>         __u32 num_regions;
>>>>>
>>>>>         struct {
>>>>>                 __u64 x;
>>>>>                 __u64 y;
>>>>>                 __u64 width;
>>>>>                 __u64 height;
>>>>>         } regions[];
>>>>> };
>>>> Fine with me, although perhaps bytes_per_pixel is a bit redundant?
>>> Redundant how? It's not implicit in stride.
>> For flushing purposes, isn't it possible to cover all cases by assuming
>> bytes_per_pixel=1? Not that it matters much.
> Sure, though in that case best to replace x with line_byte_offset or
> something, because otherwise I guarantee you everyone will get it
> wrong, and it'll be a pain to track down. Like how I managed to
> misread it now. :)

OK, yeah you have a point. IMO let's go for your proposal.

Tiago, is this OK with you?

/Thomas





> Cheers,
> Daniel


* Re: about mmap dma-buf and sync
  2015-08-24 17:42                 ` Thomas Hellstrom
@ 2015-08-24 17:56                   ` Michael Spang
  2015-08-25  5:25                     ` Thomas Hellstrom
  2015-08-24 18:01                   ` Tiago Vignatti
  1 sibling, 1 reply; 42+ messages in thread
From: Michael Spang @ 2015-08-24 17:56 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: Daniel Vetter, dri-devel

On Mon, Aug 24, 2015 at 1:42 PM, Thomas Hellstrom <thellstrom@vmware.com> wrote:
> On 08/24/2015 07:12 PM, Daniel Stone wrote:
>> Hi,
>>
>> On 24 August 2015 at 18:10, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>> On 08/24/2015 07:04 PM, Daniel Stone wrote:
>>>> On 24 August 2015 at 17:56, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>>>> On 08/24/2015 05:52 PM, Daniel Stone wrote:
>>>>>> I still don't think this ameliorates the need for batching: consider
>>>>>> the case where you update two disjoint screen regions and want them
>>>>>> both flushed. Either you issue two separate sync calls (which can be
>>>>>> disadvantageous performance-wise on some hardware / setups), or you
>>>>>> accumulate the regions and only flush later. So either two ioctls (one
>>>>>> in the style of dirtyfb and one to perform the sync/flush; you can
>>>>>> shortcut to assume the full buffer was damaged if the former is
>>>>>> missing), or one like this:
>>>>>>
>>>>>> struct dma_buf_sync_2d {
>>>>>>         enum dma_buf_sync_flags flags;
>>>>>>
>>>>>>         __u64 stride_bytes;
>>>>>>         __u32 bytes_per_pixel;
>>>>>>         __u32 num_regions;
>>>>>>
>>>>>>         struct {
>>>>>>                 __u64 x;
>>>>>>                 __u64 y;
>>>>>>                 __u64 width;
>>>>>>                 __u64 height;
>>>>>>         } regions[];
>>>>>> };
>>>>> Fine with me, although perhaps bytes_per_pixel is a bit redundant?
>>>> Redundant how? It's not implicit in stride.
>>> For flushing purposes, isn't it possible to cover all cases by assuming
>>> bytes_per_pixel=1? Not that it matters much.
>> Sure, though in that case best to replace x with line_byte_offset or
>> something, because otherwise I guarantee you everyone will get it
>> wrong, and it'll be a pain to track down. Like how I managed to
>> misread it now. :)
>
> OK, yeah you have a point. IMO let's go for your proposal.
>
> Tiago, is this OK with you?

Is there any obstacle to making the above API a new syscall?

The reason we have issues with ioctls is because it's not possible to
whitelist them properly with seccomp BPF - there's no single namespace
for ioctls.

If this API is merged as an ioctl, we may not be able to actually use
it. Which is a bit unfortunate when it has been proposed with the
Chrome renderer use case in mind.
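To illustrate the namespace problem: an ioctl command number only encodes a
direction, an argument size, a per-driver magic byte and a command index
(mirroring the asm-generic/ioctl.h field layout), so nothing in the number
identifies which driver interprets it, and a seccomp filter inspecting the
ioctl(2) cmd argument cannot tell two drivers' commands apart. A sketch of
the encoding:

```c
#include <stdint.h>

#define IOC_WRITE 1u

/* Same field layout as the kernel's _IOC() macro:
 * bits 0-7 nr, 8-15 type (driver magic), 16-29 size, 30-31 direction. */
static uint32_t ioc(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
        return (dir << 30) | (size << 16) | (type << 8) | nr;
}
```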

Michael

* Re: about mmap dma-buf and sync
  2015-08-24 17:42                 ` Thomas Hellstrom
  2015-08-24 17:56                   ` Michael Spang
@ 2015-08-24 18:01                   ` Tiago Vignatti
  2015-08-24 23:03                     ` Tiago Vignatti
  1 sibling, 1 reply; 42+ messages in thread
From: Tiago Vignatti @ 2015-08-24 18:01 UTC (permalink / raw)
  To: Thomas Hellstrom, Daniel Stone; +Cc: Daniel Vetter, dri-devel

On 08/24/2015 02:42 PM, Thomas Hellstrom wrote:
> On 08/24/2015 07:12 PM, Daniel Stone wrote:
>> Hi,
>>
>> On 24 August 2015 at 18:10, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>> On 08/24/2015 07:04 PM, Daniel Stone wrote:
>>>> On 24 August 2015 at 17:56, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>>>> On 08/24/2015 05:52 PM, Daniel Stone wrote:
>>>>>> I still don't think this ameliorates the need for batching: consider
>>>>>> the case where you update two disjoint screen regions and want them
>>>>>> both flushed. Either you issue two separate sync calls (which can be
>>>>>> disadvantageous performance-wise on some hardware / setups), or you
>>>>>> accumulate the regions and only flush later. So either two ioctls (one
>>>>>> in the style of dirtyfb and one to perform the sync/flush; you can
>>>>>> shortcut to assume the full buffer was damaged if the former is
>>>>>> missing), or one like this:
>>>>>>
>>>>>> struct dma_buf_sync_2d {
>>>>>>          enum dma_buf_sync_flags flags;
>>>>>>
>>>>>>          __u64 stride_bytes;
>>>>>>          __u32 bytes_per_pixel;
>>>>>>          __u32 num_regions;
>>>>>>
>>>>>>          struct {
>>>>>>                  __u64 x;
>>>>>>                  __u64 y;
>>>>>>                  __u64 width;
>>>>>>                  __u64 height;
>>>>>>          } regions[];
>>>>>> };
>>>>> Fine with me, although perhaps bytes_per_pixel is a bit redundant?
>>>> Redundant how? It's not implicit in stride.
>>> For flushing purposes, isn't it possible to cover all cases by assuming
>>> bytes_per_pixel=1? Not that it matters much.
>> Sure, though in that case best to replace x with line_byte_offset or
>> something, because otherwise I guarantee you everyone will get it
>> wrong, and it'll be a pain to track down. Like how I managed to
>> misread it now. :)
>
> OK, yeah you have a point. IMO let's go for your proposal.
>
> Tiago, is this OK with you?

yup, I think so. So IIUC the main changes needed for the drivers to 
implement 2D sync lie in the dma_buf_sync_2d structure only, i.e. 
there's nothing really to be changed in the common code, right? Then 
I'll just need to put the logic about making sync mandatory somewhere, 
which I couldn't conclude much about from your discussions with Jerome 
et al. I'll need to investigate more here.

Also, I still want to iterate with the Google policy team about the 
actual need for a syscall. But I believe that could eventually be a 
secondary phase of this work (in case we ever agree upon having that).

Tiago


* Re: about mmap dma-buf and sync
  2015-08-24 18:01                   ` Tiago Vignatti
@ 2015-08-24 23:03                     ` Tiago Vignatti
  0 siblings, 0 replies; 42+ messages in thread
From: Tiago Vignatti @ 2015-08-24 23:03 UTC (permalink / raw)
  To: dri-devel

On 08/24/2015 03:01 PM, Tiago Vignatti wrote:
> yup, I think so. So IIUC the main changes needed for the drivers
> implement 2D sync lies in the dma_buf_sync_2d structure only. I.e.
> there's nothing really to be changed in the common code, right?

Do we have any special requirements on how we want to pass the sync 
information to the drivers? I was thinking of pushing the whole 
responsibility to them, something like:

+int dma_buf_begin_cpu_access(struct dma_buf *dma_buf, size_t stride_bytes,
+                             size_t bytes_per_pixel, size_t num_regions,
+                             struct dma_buf_sync_region regions[],
+                             enum dma_data_direction dir);
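As a sketch of what a driver could do with such a prototype (the struct and
helper names here are hypothetical), it would iterate the region list and
flush each rectangle; reduced below to just computing the bytes touched:

```c
#include <stddef.h>
#include <stdint.h>

struct dma_buf_sync_region {
        uint64_t x, y, width, height;
};

/* Total bytes a driver would flush for a region list; a real
 * implementation would issue one DMA/flush per region (or per row
 * range) instead of just summing them up. */
static uint64_t
sync_regions_bytes(uint32_t bytes_per_pixel,
                   const struct dma_buf_sync_region *regions, size_t n)
{
        uint64_t total = 0;

        for (size_t i = 0; i < n; i++)
                total += regions[i].width * regions[i].height *
                         bytes_per_pixel;
        return total;
}
```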

Daniel Vetter mentioned that the dma-buf design should not track 
metadata, but I haven't read anything about it, so do you think this 
looks alright?

Tiago

* Re: about mmap dma-buf and sync
  2015-08-24 17:56                   ` Michael Spang
@ 2015-08-25  5:25                     ` Thomas Hellstrom
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-25  5:25 UTC (permalink / raw)
  To: Michael Spang; +Cc: Daniel Vetter, dri-devel

Hi,

On 08/24/2015 07:56 PM, Michael Spang wrote:
> On Mon, Aug 24, 2015 at 1:42 PM, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>> On 08/24/2015 07:12 PM, Daniel Stone wrote:
>>> Hi,
>>>
>>> On 24 August 2015 at 18:10, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>>> On 08/24/2015 07:04 PM, Daniel Stone wrote:
>>>>> On 24 August 2015 at 17:56, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>>>>> On 08/24/2015 05:52 PM, Daniel Stone wrote:
>>>>>>> I still don't think this ameliorates the need for batching: consider
>>>>>>> the case where you update two disjoint screen regions and want them
>>>>>>> both flushed. Either you issue two separate sync calls (which can be
>>>>>>> disadvantageous performance-wise on some hardware / setups), or you
>>>>>>> accumulate the regions and only flush later. So either two ioctls (one
>>>>>>> in the style of dirtyfb and one to perform the sync/flush; you can
>>>>>>> shortcut to assume the full buffer was damaged if the former is
>>>>>>> missing), or one like this:
>>>>>>>
>>>>>>> struct dma_buf_sync_2d {
>>>>>>>         enum dma_buf_sync_flags flags;
>>>>>>>
>>>>>>>         __u64 stride_bytes;
>>>>>>>         __u32 bytes_per_pixel;
>>>>>>>         __u32 num_regions;
>>>>>>>
>>>>>>>         struct {
>>>>>>>                 __u64 x;
>>>>>>>                 __u64 y;
>>>>>>>                 __u64 width;
>>>>>>>                 __u64 height;
>>>>>>>         } regions[];
>>>>>>> };
>>>>>> Fine with me, although perhaps bytes_per_pixel is a bit redundant?
>>>>> Redundant how? It's not implicit in stride.
>>>> For flushing purposes, isn't it possible to cover all cases by assuming
>>>> bytes_per_pixel=1? Not that it matters much.
>>> Sure, though in that case best to replace x with line_byte_offset or
>>> something, because otherwise I guarantee you everyone will get it
>>> wrong, and it'll be a pain to track down. Like how I managed to
>>> misread it now. :)
>> OK, yeah you have a point. IMO let's go for your proposal.
>>
>> Tiago, is this OK with you?
> Is there any obstacle to making the above API a new syscall?

I think that whether this API can be accepted as a syscall or not is
really up to Dave and, in the end, Linus.
What speaks against it is that it's not applicable to all file types,
only dma-bufs, and also that we might want to extend it in the future.

>
> The reason we have issues with ioctls is because it's not possible to
> whitelist them properly with seccomp BPF - there's no single namespace
> for ioctls.
>
> If this API is merged as a ioctl, we may not be able to actually use
> it. Which is a bit unfortunate when it has been proposed with the
> chrome renderer use case in mind.

I don't think mmap() on dma-bufs can be accepted as a _generic_ API
without a mandatory sync mechanism, and I think there is a need to
specify a sync api regardless of who is the first user.

Thanks,
Thomas


>
> Michael


* Re: about mmap dma-buf and sync
  2015-08-24 15:52       ` Daniel Stone
  2015-08-24 16:56         ` Thomas Hellstrom
@ 2015-08-25  9:02         ` Daniel Vetter
  2015-08-25  9:30           ` Thomas Hellstrom
  1 sibling, 1 reply; 42+ messages in thread
From: Daniel Vetter @ 2015-08-25  9:02 UTC (permalink / raw)
  To: Daniel Stone; +Cc: Daniel Vetter, Thomas Hellstrom, dri-devel

On Mon, Aug 24, 2015 at 04:52:13PM +0100, Daniel Stone wrote:
> Hi Thomas,
> 
> On 24 August 2015 at 10:50, Thomas Hellstrom <thomas@shipmail.org> wrote:
> > In any case, Ideally I'd want the struct dma_buf_sync look something like
> >
> > enum dma_buf_sync_flags {
> >         DMA_BUF_SYNC_READ = (1 << 0),
> >         DMA_BUF_SYNC_WRITE = (2 << 0),
> >         DMA_BUF_SYNC_RW = (3 << 0),
> >         DMA_BUF_SYNC_START = (0 << 2),
> >         DMA_BUF_SYNC_END = (1 << 2),
> >
> >         DMA_BUF_SYNC_VALID_FLAGS_MASK = DMA_BUF_SYNC_RW |
> >                 DMA_BUF_SYNC_END
> > };
> >
> >
> > /* begin/end dma-buf functions used for userspace mmap. */
> > struct dma_buf_sync {
> >         enum dma_buf_sync_flags flags;
> >         __u64 x;
> >         __u64 width;
> >
> >         /* 2D Sync parameters */
> >         __u64 stride;
> >         __u64 y;
> >         __u64 height;
> >
> > }
> >
> >
> > So again, why use a 2D sync in the first place when it's probably possible to batch 1D syncs in the driver?
> > It all boils down to the number of syscalls required, and also the fact that it's probably not trivial to batch sync calls in the read-write case. One can, again, argue that there might be 3D cases or other obscure sync cases,
> > but in that case, let's implement them in the future if the overhead becomes unbearable. If the sync mechanism gets established as
> > mandatory, the problem will solve itself. And at least now we cover the most common case and we don't need to add a new multiplexing
> > mechanism in the sync IOCTL since we already have one in the IOCTL mechanism. We can simply add a new sync IOCTL for new obscure sync cases if desired.
> 
> I still don't think this ameliorates the need for batching: consider
> the case where you update two disjoint screen regions and want them
> both flushed. Either you issue two separate sync calls (which can be
> disadvantageous performance-wise on some hardware / setups), or you
> accumulate the regions and only flush later. So either two ioctls (one
> in the style of dirtyfb and one to perform the sync/flush; you can
> shortcut to assume the full buffer was damaged if the former is
> missing), or one like this:
> 
> struct dma_buf_sync_2d {
>         enum dma_buf_sync_flags flags;
> 
>         __u64 stride_bytes;
>         __u32 bytes_per_pixel;
>         __u32 num_regions;
> 
>         struct {
>                 __u64 x;
>                 __u64 y;
>                 __u64 width;
>                 __u64 height;
>         } regions[];
> };

I really feel like any kind of multi-range flush interface is feature
bloat, and if we do it then we should only do it later on when there's a
clear need for it.

Afaiui dma-buf mmap will mostly be used for up/downloads, which means
there will be some explicit copy involved somewhere anyway. So similar to
userptr usage. And for those most often you just slurp in a linear block
and then let the blitter rearrange it, at least on i915.
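The "slurp in a linear block" pattern boils down to a row-by-row copy between
a tightly packed staging buffer and the strided mapping, bracketed by the
SYNC start/end calls. A sketch of just the copy (the sync calls and the
helper name are elided/invented here, not part of any existing API):

```c
#include <stdint.h>
#include <string.h>

/* Upload a tightly packed source rectangle into a strided destination,
 * one row at a time; downloads are the same loop with src and dst
 * swapped. In the mmap flow this would sit between the SYNC_START and
 * SYNC_END calls. */
static void
copy_rect(uint8_t *dst, uint64_t dst_stride,
          const uint8_t *src, uint64_t row_bytes, uint64_t rows)
{
        for (uint64_t r = 0; r < rows; r++)
                memcpy(dst + r * dst_stride, src + r * row_bytes, row_bytes);
}
```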

Also thus far the consensus was that dma-buf doesn't care/know about
content layout, adding strides/bytes/whatever does bend this quite a bit.

And finally if we concentrate on the simple-case of one byte range we can
focus the discussion on correctness, which imo is the important bit here
really.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: about mmap dma-buf and sync
  2015-08-25  9:02         ` Daniel Vetter
@ 2015-08-25  9:30           ` Thomas Hellstrom
  2015-08-25 16:14             ` Tiago Vignatti
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-25  9:30 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, dri-devel

On 08/25/2015 11:02 AM, Daniel Vetter wrote:
> On Mon, Aug 24, 2015 at 04:52:13PM +0100, Daniel Stone wrote:
>> Hi Thomas,
>>
>> On 24 August 2015 at 10:50, Thomas Hellstrom <thomas@shipmail.org> wrote:
>>> In any case, Ideally I'd want the struct dma_buf_sync look something like
>>>
>>> enum dma_buf_sync_flags {
>>>         DMA_BUF_SYNC_READ = (1 << 0),
>>>         DMA_BUF_SYNC_WRITE = (2 << 0),
>>>         DMA_BUF_SYNC_RW = (3 << 0),
>>>         DMA_BUF_SYNC_START = (0 << 2),
>>>         DMA_BUF_SYNC_END = (1 << 2),
>>>
>>>         DMA_BUF_SYNC_VALID_FLAGS_MASK = DMA_BUF_SYNC_RW |
>>>                 DMA_BUF_SYNC_END
>>> };
>>>
>>>
>>> /* begin/end dma-buf functions used for userspace mmap. */
>>> struct dma_buf_sync {
>>>         enum dma_buf_sync_flags flags;
>>>         __u64 x;
>>>         __u64 width;
>>>
>>>         /* 2D Sync parameters */
>>>         __u64 stride;
>>>         __u64 y;
>>>         __u64 height;
>>>
>>> }
>>>
>>>
>>> So again, why use a 2D sync in the first place when it's probably possible to batch 1D syncs in the driver?
>>> It all boils down to the number of syscalls required, and also the fact that it's probably not trivial to batch sync calls in the read-write case. One can, again, argue that there might be 3D cases or other obscure sync cases,
>>> but in that case, let's implement them in the future if the overhead becomes unbearable. If the sync mechanism gets established as
>>> mandatory, the problem will solve itself. And at least now we cover the most common case and we don't need to add a new multiplexing
>>> mechanism in the sync IOCTL since we already have one in the IOCTL mechanism. We can simply add a new sync IOCTL for new obscure sync cases if desired.
>> I still don't think this ameliorates the need for batching: consider
>> the case where you update two disjoint screen regions and want them
>> both flushed. Either you issue two separate sync calls (which can be
>> disadvantageous performance-wise on some hardware / setups), or you
>> accumulate the regions and only flush later. So either two ioctls (one
>> in the style of dirtyfb and one to perform the sync/flush; you can
>> shortcut to assume the full buffer was damaged if the former is
>> missing), or one like this:
>>
>> struct dma_buf_sync_2d {
>>         enum dma_buf_sync_flags flags;
>>
>>         __u64 stride_bytes;
>>         __u32 bytes_per_pixel;
>>         __u32 num_regions;
>>
>>         struct {
>>                 __u64 x;
>>                 __u64 y;
>>                 __u64 width;
>>                 __u64 height;
>>         } regions[];
>> };
> I really feel like any kind of multi-range flush interface is feature
> bloat, and if we do it then we should only do it later on when there's a
> clear need for it.

IMO all the use-cases so far that wanted to do this have been 2D
updates, and having only a 1D sync will most probably scare people away
from this interface.

> Afaiui dma-buf mmap will mostly be used for up/downloads, which means
> there will be some explicit copy involved somewhere anyway. So similar to
> userptr usage. And for those most often you just slurp in a linear block
> and then let the blitter rearrange it, at least on i915.
>
> Also thus far the consensus was that dma-buf doesn't care/know about
> content layout, adding strides/bytes/whatever does bend this quite a bit.

I think with a 2D interface, the stride only applies to the sync
operation itself and is only a parameter for that sync operation.
Whether we should have multiple regions or not is not a big deal for me,
but I think at least a 2D sync is crucial.

>
> And finally if we concentrate on the simple-case of one byte range we can
> focus the discussion on correctness, which imo is the important bit here
> really.

If we only focus on correctness there are other ways to do this without
a sync: we can DMA based on accessed and dirty pages, but performance
will crawl, so IMO the sync interface is purely for performance...
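To see why page-granular tracking crawls: dirty information from the MMU is
only available per page, so even a tiny CPU write forces a whole page (or
more) to be transferred. A sketch of the cost (helper name is mine; 4 KiB
pages assumed in the usage, len > 0 assumed):

```c
#include <stdint.h>

/* Bytes that page-granular dirty tracking would have to transfer for a
 * CPU write of 'len' bytes (len > 0) starting at 'offset'. */
static uint64_t
page_flush_bytes(uint64_t offset, uint64_t len, uint64_t page_size)
{
        uint64_t first = offset / page_size;
        uint64_t last = (offset + len - 1) / page_size;

        return (last - first + 1) * page_size;
}
```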

> -Daniel

Thanks,
Thomas



* Re: about mmap dma-buf and sync
  2015-08-25  9:30           ` Thomas Hellstrom
@ 2015-08-25 16:14             ` Tiago Vignatti
  2015-08-26  0:02               ` [PATCH v4] dma-buf: Add ioctls to allow userspace to flush Tiago Vignatti
  0 siblings, 1 reply; 42+ messages in thread
From: Tiago Vignatti @ 2015-08-25 16:14 UTC (permalink / raw)
  To: dri-devel

On 08/25/2015 06:30 AM, Thomas Hellstrom wrote:
> On 08/25/2015 11:02 AM, Daniel Vetter wrote:
>> I really feel like any kind of multi-range flush interface is feature
>> bloat, and if we do it then we should only do it later on when there's a
>> clear need for it.
>
> IMO all the use-cases so far that wanted to do this have been 2D
> updates, and having only a 1D sync will most probably scare people away
> from this interface.
>
>> Afaiui dma-buf mmap will mostly be used for up/downloads, which means
>> there will be some explicit copy involved somewhere anyway. So similar to
>> userptr usage. And for those most often you just slurp in a linear block
>> and then let the blitter rearrange it, at least on i915.
>>
>> Also thus far the consensus was that dma-buf doesn't care/know about
>> content layout, adding strides/bytes/whatever does bend this quite a bit.
>
> I think with a 2D interface, the stride only applies to the sync
> operation itself and is only a parameter for that sync operation.
> Whether we should have multiple regions or not is not a big deal for me,
> but I think at least a 2D sync is crucial.

Right now only omap, ion and udl-fb make use of the begin{,end}_cpu_access() 
dma-buf interface, and curiously none of them uses the 1-d parameters 
(start and len). In that sense the 2-d additions do seem like they would 
feature bloat the API.

OTOH we're talking about a different usage of dma-buf now, so drivers 
might actually start using that API. That said, I thought it was fairly 
simple to turn the common code into 2-d because, as I pointed out in the 
other email, we'd be pushing the whole responsibility for dealing with 
the regions to the driver implementors.

Thomas, any comments on the new dma_buf_begin_cpu_access() API I showed? 
Maybe I should just clean up the draft and send it to the ML :)

Tiago



* [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-25 16:14             ` Tiago Vignatti
@ 2015-08-26  0:02               ` Tiago Vignatti
  2015-08-26  6:49                 ` Thomas Hellstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Tiago Vignatti @ 2015-08-26  0:02 UTC (permalink / raw)
  To: dri-devel
  Cc: daniel.thompson, daniel.vetter, intel-gfx, thellstrom, Daniel Vetter

From: Daniel Vetter <daniel.vetter@ffwll.ch>

The userspace might need some sort of cache coherency management e.g. when CPU
and GPU domains are being accessed through dma-buf at the same time. To
circumvent this problem there are begin/end coherency markers, that forward
directly to existing dma-buf device drivers vfunc hooks. Userspace can make use
of those markers through the DMA_BUF_IOCTL_SYNC ioctl. The sequence would be
used like following:

  - mmap dma-buf fd
  - for each drawing/upload cycle in CPU
    1. SYNC_START ioctl
    2. read/write to mmap area or a 2d sub-region of it
    3. SYNC_END ioctl.
  - munmap once you don't need the buffer any more

v2 (Tiago): Fix header file type names (u64 -> __u64)
v3 (Tiago): Add documentation. Use enum dma_buf_sync_flags to the begin/end
dma-buf functions. Check for overflows in start/length.
v4 (Tiago): use 2d regions for sync.

Cc: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Tiago Vignatti <tiago.vignatti@intel.com>
---

I'm unable to test the 2d sync properly, because begin/end access in i915
don't track the mapped range at all.

 Documentation/dma-buf-sharing.txt      | 13 ++++++
 drivers/dma-buf/dma-buf.c              | 77 ++++++++++++++++++++++++++++------
 drivers/gpu/drm/i915/i915_gem_dmabuf.c |  6 ++-
 include/linux/dma-buf.h                | 20 +++++----
 include/uapi/linux/dma-buf.h           | 57 +++++++++++++++++++++++++
 5 files changed, 150 insertions(+), 23 deletions(-)
 create mode 100644 include/uapi/linux/dma-buf.h

diff --git a/Documentation/dma-buf-sharing.txt b/Documentation/dma-buf-sharing.txt
index 480c8de..8061ac0 100644
--- a/Documentation/dma-buf-sharing.txt
+++ b/Documentation/dma-buf-sharing.txt
@@ -355,6 +355,19 @@ Being able to mmap an export dma-buf buffer object has 2 main use-cases:
 
    No special interfaces, userspace simply calls mmap on the dma-buf fd.
 
+   Also, the userspace might need some sort of cache coherency management e.g.
+   when CPU and GPU domains are being accessed through dma-buf at the same
+   time. To circumvent this problem there are begin/end coherency markers, that
+   forward directly to existing dma-buf device drivers vfunc hooks. Userspace
+   can make use of those markers through the DMA_BUF_IOCTL_SYNC ioctl. The
+   sequence would be used like following:
+     - mmap dma-buf fd
+     - for each drawing/upload cycle in CPU
+       1. SYNC_START ioctl
+       2. read/write to mmap area or a 2d sub-region of it
+       3. SYNC_END ioctl.
+     - munmap once you don't need the buffer any more
+
 2. Supporting existing mmap interfaces in importers
 
    Similar to the motivation for kernel cpu access it is again important that
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 155c146..b6a4a06 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -251,11 +251,55 @@ out:
 	return events;
 }
 
+static long dma_buf_ioctl(struct file *file,
+			  unsigned int cmd, unsigned long arg)
+{
+	struct dma_buf *dmabuf;
+	struct dma_buf_sync sync;
+	enum dma_data_direction direction;
+
+	dmabuf = file->private_data;
+
+	if (!is_dma_buf_file(file))
+		return -EINVAL;
+
+	if (cmd != DMA_BUF_IOCTL_SYNC)
+		return -ENOTTY;
+
+	if (copy_from_user(&sync, (void __user *) arg, sizeof(sync)))
+		return -EFAULT;
+
+	if (sync.flags & DMA_BUF_SYNC_RW)
+		direction = DMA_BIDIRECTIONAL;
+	else if (sync.flags & DMA_BUF_SYNC_READ)
+		direction = DMA_FROM_DEVICE;
+	else if (sync.flags & DMA_BUF_SYNC_WRITE)
+		direction = DMA_TO_DEVICE;
+	else
+		return -EINVAL;
+
+	if (sync.flags & ~DMA_BUF_SYNC_VALID_FLAGS_MASK)
+		return -EINVAL;
+
+	/* TODO: check for overflowing the buffer's size - how so, checking region by
+	 * region here? Maybe need to check for the other parameters as well. */
+
+	if (sync.flags & DMA_BUF_SYNC_END)
+		dma_buf_end_cpu_access(dmabuf, sync.stride_bytes, sync.bytes_per_pixel,
+				sync.num_regions, sync.regions, direction);
+	else
+		dma_buf_begin_cpu_access(dmabuf, sync.stride_bytes, sync.bytes_per_pixel,
+				sync.num_regions, sync.regions, direction);
+
+	return 0;
+}
+
 static const struct file_operations dma_buf_fops = {
 	.release	= dma_buf_release,
 	.mmap		= dma_buf_mmap_internal,
 	.llseek		= dma_buf_llseek,
 	.poll		= dma_buf_poll,
+	.unlocked_ioctl	= dma_buf_ioctl,
 };
 
 /*
@@ -539,14 +583,17 @@ EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
  * preparations. Coherency is only guaranteed in the specified range for the
  * specified access direction.
  * @dmabuf:	[in]	buffer to prepare cpu access for.
- * @start:	[in]	start of range for cpu access.
- * @len:	[in]	length of range for cpu access.
- * @direction:	[in]	length of range for cpu access.
+ * @stride_bytes:	[in]	stride in bytes for cpu access.
+ * @bytes_per_pixel:	[in]	bytes per pixel of the region for cpu access.
+ * @num_regions:   [in]  number of regions.
+ * @regions:   [in] vector of 2-dimensional regions for cpu access.
+ * @direction:	[in]	direction of range for cpu access.
  *
  * Can return negative error values, returns 0 on success.
  */
-int dma_buf_begin_cpu_access(struct dma_buf *dmabuf, size_t start, size_t len,
-			     enum dma_data_direction direction)
+int dma_buf_begin_cpu_access(struct dma_buf *dmabuf, size_t stride_bytes,
+	size_t bytes_per_pixel, size_t num_regions, struct dma_buf_sync_region regions[],
+	enum dma_data_direction direction)
 {
 	int ret = 0;
 
@@ -554,8 +601,8 @@ int dma_buf_begin_cpu_access(struct dma_buf *dmabuf, size_t start, size_t len,
 		return -EINVAL;
 
 	if (dmabuf->ops->begin_cpu_access)
-		ret = dmabuf->ops->begin_cpu_access(dmabuf, start,
-							len, direction);
+		ret = dmabuf->ops->begin_cpu_access(dmabuf, stride_bytes, bytes_per_pixel,
+							num_regions, regions, direction);
 
 	return ret;
 }
@@ -567,19 +614,23 @@ EXPORT_SYMBOL_GPL(dma_buf_begin_cpu_access);
  * actions. Coherency is only guaranteed in the specified range for the
  * specified access direction.
  * @dmabuf:	[in]	buffer to complete cpu access for.
- * @start:	[in]	start of range for cpu access.
- * @len:	[in]	length of range for cpu access.
- * @direction:	[in]	length of range for cpu access.
+ * @stride_bytes:	[in]	stride in bytes for cpu access.
+ * @bytes_per_pixel:	[in]	bytes per pixel of the region for cpu access.
+ * @num_regions:   [in]  number of regions.
+ * @regions:   [in]  vector of 2-dimensional regions for cpu access.
+ * @direction:	[in]	direction of range for cpu access.
  *
  * This call must always succeed.
  */
-void dma_buf_end_cpu_access(struct dma_buf *dmabuf, size_t start, size_t len,
-			    enum dma_data_direction direction)
+void dma_buf_end_cpu_access(struct dma_buf *dmabuf, size_t stride_bytes,
+	size_t bytes_per_pixel, size_t num_regions, struct dma_buf_sync_region regions[],
+	enum dma_data_direction direction)
 {
 	WARN_ON(!dmabuf);
 
 	if (dmabuf->ops->end_cpu_access)
-		dmabuf->ops->end_cpu_access(dmabuf, start, len, direction);
+		dmabuf->ops->end_cpu_access(dmabuf, stride_bytes, bytes_per_pixel,
+			num_regions, regions, direction);
 }
 EXPORT_SYMBOL_GPL(dma_buf_end_cpu_access);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/i915_gem_dmabuf.c
index 95cbfff..e5bb7a3 100644
--- a/drivers/gpu/drm/i915/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/i915_gem_dmabuf.c
@@ -212,7 +212,8 @@ static int i915_gem_dmabuf_mmap(struct dma_buf *dma_buf, struct vm_area_struct *
 	return 0;
 }
 
-static int i915_gem_begin_cpu_access(struct dma_buf *dma_buf, size_t start, size_t length, enum dma_data_direction direction)
+static int i915_gem_begin_cpu_access(struct dma_buf *dma_buf, size_t stride_bytes, size_t bytes_per_pixel,
+		size_t num_regions, struct dma_buf_sync_region regions[], enum dma_data_direction direction)
 {
 	struct drm_i915_gem_object *obj = dma_buf_to_obj(dma_buf);
 	struct drm_device *dev = obj->base.dev;
@@ -228,7 +229,8 @@ static int i915_gem_begin_cpu_access(struct dma_buf *dma_buf, size_t start, size
 	return ret;
 }
 
-static void i915_gem_end_cpu_access(struct dma_buf *dma_buf, size_t start, size_t length, enum dma_data_direction direction)
+static void i915_gem_end_cpu_access(struct dma_buf *dma_buf, size_t stride_bytes, size_t bytes_per_pixel,
+		size_t num_regions, struct dma_buf_sync_region regions[], enum dma_data_direction direction)
 {
 	struct drm_i915_gem_object *obj = dma_buf_to_obj(dma_buf);
 	struct drm_device *dev = obj->base.dev;
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index f98bd70..ed457cb 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -33,6 +33,8 @@
 #include <linux/fence.h>
 #include <linux/wait.h>
 
+#include <uapi/linux/dma-buf.h>
+
 struct device;
 struct dma_buf;
 struct dma_buf_attachment;
@@ -93,10 +95,10 @@ struct dma_buf_ops {
 	/* after final dma_buf_put() */
 	void (*release)(struct dma_buf *);
 
-	int (*begin_cpu_access)(struct dma_buf *, size_t, size_t,
-				enum dma_data_direction);
-	void (*end_cpu_access)(struct dma_buf *, size_t, size_t,
-			       enum dma_data_direction);
+	int (*begin_cpu_access)(struct dma_buf *, size_t, size_t, size_t,
+				struct dma_buf_sync_region [], enum dma_data_direction);
+	void (*end_cpu_access)(struct dma_buf *, size_t, size_t, size_t,
+			       struct dma_buf_sync_region [], enum dma_data_direction);
 	void *(*kmap_atomic)(struct dma_buf *, unsigned long);
 	void (*kunmap_atomic)(struct dma_buf *, unsigned long, void *);
 	void *(*kmap)(struct dma_buf *, unsigned long);
@@ -224,10 +226,12 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *,
 					enum dma_data_direction);
 void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
 				enum dma_data_direction);
-int dma_buf_begin_cpu_access(struct dma_buf *dma_buf, size_t start, size_t len,
-			     enum dma_data_direction dir);
-void dma_buf_end_cpu_access(struct dma_buf *dma_buf, size_t start, size_t len,
-			    enum dma_data_direction dir);
+int dma_buf_begin_cpu_access(struct dma_buf *dma_buf, size_t stride_bytes,
+			     size_t bytes_per_pixel, size_t num_regions,
+			     struct dma_buf_sync_region regions[], enum dma_data_direction dir);
+void dma_buf_end_cpu_access(struct dma_buf *dma_buf, size_t stride_bytes,
+			     size_t bytes_per_pixel, size_t num_regions,
+			     struct dma_buf_sync_region regions[], enum dma_data_direction dir);
 void *dma_buf_kmap_atomic(struct dma_buf *, unsigned long);
 void dma_buf_kunmap_atomic(struct dma_buf *, unsigned long, void *);
 void *dma_buf_kmap(struct dma_buf *, unsigned long);
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
new file mode 100644
index 0000000..c63b578
--- /dev/null
+++ b/include/uapi/linux/dma-buf.h
@@ -0,0 +1,57 @@
+/*
+ * Framework for buffer objects that can be shared across devices/subsystems.
+ *
+ * Copyright(C) 2015 Intel Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _DMA_BUF_UAPI_H_
+#define _DMA_BUF_UAPI_H_
+
+enum dma_buf_sync_flags {
+	DMA_BUF_SYNC_READ = (1 << 0),
+	DMA_BUF_SYNC_WRITE = (2 << 0),
+	DMA_BUF_SYNC_RW = (3 << 0),
+	DMA_BUF_SYNC_START = (0 << 2),
+	DMA_BUF_SYNC_END = (1 << 2),
+
+	DMA_BUF_SYNC_VALID_FLAGS_MASK = DMA_BUF_SYNC_RW |
+		DMA_BUF_SYNC_END
+};
+
+/* 2-dimensional region, used for multi-range flush. This can be used to
+ * synchronize the CPU by batching several sub-regions, smaller than the
+ * mapped dma-buf, all at once. */
+struct dma_buf_sync_region {
+	__u64 x;
+	__u64 y;
+	__u64 width;
+	__u64 height;
+};
+
+/* begin/end dma-buf functions used for userspace mmap. */
+struct dma_buf_sync {
+	enum dma_buf_sync_flags flags;
+
+	__u64 stride_bytes;
+	__u32 bytes_per_pixel;
+	__u32 num_regions;
+
+	struct dma_buf_sync_region regions[];
+};
+
+#define DMA_BUF_BASE		'b'
+#define DMA_BUF_IOCTL_SYNC	_IOWR(DMA_BUF_BASE, 0, struct dma_buf_sync)
+
+#endif
-- 
2.1.0



* Re: [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-26  0:02               ` [PATCH v4] dma-buf: Add ioctls to allow userspace to flush Tiago Vignatti
@ 2015-08-26  6:49                 ` Thomas Hellstrom
  2015-08-26 12:10                   ` Daniel Vetter
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-26  6:49 UTC (permalink / raw)
  To: Tiago Vignatti
  Cc: daniel.thompson, daniel.vetter, intel-gfx, thellstrom, dri-devel,
	Daniel Vetter

Hi, Tiago.

On 08/26/2015 02:02 AM, Tiago Vignatti wrote:
> From: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> The userspace might need some sort of cache coherency management e.g. when CPU
> and GPU domains are being accessed through dma-buf at the same time. To
> circumvent this problem there are begin/end coherency markers, that forward
> directly to existing dma-buf device drivers vfunc hooks. Userspace can make use
> of those markers through the DMA_BUF_IOCTL_SYNC ioctl. The sequence would be
> used like following:
>
>   - mmap dma-buf fd
>   - for each drawing/upload cycle in CPU
>     1. SYNC_START ioctl
>     2. read/write to mmap area or a 2d sub-region of it
>     3. SYNC_END ioctl.
> >   - munmap once you don't need the buffer any more
>
> v2 (Tiago): Fix header file type names (u64 -> __u64)
> v3 (Tiago): Add documentation. Use enum dma_buf_sync_flags to the begin/end
> dma-buf functions. Check for overflows in start/length.
> v4 (Tiago): use 2d regions for sync.

Daniel V had issues with the sync argument proposed by Daniel S. I'm
fine with that argument or an argument containing only a single sync
rect. I'm not sure whether Daniel V will find it easier to accept only a
single sync rect...

Also please see below.

>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Signed-off-by: Tiago Vignatti <tiago.vignatti@intel.com>
> ---
>
> I'm unable to test the 2d sync properly, because begin/end access in i915
> don't track the mapped range at all.
>
>  Documentation/dma-buf-sharing.txt      | 13 ++++++
>  drivers/dma-buf/dma-buf.c              | 77 ++++++++++++++++++++++++++++------
>  drivers/gpu/drm/i915/i915_gem_dmabuf.c |  6 ++-
>  include/linux/dma-buf.h                | 20 +++++----
>  include/uapi/linux/dma-buf.h           | 57 +++++++++++++++++++++++++
>  5 files changed, 150 insertions(+), 23 deletions(-)
>  create mode 100644 include/uapi/linux/dma-buf.h
>
> diff --git a/Documentation/dma-buf-sharing.txt b/Documentation/dma-buf-sharing.txt
> index 480c8de..8061ac0 100644
> --- a/Documentation/dma-buf-sharing.txt
> +++ b/Documentation/dma-buf-sharing.txt
> @@ -355,6 +355,19 @@ Being able to mmap an export dma-buf buffer object has 2 main use-cases:
>  
>     No special interfaces, userspace simply calls mmap on the dma-buf fd.
>  
> +   Also, the userspace might need some sort of cache coherency management e.g.
> +   when CPU and GPU domains are being accessed through dma-buf at the same
> +   time. To circumvent this problem there are begin/end coherency markers, that
> +   forward directly to existing dma-buf device drivers vfunc hooks. Userspace
> +   can make use of those markers through the DMA_BUF_IOCTL_SYNC ioctl. The
> +   sequence would be used like following:
> +     - mmap dma-buf fd
> +     - for each drawing/upload cycle in CPU
> +       1. SYNC_START ioctl
> +       2. read/write to mmap area or a 2d sub-region of it
> +       3. SYNC_END ioctl.
> +     - munmap once you don't need the buffer any more
> +
>  2. Supporting existing mmap interfaces in importers
>  
>     Similar to the motivation for kernel cpu access it is again important that
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 155c146..b6a4a06 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -251,11 +251,55 @@ out:
>  	return events;
>  }
>  
> +static long dma_buf_ioctl(struct file *file,
> +			  unsigned int cmd, unsigned long arg)
> +{
> +	struct dma_buf *dmabuf;
> +	struct dma_buf_sync sync;
> +	enum dma_data_direction direction;
> +
> +	dmabuf = file->private_data;
> +
> +	if (!is_dma_buf_file(file))
> +		return -EINVAL;
> +
> +	if (cmd != DMA_BUF_IOCTL_SYNC)
> +		return -ENOTTY;
> +
> +	if (copy_from_user(&sync, (void __user *) arg, sizeof(sync)))
> +		return -EFAULT;

Do we actually copy all sync regions here?

> +
> +	if (sync.flags & DMA_BUF_SYNC_RW)
> +		direction = DMA_BIDIRECTIONAL;
> +	else if (sync.flags & DMA_BUF_SYNC_READ)
> +		direction = DMA_FROM_DEVICE;
> +	else if (sync.flags & DMA_BUF_SYNC_WRITE)
> +		direction = DMA_TO_DEVICE;
> +	else
> +		return -EINVAL;
> +
> +	if (sync.flags & ~DMA_BUF_SYNC_VALID_FLAGS_MASK)
> +		return -EINVAL;
> +
> +	/* TODO: check for overflowing the buffer's size - how so, checking region by
> +	 * region here? Maybe need to check for the other parameters as well. */
> +
> +	if (sync.flags & DMA_BUF_SYNC_END)
> +		dma_buf_end_cpu_access(dmabuf, sync.stride_bytes, sync.bytes_per_pixel,
> +				sync.num_regions, sync.regions, direction);
> +	else
> +		dma_buf_begin_cpu_access(dmabuf, sync.stride_bytes, sync.bytes_per_pixel,
> +				sync.num_regions, sync.regions, direction);
> +
> +	return 0;
> +}
> +
>  static const struct file_operations dma_buf_fops = {
>  	.release	= dma_buf_release,
>  	.mmap		= dma_buf_mmap_internal,
>  	.llseek		= dma_buf_llseek,
>  	.poll		= dma_buf_poll,
> +	.unlocked_ioctl	= dma_buf_ioctl,
>  };
>  
>  /*
> @@ -539,14 +583,17 @@ EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
>   * preparations. Coherency is only guaranteed in the specified range for the
>   * specified access direction.
>   * @dmabuf:	[in]	buffer to prepare cpu access for.
> - * @start:	[in]	start of range for cpu access.
> - * @len:	[in]	length of range for cpu access.
> - * @direction:	[in]	length of range for cpu access.
> + * @stride_bytes:	[in]	stride in bytes for cpu access.
> + * @bytes_per_pixel:	[in]	bytes per pixel of the region for cpu access.
> + * @num_regions:   [in]  number of regions.
> + * @region:   [in] vector of 2-dimensional regions for cpu access.
> + * @direction:	[in]	direction of range for cpu access.
>   *
>   * Can return negative error values, returns 0 on success.
>   */
> -int dma_buf_begin_cpu_access(struct dma_buf *dmabuf, size_t start, size_t len,
> -			     enum dma_data_direction direction)
> +int dma_buf_begin_cpu_access(struct dma_buf *dmabuf, size_t stride_bytes,
> +	size_t bytes_per_pixel, size_t num_regions, struct dma_buf_sync_region regions[],
> +	enum dma_data_direction direction)
>  {
>  	int ret = 0;
>  
> @@ -554,8 +601,8 @@ int dma_buf_begin_cpu_access(struct dma_buf *dmabuf, size_t start, size_t len,
>  		return -EINVAL;
>  
>  	if (dmabuf->ops->begin_cpu_access)
> -		ret = dmabuf->ops->begin_cpu_access(dmabuf, start,
> -							len, direction);
> +		ret = dmabuf->ops->begin_cpu_access(dmabuf, stride_bytes, bytes_per_pixel,
> +							num_regions, regions, direction);
>  
>  	return ret;
>  }
> @@ -567,19 +614,23 @@ EXPORT_SYMBOL_GPL(dma_buf_begin_cpu_access);
>   * actions. Coherency is only guaranteed in the specified range for the
>   * specified access direction.
>   * @dmabuf:	[in]	buffer to complete cpu access for.
> - * @start:	[in]	start of range for cpu access.
> - * @len:	[in]	length of range for cpu access.
> - * @direction:	[in]	length of range for cpu access.
> + * @stride_bytes:	[in]	stride in bytes for cpu access.
> + * @bytes_per_pixel:	[in]	bytes per pixel of the region for cpu access.
> + * @num_regions:   [in]  number of regions.
> + * @regions:   [in]  vector of 2-dimensional regions for cpu access.
> + * @direction:	[in]	direction of range for cpu access.
>   *
>   * This call must always succeed.
>   */
> -void dma_buf_end_cpu_access(struct dma_buf *dmabuf, size_t start, size_t len,
> -			    enum dma_data_direction direction)
> +void dma_buf_end_cpu_access(struct dma_buf *dmabuf, size_t stride_bytes,
> +	size_t bytes_per_pixel, size_t num_regions, struct dma_buf_sync_region regions[],
> +	enum dma_data_direction direction)
>  {
>  	WARN_ON(!dmabuf);
>  
>  	if (dmabuf->ops->end_cpu_access)
> -		dmabuf->ops->end_cpu_access(dmabuf, start, len, direction);
> +		dmabuf->ops->end_cpu_access(dmabuf, stride_bytes, bytes_per_pixel,
> +			num_regions, regions, direction);
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_end_cpu_access);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/i915_gem_dmabuf.c
> index 95cbfff..e5bb7a3 100644
> --- a/drivers/gpu/drm/i915/i915_gem_dmabuf.c
> +++ b/drivers/gpu/drm/i915/i915_gem_dmabuf.c
> @@ -212,7 +212,8 @@ static int i915_gem_dmabuf_mmap(struct dma_buf *dma_buf, struct vm_area_struct *
>  	return 0;
>  }
>  
> -static int i915_gem_begin_cpu_access(struct dma_buf *dma_buf, size_t start, size_t length, enum dma_data_direction direction)
> +static int i915_gem_begin_cpu_access(struct dma_buf *dma_buf, size_t stride_bytes, size_t bytes_per_pixel,
> +		size_t num_regions, struct dma_buf_sync_region regions[], enum dma_data_direction direction)
>  {
>  	struct drm_i915_gem_object *obj = dma_buf_to_obj(dma_buf);
>  	struct drm_device *dev = obj->base.dev;
> @@ -228,7 +229,8 @@ static int i915_gem_begin_cpu_access(struct dma_buf *dma_buf, size_t start, size
>  	return ret;
>  }
>  
> -static void i915_gem_end_cpu_access(struct dma_buf *dma_buf, size_t start, size_t length, enum dma_data_direction direction)
> +static void i915_gem_end_cpu_access(struct dma_buf *dma_buf, size_t stride_bytes, size_t bytes_per_pixel,
> +		size_t num_regions, struct dma_buf_sync_region regions[], enum dma_data_direction direction)
>  {
>  	struct drm_i915_gem_object *obj = dma_buf_to_obj(dma_buf);
>  	struct drm_device *dev = obj->base.dev;
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index f98bd70..ed457cb 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -33,6 +33,8 @@
>  #include <linux/fence.h>
>  #include <linux/wait.h>
>  
> +#include <uapi/linux/dma-buf.h>
> +
>  struct device;
>  struct dma_buf;
>  struct dma_buf_attachment;
> @@ -93,10 +95,10 @@ struct dma_buf_ops {
>  	/* after final dma_buf_put() */
>  	void (*release)(struct dma_buf *);
>  
> -	int (*begin_cpu_access)(struct dma_buf *, size_t, size_t,
> -				enum dma_data_direction);
> -	void (*end_cpu_access)(struct dma_buf *, size_t, size_t,
> -			       enum dma_data_direction);
> +	int (*begin_cpu_access)(struct dma_buf *, size_t, size_t, size_t,
> +				struct dma_buf_sync_region [], enum dma_data_direction);
> +	void (*end_cpu_access)(struct dma_buf *, size_t, size_t, size_t,
> +			       struct dma_buf_sync_region [], enum dma_data_direction);
>  	void *(*kmap_atomic)(struct dma_buf *, unsigned long);
>  	void (*kunmap_atomic)(struct dma_buf *, unsigned long, void *);
>  	void *(*kmap)(struct dma_buf *, unsigned long);
> @@ -224,10 +226,12 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *,
>  					enum dma_data_direction);
>  void dma_buf_unmap_attachment(struct dma_buf_attachment *, struct sg_table *,
>  				enum dma_data_direction);
> -int dma_buf_begin_cpu_access(struct dma_buf *dma_buf, size_t start, size_t len,
> -			     enum dma_data_direction dir);
> -void dma_buf_end_cpu_access(struct dma_buf *dma_buf, size_t start, size_t len,
> -			    enum dma_data_direction dir);
> +int dma_buf_begin_cpu_access(struct dma_buf *dma_buf, size_t stride_bytes,
> +			     size_t bytes_per_pixel, size_t num_regions,
> +			     struct dma_buf_sync_region regions[], enum dma_data_direction dir);
> +void dma_buf_end_cpu_access(struct dma_buf *dma_buf, size_t stride_bytes,
> +			     size_t bytes_per_pixel, size_t num_regions,
> +			     struct dma_buf_sync_region regions[], enum dma_data_direction dir);
>  void *dma_buf_kmap_atomic(struct dma_buf *, unsigned long);
>  void dma_buf_kunmap_atomic(struct dma_buf *, unsigned long, void *);
>  void *dma_buf_kmap(struct dma_buf *, unsigned long);
> diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
> new file mode 100644
> index 0000000..c63b578
> --- /dev/null
> +++ b/include/uapi/linux/dma-buf.h
> @@ -0,0 +1,57 @@
> +/*
> + * Framework for buffer objects that can be shared across devices/subsystems.
> + *
> + * Copyright(C) 2015 Intel Ltd
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef _DMA_BUF_UAPI_H_
> +#define _DMA_BUF_UAPI_H_
> +
> +enum dma_buf_sync_flags {
> +	DMA_BUF_SYNC_READ = (1 << 0),
> +	DMA_BUF_SYNC_WRITE = (2 << 0),
> +	DMA_BUF_SYNC_RW = (3 << 0),
> +	DMA_BUF_SYNC_START = (0 << 2),
> +	DMA_BUF_SYNC_END = (1 << 2),
> +
> +	DMA_BUF_SYNC_VALID_FLAGS_MASK = DMA_BUF_SYNC_RW |
> +		DMA_BUF_SYNC_END
> +};
> +
> +/* 2-dimensional region, used for multi-range flush. This can be used to
> + * synchronize the CPU by batching several sub-regions, smaller than the
> + * mapped dma-buf, all at once. */
> +struct dma_buf_sync_region {
> +	__u64 x;
> +	__u64 y;
> +	__u64 width;
> +	__u64 height;
> +};
> +
> +/* begin/end dma-buf functions used for userspace mmap. */
> +struct dma_buf_sync {
> +	enum dma_buf_sync_flags flags;
> +
> +	__u64 stride_bytes;
> +	__u32 bytes_per_pixel;
> +	__u32 num_regions;
> +
> +	struct dma_buf_sync_region regions[];
> +};
> +
> +#define DMA_BUF_BASE		'b'
> +#define DMA_BUF_IOCTL_SYNC	_IOWR(DMA_BUF_BASE, 0, struct dma_buf_sync)

Should be _IOW(  ?


> +
> +#endif

Thomas


* Re: [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-26  6:49                 ` Thomas Hellstrom
@ 2015-08-26 12:10                   ` Daniel Vetter
  2015-08-26 12:28                     ` Thomas Hellstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Vetter @ 2015-08-26 12:10 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: daniel.thompson, intel-gfx, dri-devel, daniel.vetter, Daniel Vetter

On Wed, Aug 26, 2015 at 08:49:00AM +0200, Thomas Hellstrom wrote:
> Hi, Tiago.
> 
> On 08/26/2015 02:02 AM, Tiago Vignatti wrote:
> > From: Daniel Vetter <daniel.vetter@ffwll.ch>
> >
> > The userspace might need some sort of cache coherency management e.g. when CPU
> > and GPU domains are being accessed through dma-buf at the same time. To
> > circumvent this problem there are begin/end coherency markers, that forward
> > directly to existing dma-buf device drivers vfunc hooks. Userspace can make use
> > of those markers through the DMA_BUF_IOCTL_SYNC ioctl. The sequence would be
> > used like following:
> >
> >   - mmap dma-buf fd
> >   - for each drawing/upload cycle in CPU
> >     1. SYNC_START ioctl
> >     2. read/write to mmap area or a 2d sub-region of it
> >     3. SYNC_END ioctl.
> >   - munmap once you don't need the buffer any more
> >
> > v2 (Tiago): Fix header file type names (u64 -> __u64)
> > v3 (Tiago): Add documentation. Use enum dma_buf_sync_flags to the begin/end
> > dma-buf functions. Check for overflows in start/length.
> > v4 (Tiago): use 2d regions for sync.
> 
> Daniel V had issues with the sync argument proposed by Daniel S. I'm
> fine with that argument or an argument containing only a single sync
> rect. I'm not sure whether Daniel V will find it easier to accept only a
> single sync rect...

I'm kinda against all the 2d rect sync proposals ;-) At least for the
current stuff it's all about linear subranges afaik, and even there we
don't bother with flushing them precisely right now.

My expectation would be that if you _really_ want to eke out that last
bit of performance with a list of 2d sync ranges then probably you want to
do the cpu cache flushing in userspace anyway, with 100% machine-specific
trickery.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-26 12:10                   ` Daniel Vetter
@ 2015-08-26 12:28                     ` Thomas Hellstrom
  2015-08-26 12:58                       ` Daniel Vetter
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-26 12:28 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: daniel.thompson, intel-gfx, dri-devel, daniel.vetter, Daniel Vetter

On 08/26/2015 02:10 PM, Daniel Vetter wrote:
> On Wed, Aug 26, 2015 at 08:49:00AM +0200, Thomas Hellstrom wrote:
>> Hi, Tiago.
>>
>> On 08/26/2015 02:02 AM, Tiago Vignatti wrote:
>>> From: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>
>>> The userspace might need some sort of cache coherency management e.g. when CPU
>>> and GPU domains are being accessed through dma-buf at the same time. To
>>> circumvent this problem there are begin/end coherency markers, that forward
>>> directly to existing dma-buf device drivers vfunc hooks. Userspace can make use
>>> of those markers through the DMA_BUF_IOCTL_SYNC ioctl. The sequence would be
>>> used like following:
>>>
>>>   - mmap dma-buf fd
>>>   - for each drawing/upload cycle in CPU
>>>     1. SYNC_START ioctl
>>>     2. read/write to mmap area or a 2d sub-region of it
>>>     3. SYNC_END ioctl.
>>>   - munmap once you don't need the buffer any more
>>>
>>> v2 (Tiago): Fix header file type names (u64 -> __u64)
>>> v3 (Tiago): Add documentation. Use enum dma_buf_sync_flags to the begin/end
>>> dma-buf functions. Check for overflows in start/length.
>>> v4 (Tiago): use 2d regions for sync.
>> Daniel V had issues with the sync argument proposed by Daniel S. I'm
>> fine with that argument or an argument containing only a single sync
>> rect. I'm not sure whether Daniel V will find it easier to accept only a
>> single sync rect...
> I'm kinda against all the 2d rect sync proposals ;-) At least for the
> current stuff it's all about linear subranges afaik, and even there we
> don't bother with flushing them precisely right now.
>
> My expectation would be that if you _really_ want to eke out that last
> bit of performance with a list of 2d sync ranges then probably you want to
> do the cpu cache flushing in userspace anyway, with 100% machine-specific
> trickery.

Daniel,

I might be misunderstanding things, but isn't this about finally
accepting a dma-buf mmap() generic interface for people who want to use
it for zero-copy applications (like people have been trying to do for
years but never bothered to specify an interface that actually performed
on incoherent hardware)?

If it's only about exposing the kernel 1D sync interface to user-space
for correctness, then why isn't that done transparently to the user?

Thanks,
Thomas


> -Daniel



* Re: [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-26 12:28                     ` Thomas Hellstrom
@ 2015-08-26 12:58                       ` Daniel Vetter
  2015-08-26 14:32                         ` Tiago Vignatti
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Vetter @ 2015-08-26 12:58 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: daniel.thompson, intel-gfx, dri-devel, daniel.vetter, Daniel Vetter

On Wed, Aug 26, 2015 at 02:28:30PM +0200, Thomas Hellstrom wrote:
> On 08/26/2015 02:10 PM, Daniel Vetter wrote:
> > On Wed, Aug 26, 2015 at 08:49:00AM +0200, Thomas Hellstrom wrote:
> >> Hi, Tiago.
> >>
> >> On 08/26/2015 02:02 AM, Tiago Vignatti wrote:
> >>> From: Daniel Vetter <daniel.vetter@ffwll.ch>
> >>>
> >>> The userspace might need some sort of cache coherency management e.g. when CPU
> >>> and GPU domains are being accessed through dma-buf at the same time. To
> >>> circumvent this problem there are begin/end coherency markers, that forward
> >>> directly to existing dma-buf device drivers vfunc hooks. Userspace can make use
> >>> of those markers through the DMA_BUF_IOCTL_SYNC ioctl. The sequence would be
> >>> used like following:
> >>>
> >>>   - mmap dma-buf fd
> >>>   - for each drawing/upload cycle in CPU
> >>>     1. SYNC_START ioctl
> >>>     2. read/write to mmap area or a 2d sub-region of it
> >>>     3. SYNC_END ioctl.
> >>>   - munmap once you don't need the buffer any more
> >>>
> >>> v2 (Tiago): Fix header file type names (u64 -> __u64)
> >>> v3 (Tiago): Add documentation. Use enum dma_buf_sync_flags to the begin/end
> >>> dma-buf functions. Check for overflows in start/length.
> >>> v4 (Tiago): use 2d regions for sync.
> >> Daniel V had issues with the sync argument proposed by Daniel S. I'm
> >> fine with that argument or an argument containing only a single sync
> >> rect. I'm not sure whether Daniel V will find it easier to accept only a
> >> single sync rect...
> > I'm kinda against all the 2d rect sync proposals ;-) At least for the
> > current stuff it's all about linear subranges afaik, and even there we
> > don't bother with flushing them precisely right now.
> >
> > My expectation would be that if you _really_ want to eke out that last
> > bit of performance with a list of 2d sync ranges then probably you want to
> > do the cpu cache flushing in userspace anyway, with 100% machine-specific
> > trickery.
> 
> Daniel,
> 
> I might be misunderstanding things, but isn't this about finally
> accepting a dma-buf mmap() generic interface for people who want to use
> it for zero-copy applications (like people have been trying to do for
> years but never bothered to specify an interface that actually performed
> on incoherent hardware)?
> 
> If it's only about exposing the kernel 1D sync interface to user-space
> for correctness, then why isn't that done transparently to the user?

Mostly pragmatic reasons - we could do the page-fault trickery, but that
means i915 needs another mmap implementation. At least I didn't figure out
how to do faulting in a completely generic way. And we already have 3
other mmap implementations so I prefer not to do that.

The other is that right now there's no user nor implementation in sight
which actually does range-based flush optimizations, so I'm pretty much
expecting we'll get it wrong. Maybe instead we should go one step further
and remove the range from the internal dma-buf interface and also drop it
from the ioctl? With the flags we can always add something later on once
we have a real user with a clear need for it. But afaik cros only wants to
shuffle around entire tiles and has a buffer-per-tile approach.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-26 12:58                       ` Daniel Vetter
@ 2015-08-26 14:32                         ` Tiago Vignatti
  2015-08-26 14:50                           ` Thomas Hellstrom
  2015-08-26 14:51                           ` Daniel Vetter
  0 siblings, 2 replies; 42+ messages in thread
From: Tiago Vignatti @ 2015-08-26 14:32 UTC (permalink / raw)
  To: Daniel Vetter, Thomas Hellstrom
  Cc: daniel.vetter, daniel.thompson, intel-gfx, dri-devel, Daniel Vetter

On 08/26/2015 09:58 AM, Daniel Vetter wrote:
> The other is that right now there's no user nor implementation in sight
> which actually does range-based flush optimizations, so I'm pretty much
> expecting we'll get it wrong. Maybe instead we should go one step further
> and remove the range from the internal dma-buf interface and also drop it
> from the ioctl? With the flags we can always add something later on once
> we have a real user with a clear need for it. But afaik cros only wants to
> shuffle around entire tiles and has a buffer-per-tile approach.

Thomas, I think Daniel has a point here and also, I wouldn't mind 
removing all range control from the dma-buf ioctl either.

In this last patch we can see that it's not complicated to add the
2d-sync if we eventually decide we want it. But right now there's no
way we can test it. Therefore I'm all in favour of doing this work
gradually, adding the basics first.

Tiago


* Re: [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-26 14:32                         ` Tiago Vignatti
@ 2015-08-26 14:50                           ` Thomas Hellstrom
  2015-08-26 14:51                           ` Daniel Vetter
  1 sibling, 0 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-26 14:50 UTC (permalink / raw)
  To: Tiago Vignatti
  Cc: daniel.thompson, daniel.vetter, intel-gfx, dri-devel, Daniel Vetter

On 08/26/2015 04:32 PM, Tiago Vignatti wrote:
> On 08/26/2015 09:58 AM, Daniel Vetter wrote:
>> The other is that right now there's no user nor implementation in sight
>> which actually does range-based flush optimizations, so I'm pretty much
>> expecting we'll get it wrong. Maybe instead we should go one step
>> further
>> and remove the range from the internal dma-buf interface and also
>> drop it
>> from the ioctl? With the flags we can always add something later on once
>> we have a real user with a clear need for it. But afaik cros only
>> wants to
>> shuffle around entire tiles and has a buffer-per-tile approach.
>
> Thomas, I think Daniel has a point here and also, I wouldn't mind
> removing all range control from the dma-buf ioctl either.
>
> In this last patch we can see that it's not complicated to add the
> 2d-sync if we eventually decides to want it. But right now there's no
> way we can test it. Therefore in that case I'm all in favour of doing
> this work gradually, adding the basics first.
>

Sure. Unfortunately I wasn't completely clear about the use-case here.
IMO adding user-space sync functionality should be purely for performance.

/Thomas




* Re: [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-26 14:32                         ` Tiago Vignatti
  2015-08-26 14:50                           ` Thomas Hellstrom
@ 2015-08-26 14:51                           ` Daniel Vetter
  2015-08-26 14:58                             ` Tiago Vignatti
  1 sibling, 1 reply; 42+ messages in thread
From: Daniel Vetter @ 2015-08-26 14:51 UTC (permalink / raw)
  To: Tiago Vignatti
  Cc: Thomas Hellstrom, daniel.vetter, intel-gfx, daniel.thompson,
	dri-devel, Daniel Vetter

On Wed, Aug 26, 2015 at 11:32:30AM -0300, Tiago Vignatti wrote:
> On 08/26/2015 09:58 AM, Daniel Vetter wrote:
> >The other is that right now there's no user nor implementation in sight
> >which actually does range-based flush optimizations, so I'm pretty much
> >expecting we'll get it wrong. Maybe instead we should go one step further
> >and remove the range from the internal dma-buf interface and also drop it
> >from the ioctl? With the flags we can always add something later on once
> >we have a real user with a clear need for it. But afaik cros only wants to
> >shuffle around entire tiles and has a buffer-per-tile approach.
> 
> Thomas, I think Daniel has a point here and also, I wouldn't mind removing
> all range control from the dma-buf ioctl either.

if we go with nuking it from the ioctl I'd suggest to also nuke it from
the dma-buf internal interface first too.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-26 14:51                           ` Daniel Vetter
@ 2015-08-26 14:58                             ` Tiago Vignatti
  2015-08-26 15:37                               ` Thomas Hellstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Tiago Vignatti @ 2015-08-26 14:58 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Thomas Hellstrom, daniel.vetter, intel-gfx, daniel.thompson,
	dri-devel, Daniel Vetter

On 08/26/2015 11:51 AM, Daniel Vetter wrote:
> On Wed, Aug 26, 2015 at 11:32:30AM -0300, Tiago Vignatti wrote:
>> On 08/26/2015 09:58 AM, Daniel Vetter wrote:
>>> The other is that right now there's no user nor implementation in sight
>>> which actually does range-based flush optimizations, so I'm pretty much
>>> expecting we'll get it wrong. Maybe instead we should go one step further
>>> and remove the range from the internal dma-buf interface and also drop it
>>> from the ioctl? With the flags we can always add something later on once
>>> we have a real user with a clear need for it. But afaik cros only wants to
>>> shuffle around entire tiles and has a buffer-per-tile approach.
>>
>> Thomas, I think Daniel has a point here and also, I wouldn't mind removing
>> all range control from the dma-buf ioctl either.
>
> if we go with nuking it from the ioctl I'd suggest to also nuke it from
> the dma-buf internal inferface first too.

yep, I can do it.

Thomas, so we leave 2d sync out now?

Tiago



* Re: [PATCH v4] dma-buf: Add ioctls to allow userspace to flush
  2015-08-26 14:58                             ` Tiago Vignatti
@ 2015-08-26 15:37                               ` Thomas Hellstrom
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Hellstrom @ 2015-08-26 15:37 UTC (permalink / raw)
  To: Tiago Vignatti
  Cc: daniel.thompson, daniel.vetter, intel-gfx, dri-devel, Daniel Vetter


> On 26 Aug 2015, at 16:58, Tiago Vignatti <tiago.vignatti@intel.com> wrote:
> 
>> On 08/26/2015 11:51 AM, Daniel Vetter wrote:
>>> On Wed, Aug 26, 2015 at 11:32:30AM -0300, Tiago Vignatti wrote:
>>>> On 08/26/2015 09:58 AM, Daniel Vetter wrote:
>>>> The other is that right now there's no user nor implementation in sight
>>>> which actually does range-based flush optimizations, so I'm pretty much
>>>> expecting we'll get it wrong. Maybe instead we should go one step further
>>>> and remove the range from the internal dma-buf interface and also drop it
>>>> from the ioctl? With the flags we can always add something later on once
>>>> we have a real user with a clear need for it. But afaik cros only wants to
>>>> shuffle around entire tiles and has a buffer-per-tile approach.
>>> 
>>> Thomas, I think Daniel has a point here and also, I wouldn't mind removing
>>> all range control from the dma-buf ioctl either.
>> 
>> if we go with nuking it from the ioctl I'd suggest to also nuke it from
>> the dma-buf internal interface first too.
> 
> yep, I can do it.
> 
> Thomas, so we leave 2d sync out now?
> 
> Tiago
> 
Sure!
Thomas


end of thread, other threads:[~2015-08-26 15:37 UTC | newest]

Thread overview: 42+ messages
-- links below jump to the message on this page --
     [not found] <55D50442.5080102@intel.com>
2015-08-20  6:48 ` about mmap dma-buf and sync Thomas Hellstrom
2015-08-20 14:33   ` Rob Clark
2015-08-20 19:27     ` Thomas Hellstrom
2015-08-20 19:32       ` Thomas Hellstrom
2015-08-21 16:42       ` Rob Clark
2015-08-20 14:53   ` Jerome Glisse
2015-08-20 19:39     ` Thomas Hellstrom
2015-08-20 20:34       ` Jerome Glisse
2015-08-21  5:25         ` Thomas Hellstrom
2015-08-21 10:28           ` Lucas Stach
2015-08-21 10:51             ` Thomas Hellstrom
2015-08-21 13:32           ` Jerome Glisse
2015-08-21 14:15             ` Thomas Hellstrom
2015-08-21 16:00               ` Jerome Glisse
2015-08-21 22:00                 ` Thomas Hellstrom
2015-08-24  1:41                   ` Jerome Glisse
2015-08-24  9:59                     ` Thomas Hellstrom
2015-08-21 23:06   ` Tiago Vignatti
2015-08-24  9:50     ` Thomas Hellstrom
2015-08-24 15:52       ` Daniel Stone
2015-08-24 16:56         ` Thomas Hellstrom
2015-08-24 17:04           ` Daniel Stone
2015-08-24 17:10             ` Thomas Hellstrom
2015-08-24 17:12               ` Daniel Stone
2015-08-24 17:42                 ` Thomas Hellstrom
2015-08-24 17:56                   ` Michael Spang
2015-08-25  5:25                     ` Thomas Hellstrom
2015-08-24 18:01                   ` Tiago Vignatti
2015-08-24 23:03                     ` Tiago Vignatti
2015-08-25  9:02         ` Daniel Vetter
2015-08-25  9:30           ` Thomas Hellstrom
2015-08-25 16:14             ` Tiago Vignatti
2015-08-26  0:02               ` [PATCH v4] dma-buf: Add ioctls to allow userspace to flush Tiago Vignatti
2015-08-26  6:49                 ` Thomas Hellstrom
2015-08-26 12:10                   ` Daniel Vetter
2015-08-26 12:28                     ` Thomas Hellstrom
2015-08-26 12:58                       ` Daniel Vetter
2015-08-26 14:32                         ` Tiago Vignatti
2015-08-26 14:50                           ` Thomas Hellstrom
2015-08-26 14:51                           ` Daniel Vetter
2015-08-26 14:58                             ` Tiago Vignatti
2015-08-26 15:37                               ` Thomas Hellstrom
