* Re: Introduce a new helper framework for buffer synchronization
       [not found] <CAAQKjZNNw4qddo6bE5OY_CahrqDtqkxdO7Pm9RCguXyj9F4cMQ@mail.gmail.com>
@ 2013-05-13  8:00 ` Maarten Lankhorst
  2013-05-13  9:21   ` Inki Dae
  0 siblings, 1 reply; 36+ messages in thread
From: Maarten Lankhorst @ 2013-05-13  8:00 UTC (permalink / raw)
  To: Inki Dae
  Cc: Rob Clark, Daniel Vetter, DRI mailing list, linux-arm-kernel,
	linux-media, linux-fbdev, Kyungmin Park, myungjoo.ham,
	YoungJun Cho

On 09-05-13 09:33, Inki Dae wrote:
> Hi all,
>
> This post introduces a new helper framework based on dma fence. And the
> purpose of this post is to collect other opinions and advices before RFC
> posting.
>
> First of all, this helper framework, called fence helper, is still in
> progress, so it might not have enough comments in the code and might need to
> be cleaned up further. Moreover, we might be missing some parts of the dma fence.
> However, I'd like to say that everything mentioned below has been tested
> on a Linux platform and worked well.

> ....
>
> And tutorial for user process.
>         just before cpu access
>                 struct dma_buf_fence *df;
>
>                 df->type = DMA_BUF_ACCESS_READ or DMA_BUF_ACCESS_WRITE;
>                 ioctl(fd, DMA_BUF_GET_FENCE, &df);
>
>         after memset or memcpy
>                 ioctl(fd, DMA_BUF_PUT_FENCE, &df);
NAK.

Userspace doesn't need to trigger fences. It can do a buffer idle wait, and postpone submitting new commands until after it's done using the buffer.
Kernel space doesn't need the root hole you created by dereferencing a pointer passed from userspace.
Your next exercise should be to write a security exploit for the API you created here. It's the only way to learn how to write safe code. Hint: df.ctx = mmap(..);

~Maarten

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-13  8:00 ` Introduce a new helper framework for buffer synchronization Maarten Lankhorst
@ 2013-05-13  9:21   ` Inki Dae
  2013-05-13  9:52     ` Maarten Lankhorst
  0 siblings, 1 reply; 36+ messages in thread
From: Inki Dae @ 2013-05-13  9:21 UTC (permalink / raw)
  To: 'Maarten Lankhorst'
  Cc: 'Rob Clark', 'Daniel Vetter',
	'DRI mailing list',
	linux-arm-kernel, linux-media, 'linux-fbdev',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho'



> -----Original Message-----
> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
> Sent: Monday, May 13, 2013 5:01 PM
> To: Inki Dae
> Cc: Rob Clark; Daniel Vetter; DRI mailing list; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org; linux-fbdev;
> Kyungmin Park; myungjoo.ham; YoungJun Cho
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> Op 09-05-13 09:33, Inki Dae schreef:
> > Hi all,
> >
> > This post introduces a new helper framework based on dma fence. And the
> > purpose of this post is to collect other opinions and advices before RFC
> > posting.
> >
> > First of all, this helper framework, called fence helper, is in progress
> > yet so might not have enough comments in codes and also might need to be
> > more cleaned up. Moreover, we might be missing some parts of the dma
> fence.
> > However, I'd like to say that all things mentioned below has been tested
> > with Linux platform and worked well.
> 
> > ....
> >
> > And tutorial for user process.
> >         just before cpu access
> >                 struct dma_buf_fence *df;
> >
> >                 df->type = DMA_BUF_ACCESS_READ or DMA_BUF_ACCESS_WRITE;
> >                 ioctl(fd, DMA_BUF_GET_FENCE, &df);
> >
> >         after memset or memcpy
> >                 ioctl(fd, DMA_BUF_PUT_FENCE, &df);
> NAK.
> 
> Userspace doesn't need to trigger fences. It can do a buffer idle wait,
> and postpone submitting new commands until after it's done using the
> buffer.

Hi Maarten,

It seems that you are saying the user should wait for a buffer like KDS does: KDS uses
select() to postpone submitting new commands. But I think this approach assumes
that all data flows from a DMA device to a CPU. For example, a CPU has to keep
polling for the completion of a buffer access by a DMA device. This means
that this approach doesn't cover data flow in the opposite case: CPU to
DMA device.

> Kernel space doesn't need the root hole you created by giving a
> dereferencing a pointer passed from userspace.
> Your next exercise should be to write a security exploit from the api you
> created here. It's the only way to learn how to write safe code. Hint:
> df.ctx = mmap(..);
> 

Also, I'm not yet sure our way is the right one, and that is why I posted. As you
mentioned, it seems that using mmap() is safer. But there is one
issue that confuses me. Regarding your hint, df.ctx = mmap(..): the issue is
that dmabuf mmap can be used to map a dmabuf into user space, and the dmabuf
refers to a physical memory region allocated by some allocator such as drm gem
or ion.

I might be missing something, so could you please give me more comments?

Thanks,
Inki Dae



> ~Maarten


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-13  9:21   ` Inki Dae
@ 2013-05-13  9:52     ` Maarten Lankhorst
  2013-05-13 11:24       ` Inki Dae
  0 siblings, 1 reply; 36+ messages in thread
From: Maarten Lankhorst @ 2013-05-13  9:52 UTC (permalink / raw)
  To: Inki Dae
  Cc: 'Rob Clark', 'Daniel Vetter',
	'DRI mailing list',
	linux-arm-kernel, linux-media, 'linux-fbdev',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho'

On 13-05-13 11:21, Inki Dae wrote:
>
>> -----Original Message-----
>> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
>> Sent: Monday, May 13, 2013 5:01 PM
>> To: Inki Dae
>> Cc: Rob Clark; Daniel Vetter; DRI mailing list; linux-arm-
>> kernel@lists.infradead.org; linux-media@vger.kernel.org; linux-fbdev;
>> Kyungmin Park; myungjoo.ham; YoungJun Cho
>> Subject: Re: Introduce a new helper framework for buffer synchronization
>>
>> Op 09-05-13 09:33, Inki Dae schreef:
>>> Hi all,
>>>
>>> This post introduces a new helper framework based on dma fence. And the
>>> purpose of this post is to collect other opinions and advices before RFC
>>> posting.
>>>
>>> First of all, this helper framework, called fence helper, is in progress
>>> yet so might not have enough comments in codes and also might need to be
>>> more cleaned up. Moreover, we might be missing some parts of the dma
>> fence.
>>> However, I'd like to say that all things mentioned below has been tested
>>> with Linux platform and worked well.
>>> ....
>>>
>>> And tutorial for user process.
>>>         just before cpu access
>>>                 struct dma_buf_fence *df;
>>>
>>>                 df->type = DMA_BUF_ACCESS_READ or DMA_BUF_ACCESS_WRITE;
>>>                 ioctl(fd, DMA_BUF_GET_FENCE, &df);
>>>
>>>         after memset or memcpy
>>>                 ioctl(fd, DMA_BUF_PUT_FENCE, &df);
>> NAK.
>>
>> Userspace doesn't need to trigger fences. It can do a buffer idle wait,
>> and postpone submitting new commands until after it's done using the
>> buffer.
> Hi Maarten,
>
> It seems that you say user should wait for a buffer like KDS does: KDS uses
> select() to postpone submitting new commands. But I think this way assumes
> that every data flows a DMA device to a CPU. For example, a CPU should keep
> polling for the completion of a buffer access by a DMA device. This means
> that the this way isn't considered for data flow to opposite case; CPU to
> DMA device.
Not really. You do both things the same way. You first wait for the bo to be idle; this could be implemented by adding poll support to the dma-buf fd.
Then you either do your read or write. Since userspace is supposed to be the one controlling the bo, it should stay idle at that point. If you have another thread queueing
the buffer again before your thread is done, that's a bug in the application, and can be solved with userspace locking primitives. No need for the kernel to get involved.
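
A rough sketch of what that userspace flow could look like (purely illustrative: it assumes poll() support were actually added to the dma-buf fd and that the buffer is already mmap'd; none of this is existing API):

#include <poll.h>
#include <pthread.h>
#include <string.h>

static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

/* 'map' is the (assumed) mmap'd CPU view of the dma-buf. */
static void cpu_fill_buffer(int dmabuf_fd, char *map, size_t size)
{
	struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLOUT };

	pthread_mutex_lock(&buf_lock);	/* cpu<->cpu exclusion stays in userspace */
	poll(&pfd, 1, -1);		/* wait for the bo to be idle (hypothetical) */
	memset(map, 0, size);		/* plain CPU access */
	pthread_mutex_unlock(&buf_lock);
	/* only now submit new commands that use the buffer */
}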

>> Kernel space doesn't need the root hole you created by giving a
>> dereferencing a pointer passed from userspace.
>> Your next exercise should be to write a security exploit from the api you
>> created here. It's the only way to learn how to write safe code. Hint:
>> df.ctx = mmap(..);
>>
> Also I'm not clear to use our way yet and that is why I posted. As you
> mentioned, it seems like that using mmap() is more safe. But there is one
> issue it makes me confusing. For your hint, df.ctx = mmap(..), the issue is
> that dmabuf mmap can be used to map a dmabuf with user space. And the dmabuf
> means a physical memory region allocated by some allocator such as drm gem
> or ion.
>
> There might be my missing point so could you please give me more comments?
>
My point was that userspace could change df.ctx to some mmap'd memory, forcing the kernel to execute some code prepared by userspace.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-13  9:52     ` Maarten Lankhorst
@ 2013-05-13 11:24       ` Inki Dae
  2013-05-13 11:40         ` Maarten Lankhorst
  2013-05-13 19:29         ` Introduce a new helper " Tomasz Figa
  0 siblings, 2 replies; 36+ messages in thread
From: Inki Dae @ 2013-05-13 11:24 UTC (permalink / raw)
  To: 'Maarten Lankhorst'
  Cc: 'Rob Clark', 'Daniel Vetter',
	'DRI mailing list',
	linux-arm-kernel, linux-media, 'linux-fbdev',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho'



> -----Original Message-----
> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
> Sent: Monday, May 13, 2013 6:52 PM
> To: Inki Dae
> Cc: 'Rob Clark'; 'Daniel Vetter'; 'DRI mailing list'; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org; 'linux-fbdev';
> 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> Op 13-05-13 11:21, Inki Dae schreef:
> >
> >> -----Original Message-----
> >> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
> >> Sent: Monday, May 13, 2013 5:01 PM
> >> To: Inki Dae
> >> Cc: Rob Clark; Daniel Vetter; DRI mailing list; linux-arm-
> >> kernel@lists.infradead.org; linux-media@vger.kernel.org; linux-fbdev;
> >> Kyungmin Park; myungjoo.ham; YoungJun Cho
> >> Subject: Re: Introduce a new helper framework for buffer
> synchronization
> >>
> >> Op 09-05-13 09:33, Inki Dae schreef:
> >>> Hi all,
> >>>
> >>> This post introduces a new helper framework based on dma fence. And
> the
> >>> purpose of this post is to collect other opinions and advices before
> RFC
> >>> posting.
> >>>
> >>> First of all, this helper framework, called fence helper, is in
> progress
> >>> yet so might not have enough comments in codes and also might need to
> be
> >>> more cleaned up. Moreover, we might be missing some parts of the dma
> >> fence.
> >>> However, I'd like to say that all things mentioned below has been
> tested
> >>> with Linux platform and worked well.
> >>> ....
> >>>
> >>> And tutorial for user process.
> >>>         just before cpu access
> >>>                 struct dma_buf_fence *df;
> >>>
> >>>                 df->type = DMA_BUF_ACCESS_READ or
DMA_BUF_ACCESS_WRITE;
> >>>                 ioctl(fd, DMA_BUF_GET_FENCE, &df);
> >>>
> >>>         after memset or memcpy
> >>>                 ioctl(fd, DMA_BUF_PUT_FENCE, &df);
> >> NAK.
> >>
> >> Userspace doesn't need to trigger fences. It can do a buffer idle wait,
> >> and postpone submitting new commands until after it's done using the
> >> buffer.
> > Hi Maarten,
> >
> > It seems that you say user should wait for a buffer like KDS does: KDS
> uses
> > select() to postpone submitting new commands. But I think this way
> assumes
> > that every data flows a DMA device to a CPU. For example, a CPU should
> keep
> > polling for the completion of a buffer access by a DMA device. This
> means
> > that the this way isn't considered for data flow to opposite case; CPU
> to
> > DMA device.
> Not really. You do both things the same way. You first wait for the bo to
> be idle, this could be implemented by adding poll support to the dma-buf
> fd.
> Then you either do your read or write. Since userspace is supposed to be
> the one controlling the bo it should stay idle at that point. If you have
> another thread queueing
> the buffer againbefore your thread is done that's a bug in the
application,
> and can be solved with userspace locking primitives. No need for the
> kernel to get involved.
> 

Yes, that is how we have synchronized buffers between the CPU and DMA devices
until now, without a buffer synchronization mechanism. I thought it best
that the user not have to consider anything: the user can access a buffer regardless
of which DMA device is controlling it, and the buffer synchronization is performed at
the kernel level. Moreover, I think we could optimize graphics and multimedia
hardware performance because the hardware can do more work: one thread accesses
a shared buffer while the other controls a DMA device with the shared buffer in
parallel. Thus, we could avoid sequential processing, and that is my
intention. Don't you think we could improve hardware utilization this way or
some other way? Of course, there could be a better way.

> >> Kernel space doesn't need the root hole you created by giving a
> >> dereferencing a pointer passed from userspace.
> >> Your next exercise should be to write a security exploit from the api
> you
> >> created here. It's the only way to learn how to write safe code. Hint:
> >> df.ctx = mmap(..);
> >>
> > Also I'm not clear to use our way yet and that is why I posted. As you
> > mentioned, it seems like that using mmap() is more safe. But there is
> one
> > issue it makes me confusing. For your hint, df.ctx = mmap(..), the issue
> is
> > that dmabuf mmap can be used to map a dmabuf with user space. And the
> dmabuf
> > means a physical memory region allocated by some allocator such as drm
> gem
> > or ion.
> >
> > There might be my missing point so could you please give me more
> comments?
> >
> My point was that userspace could change df.ctx to some mmap'd memory,
> forcing the kernel to execute some code prepared by userspace.

Understood. I have to find a better way, and for this I'd like to hear
more opinions and advice.

Thanks for comments,
Inki Dae


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-13 11:24       ` Inki Dae
@ 2013-05-13 11:40         ` Maarten Lankhorst
  2013-05-13 12:21           ` Inki Dae
  2013-05-13 19:29         ` Introduce a new helper " Tomasz Figa
  1 sibling, 1 reply; 36+ messages in thread
From: Maarten Lankhorst @ 2013-05-13 11:40 UTC (permalink / raw)
  To: Inki Dae
  Cc: 'Rob Clark', 'Daniel Vetter',
	'DRI mailing list',
	linux-arm-kernel, linux-media, 'linux-fbdev',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho'

On 13-05-13 13:24, Inki Dae wrote:
>> and can be solved with userspace locking primitives. No need for the
>> kernel to get involved.
>>
> Yes, that is how we have synchronized buffer between CPU and DMA device
> until now without buffer synchronization mechanism. I thought that it's best
> to make user not considering anything: user can access a buffer regardless
> of any DMA device controlling and the buffer synchronization is performed in
> kernel level. Moreover, I think we could optimize graphics and multimedia
> hardware performance because hardware can do more works: one thread accesses
> a shared buffer and the other controls DMA device with the shared buffer in
> parallel. Thus, we could avoid sequential processing and that is my
> intention. Aren't you think about that we could improve hardware utilization
> with such way or other? of course, there could be better way.
>
If you don't want to block the hardware, the only option is to allocate a temp bo and blit to/from it using the hardware.
OpenGL already does that when you want to read back, because otherwise the entire pipeline can get stalled.
The overhead of command submission for that shouldn't be high; if it is, you should really try to optimize that first
before coming up with this crazy scheme.

In that case you still wouldn't give userspace control over the fences. I don't see any way that can end well.
What if userspace never signals? What if userspace gets killed by the OOM killer? Who keeps track of that?

~Maarten

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-13 11:40         ` Maarten Lankhorst
@ 2013-05-13 12:21           ` Inki Dae
  2013-05-13 13:48             ` Rob Clark
  0 siblings, 1 reply; 36+ messages in thread
From: Inki Dae @ 2013-05-13 12:21 UTC (permalink / raw)
  To: 'Maarten Lankhorst'
  Cc: 'Rob Clark', 'Daniel Vetter',
	'DRI mailing list',
	linux-arm-kernel, linux-media, 'linux-fbdev',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho'



> -----Original Message-----
> From: linux-fbdev-owner@vger.kernel.org [mailto:linux-fbdev-
> owner@vger.kernel.org] On Behalf Of Maarten Lankhorst
> Sent: Monday, May 13, 2013 8:41 PM
> To: Inki Dae
> Cc: 'Rob Clark'; 'Daniel Vetter'; 'DRI mailing list'; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org; 'linux-fbdev';
> 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> Op 13-05-13 13:24, Inki Dae schreef:
> >> and can be solved with userspace locking primitives. No need for the
> >> kernel to get involved.
> >>
> > Yes, that is how we have synchronized buffer between CPU and DMA device
> > until now without buffer synchronization mechanism. I thought that it's
> best
> > to make user not considering anything: user can access a buffer
> regardless
> > of any DMA device controlling and the buffer synchronization is
> performed in
> > kernel level. Moreover, I think we could optimize graphics and
> multimedia
> > hardware performance because hardware can do more works: one thread
> accesses
> > a shared buffer and the other controls DMA device with the shared buffer
> in
> > parallel. Thus, we could avoid sequential processing and that is my
> > intention. Aren't you think about that we could improve hardware
> utilization
> > with such way or other? of course, there could be better way.
> >
> If you don't want to block the hardware the only option is to allocate a
> temp bo and blit to/from it using the hardware.
> OpenGL already does that when you want to read back, because otherwise the
> entire pipeline can get stalled.
> The overhead of command submission for that shouldn't be high, if it is
> you should really try to optimize that first
> before coming up with this crazy scheme.
> 

I have considered all devices sharing a buffer with the CPU: multimedia, display
controller, and graphics devices (including the GPU). And we could provide
easy-to-use user interfaces for buffer synchronization. Of course, the
proposed user interfaces may still be ugly, but at least I think we need
something for this.

> In that case you still wouldn't give userspace control over the fences. I
> don't see any way that can end well.
> What if userspace never signals? What if userspace gets killed by oom
> killer. Who keeps track of that?
> 

In all cases, all kernel resources attached to a user fence will be released by the
kernel once the fence times out: never signaling, or the process being killed by the
OOM killer, makes the fence time out. And if we use the mmap mechanism you mentioned
before, I think user resources could also be freed properly.

Thanks,
Inki Dae

> ~Maarten
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-13 12:21           ` Inki Dae
@ 2013-05-13 13:48             ` Rob Clark
       [not found]               ` <CAAQKjZP=iOmHRpHZCbZD3v_RKUFSn0eM_WVZZvhe7F9g3eTmPA@mail.gmail.com>
  0 siblings, 1 reply; 36+ messages in thread
From: Rob Clark @ 2013-05-13 13:48 UTC (permalink / raw)
  To: Inki Dae
  Cc: Maarten Lankhorst, Daniel Vetter, DRI mailing list,
	linux-arm-kernel, linux-media, linux-fbdev, Kyungmin Park,
	myungjoo.ham, YoungJun Cho

On Mon, May 13, 2013 at 8:21 AM, Inki Dae <inki.dae@samsung.com> wrote:
>
>> In that case you still wouldn't give userspace control over the fences. I
>> don't see any way that can end well.
>> What if userspace never signals? What if userspace gets killed by oom
>> killer. Who keeps track of that?
>>
>
> In all cases, all kernel resources to user fence will be released by kernel
> once the fence is timed out: never signaling and process killing by oom
> killer makes the fence timed out. And if we use mmap mechanism you mentioned
> before, I think user resource could also be freed properly.


I tend to agree w/ Maarten here.. there is no good reason for
userspace to be *signaling* fences.  The exception might be some blob
gpu drivers which don't have enough knowledge in the kernel to figure
out what to do.  (In which case you can add driver private ioctls for
that.. still not the right thing to do but at least you don't make a
public API out of it.)

BR,
-R

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
       [not found]               ` <CAAQKjZP=iOmHRpHZCbZD3v_RKUFSn0eM_WVZZvhe7F9g3eTmPA@mail.gmail.com>
@ 2013-05-13 17:58                 ` Rob Clark
  2013-05-14  2:52                   ` Inki Dae
  0 siblings, 1 reply; 36+ messages in thread
From: Rob Clark @ 2013-05-13 17:58 UTC (permalink / raw)
  To: Inki Dae
  Cc: linux-fbdev, DRI mailing list, Kyungmin Park, myungjoo.ham,
	YoungJun Cho, linux-arm-kernel, linux-media

On Mon, May 13, 2013 at 1:18 PM, Inki Dae <inki.dae@samsung.com> wrote:
>
>
> 2013/5/13 Rob Clark <robdclark@gmail.com>
>>
>> On Mon, May 13, 2013 at 8:21 AM, Inki Dae <inki.dae@samsung.com> wrote:
>> >
>> >> In that case you still wouldn't give userspace control over the fences.
>> >> I
>> >> don't see any way that can end well.
>> >> What if userspace never signals? What if userspace gets killed by oom
>> >> killer. Who keeps track of that?
>> >>
>> >
>> > In all cases, all kernel resources to user fence will be released by
>> > kernel
>> > once the fence is timed out: never signaling and process killing by oom
>> > killer makes the fence timed out. And if we use mmap mechanism you
>> > mentioned
>> > before, I think user resource could also be freed properly.
>>
>>
>> I tend to agree w/ Maarten here.. there is no good reason for
>> userspace to be *signaling* fences.  The exception might be some blob
>> gpu drivers which don't have enough knowledge in the kernel to figure
>> out what to do.  (In which case you can add driver private ioctls for
>> that.. still not the right thing to do but at least you don't make a
>> public API out of it.)
>>
>
> Please do not care whether those are generic or not. Let's look at the following
> three things. First, there is cache maintenance. As you know, an ARM SoC has an ACP
> (Accelerator Coherency Port) that can be connected to a DMA engine or similar
> devices, and this port is used for cache coherency between the CPU cache and the DMA
> device. However, most devices on ARM based embedded systems don't use the
> ACP port, so they need proper cache operations before and after DMA or CPU
> access when cacheable mappings are used. Actually, I see many Linux based
> platforms call cache control interfaces directly for that. I think the
> reason they do so is that the kernel isn't aware of when and how the CPU accessed
> memory.

I think we had kicked around the idea of giving dmabufs a
prepare/finish ioctl quite some time back.  This is probably something
that should be at least a bit decoupled from fences.  (Possibly
'prepare' waits for dma access to complete, but not the other way
around.)

And I did implement support in omapdrm for simulating coherency via
page fault-in / shoot-down..  It is one option that makes it
completely transparent to userspace, although there is some
performance cost, so I suppose it depends a bit on your use-case.

> And second, a user process has to do many things when using shared
> memory with a DMA device. The user process should understand how the DMA device is
> operated and when the interfaces for controlling the DMA device are called. Such
> things make the user application very complicated.
>
> And third, it's a performance optimization for multimedia and graphics devices.
> As I mentioned already, buffer sharing between the CPU and a DMA device is currently
> sequential: the CPU stays idle until the DMA device has completed, and vice versa.
>
> That is why I proposed such user interfaces. Of course, these interfaces
> might still be ugly: Maarten already pointed this out and I agree with
> him. But there must be another, better way. Don't you think we need something
> similar? With such interfaces, cache control and buffer synchronization can be
> performed at the kernel level. Moreover, the user application doesn't need to
> consider DMA device control anymore. Therefore, one thread can access a
> shared buffer and the other can control a DMA device with the shared buffer in
> parallel. We can really make the best use of CPU and DMA idle time. In other
> words, we can really make the best use of a multitasking OS, Linux.
>
> So could you please tell me whether there is any reason we don't use a
> public API for it? I think we can add and use a public API if NECESSARY.

well, for cache management, I think it is a better idea.. I didn't
really catch that this was the motivation from the initial patch, but
maybe I read it too quickly.  But cache can be decoupled from
synchronization, because CPU access is not asynchronous.  For
userspace/CPU access to buffer, you should:

  1) wait for buffer
  2) prepare-access
  3)  ... do whatever cpu access to buffer ...
  4) finish-access
  5) submit buffer for new dma-operation

I suppose you could combine the syscall for #1 and #2.. not sure if
that is a good idea or not.  But you don't need to.  And there is
never really any need for userspace to signal a fence.
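
As a rough sketch of that ordering (the ioctl names below are hypothetical and only stand in for steps 2) and 4); wait_for_buffer_idle() and submit_to_dma_engine() are likewise placeholders for whatever mechanism handles steps 1) and 5)):

/* Hypothetical userspace flow; DMABUF_IOCTL_PREPARE/FINISH do not exist,
 * they just make the step ordering above concrete. */
void cpu_update_buffer(int dmabuf_fd, char *map, size_t size)
{
	wait_for_buffer_idle(dmabuf_fd);         /* 1) wait for buffer          */
	ioctl(dmabuf_fd, DMABUF_IOCTL_PREPARE);  /* 2) prepare-access (caches)  */
	memset(map, 0, size);                    /* 3) cpu access to buffer     */
	ioctl(dmabuf_fd, DMABUF_IOCTL_FINISH);   /* 4) finish-access (caches)   */
	submit_to_dma_engine(dmabuf_fd);         /* 5) submit new dma-operation */
}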

BR,
-R

> Thanks,
> Inki Dae
>
>>
>> BR,
>> -R
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-13 11:24       ` Inki Dae
  2013-05-13 11:40         ` Maarten Lankhorst
@ 2013-05-13 19:29         ` Tomasz Figa
  1 sibling, 0 replies; 36+ messages in thread
From: Tomasz Figa @ 2013-05-13 19:29 UTC (permalink / raw)
  To: dri-devel
  Cc: Inki Dae, 'Maarten Lankhorst', 'linux-fbdev',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho',
	linux-arm-kernel, linux-media

Hi,

On Monday 13 of May 2013 20:24:01 Inki Dae wrote:
> > -----Original Message-----
> > From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
> > Sent: Monday, May 13, 2013 6:52 PM
> > To: Inki Dae
> > Cc: 'Rob Clark'; 'Daniel Vetter'; 'DRI mailing list'; linux-arm-
> > kernel@lists.infradead.org; linux-media@vger.kernel.org;
> > 'linux-fbdev';
> > 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'
> > Subject: Re: Introduce a new helper framework for buffer
> > synchronization> 
> > Op 13-05-13 11:21, Inki Dae schreef:
> > >> -----Original Message-----
> > >> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
> > >> Sent: Monday, May 13, 2013 5:01 PM
> > >> To: Inki Dae
> > >> Cc: Rob Clark; Daniel Vetter; DRI mailing list; linux-arm-
> > >> kernel@lists.infradead.org; linux-media@vger.kernel.org;
> > >> linux-fbdev;
> > >> Kyungmin Park; myungjoo.ham; YoungJun Cho
> > >> Subject: Re: Introduce a new helper framework for buffer
> > 
> > synchronization
> > 
> > >> Op 09-05-13 09:33, Inki Dae schreef:
> > >>> Hi all,
> > >>> 
> > >>> This post introduces a new helper framework based on dma fence.
> > >>> And
> > 
> > the
> > 
> > >>> purpose of this post is to collect other opinions and advices
> > >>> before
> > 
> > RFC
> > 
> > >>> posting.
> > >>> 
> > >>> First of all, this helper framework, called fence helper, is in
> > 
> > progress
> > 
> > >>> yet so might not have enough comments in codes and also might need
> > >>> to
> > 
> > be
> > 
> > >>> more cleaned up. Moreover, we might be missing some parts of the
> > >>> dma
> > >> 
> > >> fence.
> > >> 
> > >>> However, I'd like to say that all things mentioned below has been
> > 
> > tested
> > 
> > >>> with Linux platform and worked well.
> > >>> ....
> > >>> 
> > >>> And tutorial for user process.
> > >>> 
> > >>>         just before cpu access
> > >>>         
> > >>>                 struct dma_buf_fence *df;
> > >>>                 
> > >>>                 df->type = DMA_BUF_ACCESS_READ or
> 
> DMA_BUF_ACCESS_WRITE;
> 
> > >>>                 ioctl(fd, DMA_BUF_GET_FENCE, &df);
> > >>>         
> > >>>         after memset or memcpy
> > >>>         
> > >>>                 ioctl(fd, DMA_BUF_PUT_FENCE, &df);
> > >> 
> > >> NAK.
> > >> 
> > >> Userspace doesn't need to trigger fences. It can do a buffer idle
> > >> wait,
> > >> and postpone submitting new commands until after it's done using
> > >> the
> > >> buffer.
> > > 
> > > Hi Maarten,
> > > 
> > > It seems that you say user should wait for a buffer like KDS does:
> > > KDS
> > 
> > uses
> > 
> > > select() to postpone submitting new commands. But I think this way
> > 
> > assumes
> > 
> > > that every data flows a DMA device to a CPU. For example, a CPU
> > > should
> > 
> > keep
> > 
> > > polling for the completion of a buffer access by a DMA device. This
> > 
> > means
> > 
> > > that the this way isn't considered for data flow to opposite case;
> > > CPU
> > 
> > to
> > 
> > > DMA device.
> > 
> > Not really. You do both things the same way. You first wait for the bo
> > to be idle, this could be implemented by adding poll support to the
> > dma-buf fd.
> > Then you either do your read or write. Since userspace is supposed to
> > be the one controlling the bo it should stay idle at that point. If
> > you have another thread queueing
> > the buffer againbefore your thread is done that's a bug in the
> 
> application,
> 
> > and can be solved with userspace locking primitives. No need for the
> > kernel to get involved.
> 
> Yes, that is how we have synchronized buffer between CPU and DMA device
> until now without buffer synchronization mechanism. I thought that it's
> best to make user not considering anything: user can access a buffer
> regardless of any DMA device controlling and the buffer synchronization
> is performed in kernel level. Moreover, I think we could optimize
> graphics and multimedia hardware performance because hardware can do
> more works: one thread accesses a shared buffer and the other controls
> DMA device with the shared buffer in parallel.

Could you explain this point? I thought that if there is a shared buffer
accessible to the user and a DMA device, only one of them can use it at a
time, i.e. the buffer is useless to the reading user (or read DMA)
until the (write) DMA (or writing user) finishes writing to it. Is that
incorrect? Or is this not the point here?

I'm not an expert here and I'm just trying to understand the idea, so
correct me if I'm wrong. Thanks in advance.

Best regards,
Tomasz

> Thus, we could avoid
> sequential processing and that is my intention. Aren't you think about
> that we could improve hardware utilization with such way or other? of
> course, there could be better way.
> 
> > >> Kernel space doesn't need the root hole you created by giving a
> > >> dereferencing a pointer passed from userspace.
> > >> Your next exercise should be to write a security exploit from the
> > >> api
> > 
> > you
> > 
> > >> created here. It's the only way to learn how to write safe code.
> > >> Hint:
> > >> df.ctx = mmap(..);
> > > 
> > > Also I'm not clear to use our way yet and that is why I posted. As
> > > you
> > > mentioned, it seems like that using mmap() is more safe. But there
> > > is
> > 
> > one
> > 
> > > issue it makes me confusing. For your hint, df.ctx = mmap(..), the
> > > issue> 
> > is
> > 
> > > that dmabuf mmap can be used to map a dmabuf with user space. And
> > > the
> > 
> > dmabuf
> > 
> > > means a physical memory region allocated by some allocator such as
> > > drm
> > 
> > gem
> > 
> > > or ion.
> > > 
> > > There might be my missing point so could you please give me more
> > 
> > comments?
> > 
> > My point was that userspace could change df.ctx to some mmap'd memory,
> > forcing the kernel to execute some code prepared by userspace.
> 
> Understood. I have to find a better way. And for this, I'd like to
> listen attentively more opinions and advices.
> 
> Thanks for comments,
> Inki Dae
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-13 17:58                 ` Rob Clark
@ 2013-05-14  2:52                   ` Inki Dae
  2013-05-14 13:38                     ` Rob Clark
  0 siblings, 1 reply; 36+ messages in thread
From: Inki Dae @ 2013-05-14  2:52 UTC (permalink / raw)
  To: 'Rob Clark'
  Cc: 'linux-fbdev', 'DRI mailing list',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho',
	linux-arm-kernel, linux-media



> -----Original Message-----
> From: Rob Clark [mailto:robdclark@gmail.com]
> Sent: Tuesday, May 14, 2013 2:58 AM
> To: Inki Dae
> Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
> Cho; linux-arm-kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Mon, May 13, 2013 at 1:18 PM, Inki Dae <inki.dae@samsung.com> wrote:
> >
> >
> > 2013/5/13 Rob Clark <robdclark@gmail.com>
> >>
> >> On Mon, May 13, 2013 at 8:21 AM, Inki Dae <inki.dae@samsung.com> wrote:
> >> >
> >> >> In that case you still wouldn't give userspace control over the
> fences.
> >> >> I
> >> >> don't see any way that can end well.
> >> >> What if userspace never signals? What if userspace gets killed by
> oom
> >> >> killer. Who keeps track of that?
> >> >>
> >> >
> >> > In all cases, all kernel resources to user fence will be released by
> >> > kernel
> >> > once the fence is timed out: never signaling and process killing by
> oom
> >> > killer makes the fence timed out. And if we use mmap mechanism you
> >> > mentioned
> >> > before, I think user resource could also be freed properly.
> >>
> >>
> >> I tend to agree w/ Maarten here.. there is no good reason for
> >> userspace to be *signaling* fences.  The exception might be some blob
> >> gpu drivers which don't have enough knowledge in the kernel to figure
> >> out what to do.  (In which case you can add driver private ioctls for
> >> that.. still not the right thing to do but at least you don't make a
> >> public API out of it.)
> >>
> >
> > Please do not care whether those are generic or not. Let's see the
> following
> > three things. First, it's cache operation. As you know, ARM SoC has ACP
> > (Accelerator Coherency Port) and can be connected to DMA engine or
> similar
> > devices. And this port is used for cache coherency between CPU cache and
> DMA
> > device. However, most devices on ARM based embedded systems don't use
> the
> > ACP port. So they need proper cache operation before and after of DMA or
> CPU
> > access in case of using cachable mapping. Actually, I see many Linux
> based
> > platforms call cache control interfaces directly for that. I think the
> > reason, they do so, is that kernel isn't aware of when and how CPU
> accessed
> > memory.
> 
> I think we had kicked around the idea of giving dmabuf's a
> prepare/finish ioctl quite some time back.  This is probably something
> that should be at least a bit decoupled from fences.  (Possibly
> 'prepare' waits for dma access to complete, but not the other way
> around.)
> 
> And I did implement in omapdrm support for simulating coherency via
> page fault-in / shoot-down..  It is one option that makes it
> completely transparent to userspace, although there is some
> performance const, so I suppose it depends a bit on your use-case.
> 
> > And second, user process has to do so many things in case of using
> shared
> > memory with DMA device. User process should understand how DMA device is
> > operated and when interfaces for controling the DMA device are called.
> Such
> > things would make user application so complicated.
> >
> > And third, it's performance optimization to multimedia and graphics
> devices.
> > As I mentioned already, we should consider sequential processing for
> buffer
> > sharing between CPU and DMA device. This means that CPU should stay with
> > idle until DMA device is completed and vise versa.
> >
> > That is why I proposed such user interfaces. Of course, these interfaces
> > might be so ugly yet: for this, Maarten pointed already out and I agree
> with
> > him. But there must be another better way. Aren't you think we need
> similar
> > thing? With such interfaces, cache control and buffer synchronization
> can be
> > performed in kernel level. Moreover, user applization doesn't need to
> > consider DMA device controlling anymore. Therefore, one thread can
> access a
> > shared buffer and the other can control DMA device with the shared
> buffer in
> > parallel. We can really make the best use of CPU and DMA idle time. In
> other
> > words, we can really make the best use of multi tasking OS, Linux.
> >
> > So could you please tell me about that there is any reason we don't use
> > public API for it? I think we can add and use public API if NECESSARY.
> 
> well, for cache management, I think it is a better idea.. I didn't
> really catch that this was the motivation from the initial patch, but
> maybe I read it too quickly.  But cache can be decoupled from
> synchronization, because CPU access is not asynchronous.  For
> userspace/CPU access to buffer, you should:
> 
>   1) wait for buffer
>   2) prepare-access
>   3)  ... do whatever cpu access to buffer ...
>   4) finish-access
>   5) submit buffer for new dma-operation
> 


For data flow from CPU to DMA device,
1) wait for buffer
2) prepare-access (dma_buf_begin_cpu_access)
3) cpu access to buffer


For data flow from DMA device to CPU,
1) wait for buffer
2) finish-access (dma_buf_end_cpu_access)
3) dma access to buffer

1) and 2) are coupled in one function: we have implemented
fence_helper_commit_reserve() for it.

Cache control (cache clean or cache invalidate) is performed properly by
checking the previous access type and the current access type.
And below is the actual code for it:

static void fence_helper_cache_ops(struct fence_helper *fh)
{
	struct seqno_fence_dmabuf *sfd;

	list_for_each_entry(sfd, &fh->sf.sync_buf_list, list) {
		struct dma_buf *dmabuf = sfd->sync_buf;

		if (WARN_ON(!dmabuf))
			continue;

		/* first time access. */
		if (!dmabuf->access_type)
			goto out;

		if (dmabuf->access_type == DMA_BUF_ACCESS_WRITE &&
				((fh->type == (DMA_BUF_ACCESS_READ |
						DMA_BUF_ACCESS_DMA)) ||
				(fh->type == (DMA_BUF_ACCESS_WRITE |
					     DMA_BUF_ACCESS_DMA))))
			/* cache clean */
			dma_buf_end_cpu_access(dmabuf, 0, dmabuf->size,
						DMA_TO_DEVICE);
		else if (dmabuf->access_type == (DMA_BUF_ACCESS_WRITE |
						DMA_BUF_ACCESS_DMA) &&
				(fh->type == DMA_BUF_ACCESS_READ))
			/* cache invalidate */
			dma_buf_begin_cpu_access(dmabuf, 0, dmabuf->size,
							DMA_FROM_DEVICE);

out:
		/* Update access type to new one. */
		dmabuf->access_type = fh->type;
	}
}

The above function is called after 'wait for buffer'.
Thus, with this approach we can check who (CPU or DMA) accessed, and how (READ
or WRITE) it accessed, the buffer. In other words, the kernel is now aware of
buffer accesses by the CPU as well.


Thanks,
Inki Dae

> I suppose you could combine the syscall for #1 and #2.. not sure if
> that is a good idea or not.  But you don't need to.  And there is
> never really any need for userspace to signal a fence.
> 
> BR,
> -R
> 
> > Thanks,
> > Inki Dae
> >
> >>
> >> BR,
> >> -R
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel@lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-14  2:52                   ` Inki Dae
@ 2013-05-14 13:38                     ` Rob Clark
  2013-05-15  5:19                       ` Inki Dae
  0 siblings, 1 reply; 36+ messages in thread
From: Rob Clark @ 2013-05-14 13:38 UTC (permalink / raw)
  To: Inki Dae
  Cc: linux-fbdev, DRI mailing list, Kyungmin Park, myungjoo.ham,
	YoungJun Cho, linux-arm-kernel, linux-media

On Mon, May 13, 2013 at 10:52 PM, Inki Dae <inki.dae@samsung.com> wrote:
>> well, for cache management, I think it is a better idea.. I didn't
>> really catch that this was the motivation from the initial patch, but
>> maybe I read it too quickly.  But cache can be decoupled from
>> synchronization, because CPU access is not asynchronous.  For
>> userspace/CPU access to buffer, you should:
>>
>>   1) wait for buffer
>>   2) prepare-access
>>   3)  ... do whatever cpu access to buffer ...
>>   4) finish-access
>>   5) submit buffer for new dma-operation
>>
>
>
> For data flow from CPU to DMA device,
> 1) wait for buffer
> 2) prepare-access (dma_buf_begin_cpu_access)
> 3) cpu access to buffer
>
>
> For data flow from DMA device to CPU
> 1) wait for buffer

Right, but CPU access isn't asynchronous (from the point of view of
the CPU), so there isn't really any wait step at this point.  And if
you do want the CPU to be able to signal a fence from userspace for
some reason, you probably want something file/fd based so the
refcnting/cleanup when process dies doesn't leave some pending DMA
action wedged.  But I don't really see the point of that complexity
when the CPU access isn't asynchronous in the first place.

BR,
-R


> 2) finish-access (dma_buf_end _cpu_access)
> 3) dma access to buffer
>
> 1) and 2) are coupled with one function: we have implemented
> fence_helper_commit_reserve() for it.
>
> Cache control(cache clean or cache invalidate) is performed properly
> checking previous access type and current access type.
> And the below is actual codes for it,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-14 13:38                     ` Rob Clark
@ 2013-05-15  5:19                       ` Inki Dae
  2013-05-15 14:06                         ` Rob Clark
  0 siblings, 1 reply; 36+ messages in thread
From: Inki Dae @ 2013-05-15  5:19 UTC (permalink / raw)
  To: 'Rob Clark'
  Cc: 'linux-fbdev', 'DRI mailing list',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho',
	linux-arm-kernel, linux-media



> -----Original Message-----
> From: Rob Clark [mailto:robdclark@gmail.com]
> Sent: Tuesday, May 14, 2013 10:39 PM
> To: Inki Dae
> Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
> Cho; linux-arm-kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Mon, May 13, 2013 at 10:52 PM, Inki Dae <inki.dae@samsung.com> wrote:
> >> well, for cache management, I think it is a better idea.. I didn't
> >> really catch that this was the motivation from the initial patch, but
> >> maybe I read it too quickly.  But cache can be decoupled from
> >> synchronization, because CPU access is not asynchronous.  For
> >> userspace/CPU access to buffer, you should:
> >>
> >>   1) wait for buffer
> >>   2) prepare-access
> >>   3)  ... do whatever cpu access to buffer ...
> >>   4) finish-access
> >>   5) submit buffer for new dma-operation
> >>
> >
> >
> > For data flow from CPU to DMA device,
> > 1) wait for buffer
> > 2) prepare-access (dma_buf_begin_cpu_access)
> > 3) cpu access to buffer
> >
> >
> > For data flow from DMA device to CPU
> > 1) wait for buffer
> 
> Right, but CPU access isn't asynchronous (from the point of view of
> the CPU), so there isn't really any wait step at this point.  And if
> you do want the CPU to be able to signal a fence from userspace for
> some reason, you probably what something file/fd based so the
> refcnting/cleanup when process dies doesn't leave some pending DMA
> action wedged.  But I don't really see the point of that complexity
> when the CPU access isn't asynchronous in the first place.
>

My previous comments were missing some steps; please see the sequence below.

For data flow from CPU to DMA device and then from DMA device to CPU,
1) wait for buffer <- at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
	- including prepare-access (dma_buf_begin_cpu_access)
2) cpu access to buffer
3) wait for buffer <- at device driver
	- but CPU is already accessing the buffer so blocked.
4) signal <- at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
5) the thread, blocked at 3), is waked up by 4).
	- and then finish-access (dma_buf_end_cpu_access)
6) dma access to buffer
7) wait for buffer <- at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
	- but DMA is already accessing the buffer so blocked.
8) signal <- at device driver
9) the thread, blocked at 7), is waked up by 8)
	- and then prepare-access (dma_buf_begin_cpu_access)
10) cpu access to buffer

Basically, 'wait for buffer' includes buffer synchronization, commit
processing, and cache operations. Buffer synchronization means that the
current thread should wait for other threads accessing a shared buffer until
their access has completed. And commit processing means that the current
thread takes possession of the shared buffer, so any attempt by another thread
to access the shared buffer causes that thread to be blocked. However, as
I already mentioned before, these user interfaces still seem rather ugly,
so we need a better way.
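
To make the user-side steps 1), 2) and 4) concrete, here is a minimal sketch using the interfaces from this proposal (struct dma_buf_fence and the GET/PUT_FENCE ioctls exist only in this RFC, not in mainline; 'map' and 'size' refer to an already mmap'd view of the buffer, error handling is omitted, and the fence struct is kept on the stack for simplicity):

	/* just before cpu access: step 1) wait for buffer (+ begin_cpu_access) */
	struct dma_buf_fence df;

	df.type = DMA_BUF_ACCESS_WRITE;
	ioctl(fd, DMA_BUF_GET_FENCE, &df);

	/* step 2) cpu access to buffer */
	memset(map, 0, size);

	/* step 4) signal: wakes the driver blocked at step 3), which then
	 * does end_cpu_access and starts the DMA access (steps 5) and 6)) */
	ioctl(fd, DMA_BUF_PUT_FENCE, &df);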

Please give me more comments if I'm missing something :)

Thanks,
Inki Dae

> BR,
> -R
> 
> 
> > 2) finish-access (dma_buf_end _cpu_access)
> > 3) dma access to buffer
> >
> > 1) and 2) are coupled with one function: we have implemented
> > fence_helper_commit_reserve() for it.
> >
> > Cache control(cache clean or cache invalidate) is performed properly
> > checking previous access type and current access type.
> > And the below is actual codes for it,


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-15  5:19                       ` Inki Dae
@ 2013-05-15 14:06                         ` Rob Clark
  2013-05-17  4:19                           ` Daniel Vetter
       [not found]                           ` <CAAQKjZP37koEPob6yqpn-WxxTh3+O=twyvRzDiEhVJTD8BxQzw@mail.gmail.com>
  0 siblings, 2 replies; 36+ messages in thread
From: Rob Clark @ 2013-05-15 14:06 UTC (permalink / raw)
  To: Inki Dae
  Cc: linux-fbdev, DRI mailing list, Kyungmin Park, myungjoo.ham,
	YoungJun Cho, linux-arm-kernel, linux-media

On Wed, May 15, 2013 at 1:19 AM, Inki Dae <inki.dae@samsung.com> wrote:
>
>
>> -----Original Message-----
>> From: Rob Clark [mailto:robdclark@gmail.com]
>> Sent: Tuesday, May 14, 2013 10:39 PM
>> To: Inki Dae
>> Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
>> Cho; linux-arm-kernel@lists.infradead.org; linux-media@vger.kernel.org
>> Subject: Re: Introduce a new helper framework for buffer synchronization
>>
>> On Mon, May 13, 2013 at 10:52 PM, Inki Dae <inki.dae@samsung.com> wrote:
>> >> well, for cache management, I think it is a better idea.. I didn't
>> >> really catch that this was the motivation from the initial patch, but
>> >> maybe I read it too quickly.  But cache can be decoupled from
>> >> synchronization, because CPU access is not asynchronous.  For
>> >> userspace/CPU access to buffer, you should:
>> >>
>> >>   1) wait for buffer
>> >>   2) prepare-access
>> >>   3)  ... do whatever cpu access to buffer ...
>> >>   4) finish-access
>> >>   5) submit buffer for new dma-operation
>> >>
>> >
>> >
>> > For data flow from CPU to DMA device,
>> > 1) wait for buffer
>> > 2) prepare-access (dma_buf_begin_cpu_access)
>> > 3) cpu access to buffer
>> >
>> >
>> > For data flow from DMA device to CPU
>> > 1) wait for buffer
>>
>> Right, but CPU access isn't asynchronous (from the point of view of
>> the CPU), so there isn't really any wait step at this point.  And if
>> you do want the CPU to be able to signal a fence from userspace for
>> some reason, you probably what something file/fd based so the
>> refcnting/cleanup when process dies doesn't leave some pending DMA
>> action wedged.  But I don't really see the point of that complexity
>> when the CPU access isn't asynchronous in the first place.
>>
>
> There was my missing comments, please see the below sequence.
>
> For data flow from CPU to DMA device and then from DMA device to CPU,
> 1) wait for buffer <- at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
>         - including prepare-access (dma_buf_begin_cpu_access)
> 2) cpu access to buffer
> 3) wait for buffer <- at device driver
>         - but CPU is already accessing the buffer so blocked.
> 4) signal <- at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
> 5) the thread, blocked at 3), is waked up by 4).
>         - and then finish-access (dma_buf_end_cpu_access)

right, I understand you can have background threads, etc, in
userspace.  But there are already plenty of synchronization primitives
that can be used for cpu->cpu synchronization, either within the same
process or between multiple processes.  For cpu access, even if it is
handled by background threads/processes, I think it is better to use
the traditional pthreads or unix synchronization primitives.  They
have existed forever, they are well tested, and they work.
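
For example (plain pthreads, nothing dma-buf specific; purely illustrative):

#include <pthread.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int buffer_ready;

/* one thread fills the mmap'd buffer ... */
void producer(char *map, size_t size)
{
	pthread_mutex_lock(&lock);
	memset(map, 0xff, size);
	buffer_ready = 1;
	pthread_cond_signal(&done);
	pthread_mutex_unlock(&lock);
}

/* ... and another consumes it, with no kernel fence involved */
void consumer(char *map, size_t size)
{
	pthread_mutex_lock(&lock);
	while (!buffer_ready)
		pthread_cond_wait(&done, &lock);
	/* read from map here */
	pthread_mutex_unlock(&lock);
}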

So while it seems nice and orthogonal/clean to couple cache and
synchronization and handle dma->cpu and cpu->cpu and cpu->dma in the
same generic way, I think in practice we have to make things more
complex than they otherwise need to be to do this.  Otherwise I think
we'll be having problems with badly behaved or crashing userspace.

BR,
-R

> 6) dma access to buffer
> 7) wait for buffer <- at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
>         - but DMA is already accessing the buffer so blocked.
> 8) signal <- at device driver
> 9) the thread, blocked at 7), is waked up by 8)
>         - and then prepare-access (dma_buf_begin_cpu_access)
> 10 cpu access to buffer
>
> Basically, 'wait for buffer' includes buffer synchronization, committing
> processing, and cache operation. The buffer synchronization means that a
> current thread should wait for other threads accessing a shared buffer until
> the completion of their access. And the committing processing means that a
> current thread possesses the shared buffer so any trying to access the
> shared buffer by another thread makes the thread to be blocked. However, as
> I already mentioned before, it seems that these user interfaces are so ugly
> yet. So we need better way.
>
> Give me more comments if there is my missing point :)
>
> Thanks,
> Inki Dae
>
>> BR,
>> -R
>>
>>
>> > 2) finish-access (dma_buf_end _cpu_access)
>> > 3) dma access to buffer
>> >
>> > 1) and 2) are coupled with one function: we have implemented
>> > fence_helper_commit_reserve() for it.
>> >
>> > Cache control(cache clean or cache invalidate) is performed properly
>> > checking previous access type and current access type.
>> > And the below is actual codes for it,
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-15 14:06                         ` Rob Clark
@ 2013-05-17  4:19                           ` Daniel Vetter
       [not found]                           ` <CAAQKjZP37koEPob6yqpn-WxxTh3+O=twyvRzDiEhVJTD8BxQzw@mail.gmail.com>
  1 sibling, 0 replies; 36+ messages in thread
From: Daniel Vetter @ 2013-05-17  4:19 UTC (permalink / raw)
  To: Rob Clark
  Cc: Inki Dae, linux-fbdev, YoungJun Cho, Kyungmin Park, myungjoo.ham,
	DRI mailing list, linux-arm-kernel, linux-media

On Wed, May 15, 2013 at 4:06 PM, Rob Clark <robdclark@gmail.com> wrote:
> So while it seems nice and orthogonal/clean to couple cache and
> synchronization and handle dma->cpu and cpu->cpu and cpu->dma in the
> same generic way, but I think in practice we have to make things more
> complex than they otherwise need to be to do this.  Otherwise I think
> we'll be having problems with badly behaved or crashing userspace.

I haven't read through the entire thread carefully, but imo this is very
important. If we add a fence interface which allows userspace to block
dma, this is a no-go. The only thing we need is to sync up with all
outstanding dma operations and flush caches for cpu access. If broken
userspace starts to issue new dma (or multiple threads stomp onto one
another), that's not a problem dma fences/syncpoints should try to
solve. This way we can concentrate on solving the (already
challenging) device-to-device sync issues without the additional
complexity which cpu->cpu sync would impose.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
       [not found]                           ` <CAAQKjZP37koEPob6yqpn-WxxTh3+O=twyvRzDiEhVJTD8BxQzw@mail.gmail.com>
@ 2013-05-20 21:13                             ` Daniel Vetter
  2013-05-20 21:30                               ` Daniel Vetter
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Vetter @ 2013-05-20 21:13 UTC (permalink / raw)
  To: Inki Dae
  Cc: Rob Clark, linux-fbdev, DRI mailing list, Kyungmin Park,
	myungjoo.ham, YoungJun Cho, linux-arm-kernel, linux-media

On Sat, May 18, 2013 at 03:47:43PM +0900, Inki Dae wrote:
> 2013/5/15 Rob Clark <robdclark@gmail.com>
> 
> > On Wed, May 15, 2013 at 1:19 AM, Inki Dae <inki.dae@samsung.com> wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Rob Clark [mailto:robdclark@gmail.com]
> > >> Sent: Tuesday, May 14, 2013 10:39 PM
> > >> To: Inki Dae
> > >> Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
> > >> Cho; linux-arm-kernel@lists.infradead.org; linux-media@vger.kernel.org
> > >> Subject: Re: Introduce a new helper framework for buffer synchronization
> > >>
> > >> On Mon, May 13, 2013 at 10:52 PM, Inki Dae <inki.dae@samsung.com>
> > wrote:
> > >> >> well, for cache management, I think it is a better idea.. I didn't
> > >> >> really catch that this was the motivation from the initial patch, but
> > >> >> maybe I read it too quickly.  But cache can be decoupled from
> > >> >> synchronization, because CPU access is not asynchronous.  For
> > >> >> userspace/CPU access to buffer, you should:
> > >> >>
> > >> >>   1) wait for buffer
> > >> >>   2) prepare-access
> > >> >>   3)  ... do whatever cpu access to buffer ...
> > >> >>   4) finish-access
> > >> >>   5) submit buffer for new dma-operation
> > >> >>
> > >> >
> > >> >
> > >> > For data flow from CPU to DMA device,
> > >> > 1) wait for buffer
> > >> > 2) prepare-access (dma_buf_begin_cpu_access)
> > >> > 3) cpu access to buffer
> > >> >
> > >> >
> > >> > For data flow from DMA device to CPU
> > >> > 1) wait for buffer
> > >>
> > >> Right, but CPU access isn't asynchronous (from the point of view of
> > >> the CPU), so there isn't really any wait step at this point.  And if
> > >> you do want the CPU to be able to signal a fence from userspace for
> > >> some reason, you probably what something file/fd based so the
> > >> refcnting/cleanup when process dies doesn't leave some pending DMA
> > >> action wedged.  But I don't really see the point of that complexity
> > >> when the CPU access isn't asynchronous in the first place.
> > >>
> > >
> > > There was my missing comments, please see the below sequence.
> > >
> > > For data flow from CPU to DMA device and then from DMA device to CPU,
> > > 1) wait for buffer <- at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
> > >         - including prepare-access (dma_buf_begin_cpu_access)
> > > 2) cpu access to buffer
> > > 3) wait for buffer <- at device driver
> > >         - but CPU is already accessing the buffer so blocked.
> > > 4) signal <- at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
> > > 5) the thread, blocked at 3), is woken up by 4).
> > >         - and then finish-access (dma_buf_end_cpu_access)
> >
> > right, I understand you can have background threads, etc, in
> > userspace.  But there are already plenty of synchronization primitives
> > that can be used for cpu->cpu synchronization, either within the same
> > process or between multiple processes.  For cpu access, even if it is
> > handled by background threads/processes, I think it is better to use
> > the traditional pthreads or unix synchronization primitives.  They
> > have existed forever, they are well tested, and they work.
> >
> > So while it seems nice and orthogonal/clean to couple cache and
> > synchronization and handle dma->cpu and cpu->cpu and cpu->dma in the
> > same generic way, but I think in practice we have to make things more
> > complex than they otherwise need to be to do this.  Otherwise I think
> > we'll be having problems with badly behaved or crashing userspace.
> >
> >
> Right, this is very important. I think it's not easy to solve this
> issue. And at least for ARM based embedded systems, such a feature is useful
> to us; coupling cache operation and buffer synchronization. I'd like to
> collect other opinions and advices to solve this issue.

Maybe we have a bit of a misunderstanding here. The kernel really should take
care of sync and cache coherency, and I agree that with the current
dma_buf code (and the proposed fences) that part is still a bit hazy. But
the kernel should not allow userspace to block access to a buffer until
userspace is done. It should only sync with any outstanding fences and
flush buffers before that userspace access happens.

Then the kernel would again flush caches on the next dma access (which
hopefully is only started once userspace completed access). Atm this isn't
possible in an efficient way since the dma_buf api only exposes map/unmap
attachment and not a function to just sync an existing mapping like
dma_sync_single_for_device. I guess we should add a
dma_buf_sync_attachment interface for that - we already have range-based
cpu sync support with the begin/end_cpu_access interfaces.
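
To make the intended flow a bit more concrete, below is only a rough sketch:
dma_buf_sync_attachment() is the interface proposed above and does not exist
yet (its name and signature here are assumptions), and start_device_dma() is
a stand-in for the driver-specific submission path.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

/* Sketch only: dma_buf_sync_attachment() is the proposed interface and
 * does not exist; start_device_dma() stands in for driver code. */
static int cpu_fill_then_dma(struct dma_buf *buf,
			     struct dma_buf_attachment *attach,
			     struct sg_table *sgt, size_t size)
{
	void *vaddr;
	int ret;

	/* syncs with outstanding fences and flushes caches for cpu access */
	ret = dma_buf_begin_cpu_access(buf, 0, size, DMA_TO_DEVICE);
	if (ret)
		return ret;

	vaddr = dma_buf_vmap(buf);
	if (!vaddr) {
		dma_buf_end_cpu_access(buf, 0, size, DMA_TO_DEVICE);
		return -ENOMEM;
	}

	memset(vaddr, 0, size);			/* the actual cpu access */
	dma_buf_vunmap(buf, vaddr);
	dma_buf_end_cpu_access(buf, 0, size, DMA_TO_DEVICE);

	/* proposed: flush cpu caches for the next dma op without a full
	 * unmap/map_attachment cycle */
	dma_buf_sync_attachment(attach, sgt, DMA_TO_DEVICE);

	return start_device_dma(sgt);
}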

Yours, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-20 21:13                             ` Daniel Vetter
@ 2013-05-20 21:30                               ` Daniel Vetter
  2013-05-21  7:03                                 ` Inki Dae
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Vetter @ 2013-05-20 21:30 UTC (permalink / raw)
  To: Inki Dae
  Cc: Rob Clark, linux-fbdev, DRI mailing list, Kyungmin Park,
	myungjoo.ham, YoungJun Cho, linux-arm-kernel, linux-media

On Mon, May 20, 2013 at 11:13:04PM +0200, Daniel Vetter wrote:
> On Sat, May 18, 2013 at 03:47:43PM +0900, Inki Dae wrote:
> > 2013/5/15 Rob Clark <robdclark@gmail.com>
> > 
> > > On Wed, May 15, 2013 at 1:19 AM, Inki Dae <inki.dae@samsung.com> wrote:
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Rob Clark [mailto:robdclark@gmail.com]
> > > >> Sent: Tuesday, May 14, 2013 10:39 PM
> > > >> To: Inki Dae
> > > >> Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
> > > >> Cho; linux-arm-kernel@lists.infradead.org; linux-media@vger.kernel.org
> > > >> Subject: Re: Introduce a new helper framework for buffer synchronization
> > > >>
> > > >> On Mon, May 13, 2013 at 10:52 PM, Inki Dae <inki.dae@samsung.com>
> > > wrote:
> > > >> >> well, for cache management, I think it is a better idea.. I didn't
> > > >> >> really catch that this was the motivation from the initial patch, but
> > > >> >> maybe I read it too quickly.  But cache can be decoupled from
> > > >> >> synchronization, because CPU access is not asynchronous.  For
> > > >> >> userspace/CPU access to buffer, you should:
> > > >> >>
> > > >> >>   1) wait for buffer
> > > >> >>   2) prepare-access
> > > >> >>   3)  ... do whatever cpu access to buffer ...
> > > >> >>   4) finish-access
> > > >> >>   5) submit buffer for new dma-operation
> > > >> >>
> > > >> >
> > > >> >
> > > >> > For data flow from CPU to DMA device,
> > > >> > 1) wait for buffer
> > > >> > 2) prepare-access (dma_buf_begin_cpu_access)
> > > >> > 3) cpu access to buffer
> > > >> >
> > > >> >
> > > >> > For data flow from DMA device to CPU
> > > >> > 1) wait for buffer
> > > >>
> > > >> Right, but CPU access isn't asynchronous (from the point of view of
> > > >> the CPU), so there isn't really any wait step at this point.  And if
> > > >> you do want the CPU to be able to signal a fence from userspace for
> > > >> some reason, you probably want something file/fd based so the
> > > >> refcnting/cleanup when process dies doesn't leave some pending DMA
> > > >> action wedged.  But I don't really see the point of that complexity
> > > >> when the CPU access isn't asynchronous in the first place.
> > > >>
> > > >
> > > > There was my missing comments, please see the below sequence.
> > > >
> > > > For data flow from CPU to DMA device and then from DMA device to CPU,
> > > > 1) wait for buffer <- at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
> > > >         - including prepare-access (dma_buf_begin_cpu_access)
> > > > 2) cpu access to buffer
> > > > 3) wait for buffer <- at device driver
> > > >         - but CPU is already accessing the buffer so blocked.
> > > > 4) signal <- at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
> > > > 5) the thread, blocked at 3), is woken up by 4).
> > > >         - and then finish-access (dma_buf_end_cpu_access)
> > >
> > > right, I understand you can have background threads, etc, in
> > > userspace.  But there are already plenty of synchronization primitives
> > > that can be used for cpu->cpu synchronization, either within the same
> > > process or between multiple processes.  For cpu access, even if it is
> > > handled by background threads/processes, I think it is better to use
> > > the traditional pthreads or unix synchronization primitives.  They
> > > have existed forever, they are well tested, and they work.
> > >
> > > So while it seems nice and orthogonal/clean to couple cache and
> > > synchronization and handle dma->cpu and cpu->cpu and cpu->dma in the
> > > same generic way, but I think in practice we have to make things more
> > > complex than they otherwise need to be to do this.  Otherwise I think
> > > we'll be having problems with badly behaved or crashing userspace.
> > >
> > >
> > Right, this is very important. I think it's not easy to solve this
> > issue. And at least for ARM based embedded systems, such a feature is useful
> > to us; coupling cache operation and buffer synchronization. I'd like to
> > collect other opinions and advices to solve this issue.
> 
> Maybe we have a bit a misunderstanding here. The kernel really should take
> care of sync and cache coherency, and I agree that with the current
> dma_buf code (and the proposed fences) that part is still a bit hazy. But
> the kernel should not allow userspace to block access to a buffer until
> userspace is done. It should only sync with any outstanding fences and
> flush buffers before that userspace access happens.
> 
> Then the kernel would again flush caches on the next dma access (which
> hopefully is only started once userspace completed access). Atm this isn't
> possible in an efficient way since the dma_buf api only exposes map/unmap
> attachment and not a function to just sync an existing mapping like
> dma_sync_single_for_device. I guess we should add a
> dma_buf_sync_attachment interface for that - we already have range-based
> cpu sync support with the begin/end_cpu_access interfaces.

I think my mail here was a bit unclear, so let me try to rephrase.
Re-reading through part of this discussion I think you raise some good
shortcomings of the current dma_buf interfaces and the proposed fence
patches. But I think we should address the individual pieces bit-by-bit.
On a quick survey I see a few parts to what you're trying to solve:

- More efficient cache coherency management. I think we already have all
  required interfaces for efficient cpu-side access with
  begin/end_cpu_access. So I think we only need a device-side dma sync
  mechanism to be able to flush cpu caches after userspace/cpu access has
  completed (before the next dma op).

- More efficient mmap handling. The current dma_buf mmap support is
  explicitly designed as something simple, but probably dead-slow, for
  last-resort fallback ops. I'm not sure whether we should fix this up and
  extend with special ioctls. But the current coherency interface should
  be good enough for you to write an efficient private mmap implementation
  for exynos drm.

- Integration of fence syncing into dma_buf. Imo we should have a
  per-attachment mode which decides whether map/unmap (and the new sync)
  should wait for fences or whether the driver takes care of syncing
  through the new fence framework itself (for direct hw sync). Imo cpu
  access should also have such a flag. I guess in both cases we should
  sync by default.

- cpu/cpu sync additions to dma_buf. Like I've said, I'm not sold at all
  on this idea. I think it would be best if we try to fix up all the other
  current shortcomings first though and then take a fresh look at this
  issue here.

Have I missed or misunderstood something?

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-20 21:30                               ` Daniel Vetter
@ 2013-05-21  7:03                                 ` Inki Dae
  2013-05-21  7:44                                   ` Daniel Vetter
  0 siblings, 1 reply; 36+ messages in thread
From: Inki Dae @ 2013-05-21  7:03 UTC (permalink / raw)
  To: 'Daniel Vetter'
  Cc: 'Rob Clark', 'linux-fbdev',
	'DRI mailing list', 'Kyungmin Park',
	'myungjoo.ham', 'YoungJun Cho',
	linux-arm-kernel, linux-media



> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Tuesday, May 21, 2013 6:31 AM
> To: Inki Dae
> Cc: Rob Clark; linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham;
> YoungJun Cho; linux-arm-kernel@lists.infradead.org; linux-
> media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Mon, May 20, 2013 at 11:13:04PM +0200, Daniel Vetter wrote:
> > On Sat, May 18, 2013 at 03:47:43PM +0900, Inki Dae wrote:
> > > 2013/5/15 Rob Clark <robdclark@gmail.com>
> > >
> > > > On Wed, May 15, 2013 at 1:19 AM, Inki Dae <inki.dae@samsung.com>
> wrote:
> > > > >
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Rob Clark [mailto:robdclark@gmail.com]
> > > > >> Sent: Tuesday, May 14, 2013 10:39 PM
> > > > >> To: Inki Dae
> > > > >> Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham;
> YoungJun
> > > > >> Cho; linux-arm-kernel@lists.infradead.org; linux-
> media@vger.kernel.org
> > > > >> Subject: Re: Introduce a new helper framework for buffer
> synchronization
> > > > >>
> > > > >> On Mon, May 13, 2013 at 10:52 PM, Inki Dae <inki.dae@samsung.com>
> > > > wrote:
> > > > >> >> well, for cache management, I think it is a better idea.. I
> didn't
> > > > >> >> really catch that this was the motivation from the initial
> patch, but
> > > > >> >> maybe I read it too quickly.  But cache can be decoupled from
> > > > >> >> synchronization, because CPU access is not asynchronous.  For
> > > > >> >> userspace/CPU access to buffer, you should:
> > > > >> >>
> > > > >> >>   1) wait for buffer
> > > > >> >>   2) prepare-access
> > > > >> >>   3)  ... do whatever cpu access to buffer ...
> > > > >> >>   4) finish-access
> > > > >> >>   5) submit buffer for new dma-operation
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >> > For data flow from CPU to DMA device,
> > > > >> > 1) wait for buffer
> > > > >> > 2) prepare-access (dma_buf_begin_cpu_access)
> > > > >> > 3) cpu access to buffer
> > > > >> >
> > > > >> >
> > > > >> > For data flow from DMA device to CPU
> > > > >> > 1) wait for buffer
> > > > >>
> > > > >> Right, but CPU access isn't asynchronous (from the point of view
> of
> > > > >> the CPU), so there isn't really any wait step at this point.  And
> if
> > > > >> you do want the CPU to be able to signal a fence from userspace
> for
> > > > >> some reason, you probably want something file/fd based so the
> > > > >> refcnting/cleanup when process dies doesn't leave some pending
> DMA
> > > > >> action wedged.  But I don't really see the point of that
> complexity
> > > > >> when the CPU access isn't asynchronous in the first place.
> > > > >>
> > > > >
> > > > > There was my missing comments, please see the below sequence.
> > > > >
> > > > > For data flow from CPU to DMA device and then from DMA device to
> CPU,
> > > > > 1) wait for buffer <- at user side - ioctl(fd,
> DMA_BUF_GET_FENCE, ...)
> > > > >         - including prepare-access (dma_buf_begin_cpu_access)
> > > > > 2) cpu access to buffer
> > > > > 3) wait for buffer <- at device driver
> > > > >         - but CPU is already accessing the buffer so blocked.
> > > > > 4) signal <- at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
> > > > > 5) the thread, blocked at 3), is woken up by 4).
> > > > >         - and then finish-access (dma_buf_end_cpu_access)
> > > >
> > > > right, I understand you can have background threads, etc, in
> > > > userspace.  But there are already plenty of synchronization
> primitives
> > > > that can be used for cpu->cpu synchronization, either within the
> same
> > > > process or between multiple processes.  For cpu access, even if it
> is
> > > > handled by background threads/processes, I think it is better to use
> > > > the traditional pthreads or unix synchronization primitives.  They
> > > > have existed forever, they are well tested, and they work.
> > > >
> > > > So while it seems nice and orthogonal/clean to couple cache and
> > > > synchronization and handle dma->cpu and cpu->cpu and cpu->dma in the
> > > > same generic way, but I think in practice we have to make things
> more
> > > > complex than they otherwise need to be to do this.  Otherwise I
> think
> > > > we'll be having problems with badly behaved or crashing userspace.
> > > >
> > > >
> > > Right, this is very important. I think it's not easy to solve this
> > > issue. And at least for ARM based embedded systems, such a feature is
> useful
> > > to us; coupling cache operation and buffer synchronization. I'd like
> to
> > > collect other opinions and advices to solve this issue.
> >
> > Maybe we have a bit a misunderstanding here. The kernel really should
> take
> > care of sync and cache coherency, and I agree that with the current
> > dma_buf code (and the proposed fences) that part is still a bit hazy.
> But
> > the kernel should not allow userspace to block access to a buffer until
> > userspace is done. It should only sync with any outstanding fences and
> > flush buffers before that userspace access happens.
> >
> > Then the kernel would again flush caches on the next dma access (which
> > hopefully is only started once userspace completed access). Atm this
> isn't
> > possible in an efficient way since the dma_buf api only exposes
> map/unmap
> > attachment and not a function to just sync an existing mapping like
> > dma_sync_single_for_device. I guess we should add a
> > dma_buf_sync_attachment interface for that - we already have range-based
> > cpu sync support with the begin/end_cpu_access interfaces.
> 
> I think my mail here was a bit unclear, so let me try to rephrase.
> Re-reading through part of this discussion I think you raise some good
> shortcomings of the current dma_buf interfaces and the proposed fence
> patches. But I think we should address the individual pieces bit-by-bit.
> On a quick survey I see a few parts to what you're trying to solve:
> 
> - More efficient cache coherency management. I think we already have all
>   required interfaces for efficient cpu-side access with
>   begin/end_cpu_access. So I think we only need a device-side dma sync
>   mechanism to be able to flush cpu caches after userspace/cpu access has
>   completed (before the next dma op).
> 
> - More efficient mmap handling. The current dma_buf mmap support is
>   explicitly designed as something simple, but probably dead-slow, for
>   last-resort fallback ops. I'm not sure whether we should fix this up and
>   extend with special ioctls. But the current coherency interface should
>   be good enough for you to write an efficient private mmap implementation
>   for exynos drm.

I agree with you.

> 
> - Integration of fence syncing into dma_buf. Imo we should have a
>   per-attachment mode which decides whether map/unmap (and the new sync)
>   should wait for fences or whether the driver takes care of syncing
>   through the new fence framework itself (for direct hw sync).

I think it's a good idea to have a per-attachment mode for buffer sync. But
first I'd like to make sure what the purpose of map
(dma_buf_map_attachment) is. This interface is used to get an sgt describing
the pages of the physical memory region, and to map that region through the
device's iommu table. The important thing is that this interface is called
just once, when the user wants to share an allocated buffer with dma. But the
cpu will access the buffer again and again, whenever it wants. Therefore, we
need cache control every time cpu and dma access the shared buffer: cache
clean when the buffer goes from cpu to dma, and cache invalidate when it goes
from dma to cpu. That is why I added the new interfaces, DMA_BUF_GET_FENCE
and DMA_BUF_PUT_FENCE, to the dma-buf framework. Of course, those are ugly,
so we need a better way: I just wanted to show you that in such a way we can
synchronize the shared buffer between cpu and dma. By any chance, is there
any way the kernel can be aware of when the cpu accesses the shared buffer,
or is there a point I am missing?

>   Imo cpu access should also have such a flag. I guess in both cases we
should
>   sync by default.
> 
> - cpu/cpu sync additions to dma_buf. Like I've said, I'm not sold at all

I think we should concentrate on cpu and dma sync. 

>   on this idea. I think it would be best if we try to fix up all the other
>   current shortcomings first though and then take a fresh look at this
>   issue here.

Right, agree.

> 
> Have I missed or misunderstood something?
> 

Your comments are very useful to me. Thanks again.

In the Linux kernel, especially for embedded systems, we have tried to
implement generic interfaces for buffer management: how to allocate a buffer
and how to share a buffer. As a result, we now have CMA (Contiguous Memory
Allocator) for buffer allocation, and we have DMA-BUF for buffer sharing
between cpu and dma. And then how do we synchronize a buffer between cpu and
dma? I think now is the best time to discuss a buffer synchronization
mechanism, and that is very natural.

Please give me more comments and advice if I am missing or misunderstanding
anything.

Thanks,
Inki Dae

> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-21  7:03                                 ` Inki Dae
@ 2013-05-21  7:44                                   ` Daniel Vetter
  2013-05-21  9:22                                     ` Inki Dae
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Vetter @ 2013-05-21  7:44 UTC (permalink / raw)
  To: Inki Dae
  Cc: 'Daniel Vetter', 'Rob Clark',
	'linux-fbdev', 'DRI mailing list',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho',
	linux-arm-kernel, linux-media

On Tue, May 21, 2013 at 04:03:06PM +0900, Inki Dae wrote:
> > - Integration of fence syncing into dma_buf. Imo we should have a
> >   per-attachment mode which decides whether map/unmap (and the new sync)
> >   should wait for fences or whether the driver takes care of syncing
> >   through the new fence framework itself (for direct hw sync).
> 
> I think it's a good idea to have per-attachment mode for buffer sync. But
> I'd like to say we make sure what is the purpose of map
> (dma_buf_map_attachment)first. This interface is used to get a sgt;
> containing pages to physical memory region, and map that region with
> device's iommu table. The important thing is that this interface is called
> just one time when user wants to share an allocated buffer with dma. But cpu
> will try to access the buffer every time as long as it wants. Therefore, we
> need cache control every time cpu and dma access the shared buffer: cache
> clean when cpu goes to dma and cache invalidate when dma goes to cpu. That
> is why I added new interfaces, DMA_BUF_GET_FENCE and DMA_BUF_PUT_FENCE, to
> dma buf framework. Of course, Those are ugly so we need a better way: I just
> wanted to show you that in such way, we can synchronize the shared buffer
> between cpu and dma. By any chance, is there any way that kernel can be
> aware of when cpu accesses the shared buffer or is there any point I didn't
> catch up?

Well dma_buf_map/unmap_attachment should also flush/invalidate any caches,
and with the current dma_buf spec those two functions are the _only_ means
you have to do so. Which strictly means that if you interleave device dma
and cpu access you need to unmap/map every time.

Which is obviously horribly inefficient, but a known gap in the dma_buf
interface. Hence why I proposed to add dma_buf_sync_attachment similar to
dma_sync_single_for_device:

/**
 * dma_buf_sync_sg_attachment - sync caches for dma access
 * @attach: dma-buf attachment to sync
 * @sgt: the sg table to sync (returned by dma_buf_map_attachment)
 * @direction: dma direction to sync for
 *
 * This function synchronizes caches for device dma through the given
 * dma-buf attachment when interleaving dma from different devices and the
 * cpu. Other device dma needs to be synced also by calls to this
 * function (or a pair of dma_buf_map/unmap_attachment calls), cpu access
 * needs to be synced with dma_buf_begin/end_cpu_access.
 */
void dma_buf_sync_sg_attachment(struct dma_buf_attachment *attach,
				struct sg_table *sgt,
				enum dma_data_direction direction)

Note that "sync" here only means to synchronize caches, not wait for any
outstanding fences. This is simply to be consistent with the established
lingo of the dma api. How the dma-buf fences fit into this is imo a
different topic, but my idea is that all the cache coherency barriers
(i.e. dma_buf_map/unmap_attachment, dma_buf_sync_sg_attachment and
dma_buf_begin/end_cpu_access) would automatically block for any attached
fences (i.e. block for write fences when doing read-only access, block for
all fences otherwise).
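
As a purely hypothetical usage sketch of the above, where dma from two
devices is interleaved with cpu access: only dma_buf_map_attachment() and
begin/end_cpu_access() exist today, dma_buf_sync_sg_attachment() is just the
proposal from this mail, and run_capture_dma()/check_frame()/run_scanout_dma()
are stand-ins for driver and consumer code.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>

static void interleaved_access(struct dma_buf *dmabuf,
			       struct dma_buf_attachment *attach,
			       size_t size)
{
	/* done once, at import time */
	struct sg_table *sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);

	if (IS_ERR_OR_NULL(sgt))
		return;

	/* device writes the buffer */
	dma_buf_sync_sg_attachment(attach, sgt, DMA_FROM_DEVICE);
	run_capture_dma(sgt);

	/* cpu reads it back; would block on the write fence, then invalidate */
	dma_buf_begin_cpu_access(dmabuf, 0, size, DMA_FROM_DEVICE);
	check_frame(dmabuf);
	dma_buf_end_cpu_access(dmabuf, 0, size, DMA_FROM_DEVICE);

	/* another device reads it; cleans caches again, no unmap/map needed */
	dma_buf_sync_sg_attachment(attach, sgt, DMA_TO_DEVICE);
	run_scanout_dma(sgt);
}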

Then we could do a new dma_buf_attach_flags interface for special cases
(might also be useful for other things, similar to the recently added
flags in the dma api for wc/no-kernel-mapping/...). A new flag like
DMA_BUF_ATTACH_NO_AUTOFENCE would then mean that the driver takes care of all
fencing for this attachment and the dma-buf functions should not do the
automagic fence blocking.

> In Linux kernel, especially embedded system, we had tried to implement
> generic interfaces for buffer management; how to allocate a buffer and how
> to share a buffer. As a result, now we have CMA (Contiguous Memory
> Allocator) for buffer allocation, and we have DMA-BUF for buffer sharing
> between cpu and dma. And then how to synchronize a buffer between cpu and
> dma? I think now it's best time to discuss buffer synchronization mechanism,
> and that is very natural.

I think it's important to differentiate between the two meanings of sync:
- synchronize caches (i.e. cpu cache flushing in most cases)
- and synchronize in-flight dma with fences.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-21  7:44                                   ` Daniel Vetter
@ 2013-05-21  9:22                                     ` Inki Dae
  2013-05-23 11:55                                       ` Daniel Vetter
  0 siblings, 1 reply; 36+ messages in thread
From: Inki Dae @ 2013-05-21  9:22 UTC (permalink / raw)
  To: 'Daniel Vetter'
  Cc: 'Rob Clark', 'linux-fbdev',
	'DRI mailing list', 'Kyungmin Park',
	'myungjoo.ham', 'YoungJun Cho',
	linux-arm-kernel, linux-media



> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Tuesday, May 21, 2013 4:45 PM
> To: Inki Dae
> Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'DRI mailing list';
> 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Tue, May 21, 2013 at 04:03:06PM +0900, Inki Dae wrote:
> > > - Integration of fence syncing into dma_buf. Imo we should have a
> > >   per-attachment mode which decides whether map/unmap (and the new
> sync)
> > >   should wait for fences or whether the driver takes care of syncing
> > >   through the new fence framework itself (for direct hw sync).
> >
> > I think it's a good idea to have per-attachment mode for buffer sync.
> But
> > I'd like to say we make sure what is the purpose of map
> > (dma_buf_map_attachment)first. This interface is used to get a sgt;
> > containing pages to physical memory region, and map that region with
> > device's iommu table. The important thing is that this interface is
> called
> > just one time when user wants to share an allocated buffer with dma. But
> cpu
> > will try to access the buffer every time as long as it wants. Therefore,
> we
> > need cache control every time cpu and dma access the shared buffer:
> cache
> > clean when cpu goes to dma and cache invalidate when dma goes to cpu.
> That
> > is why I added new interfaces, DMA_BUF_GET_FENCE and DMA_BUF_PUT_FENCE,
> to
> > dma buf framework. Of course, Those are ugly so we need a better way: I
> just
> > wanted to show you that in such way, we can synchronize the shared
> buffer
> > between cpu and dma. By any chance, is there any way that kernel can be
> > aware of when cpu accesses the shared buffer or is there any point I
> didn't
> > catch up?
> 
> Well dma_buf_map/unmap_attachment should also flush/invalidate any caches,
> and with the current dma_buf spec those two functions are the _only_ means

I know that a dma-buf exporter should make sure that cache clean/invalidate
are done when dma_buf_map/unmap_attachment is called, and we already do so.
However, this function is called when dma-buf import is requested by the
user to map a dmabuf fd for dma. This means that dma_buf_map_attachment()
is called ONCE, when the user wants to share the dmabuf fd with dma.
Actually, in the case of exynos drm, dma_map_sg_attrs(), which performs the
cache operations internally, is called when dmabuf import is requested by
the user.
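
For reference, an exporter's map_dma_buf callback roughly looks like the
sketch below. This is a simplified illustration, not the actual exynos code;
clone_sg_table() and struct example_buffer are placeholders, and
dma_map_sg() (or dma_map_sg_attrs()) is what performs the cpu cache
maintenance here, once at import time.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static struct sg_table *example_map_dma_buf(struct dma_buf_attachment *attach,
					    enum dma_data_direction dir)
{
	struct example_buffer *buf = attach->dmabuf->priv;	/* placeholder */
	struct sg_table *sgt;

	sgt = clone_sg_table(buf->sgt);				/* placeholder */
	if (!sgt)
		return ERR_PTR(-ENOMEM);

	/* flushes/invalidates cpu caches as needed for @dir */
	if (!dma_map_sg(attach->dev, sgt->sgl, sgt->orig_nents, dir)) {
		sg_free_table(sgt);
		kfree(sgt);
		return ERR_PTR(-EIO);
	}

	return sgt;
}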

> you have to do so. Which strictly means that if you interleave device dma
> and cpu access you need to unmap/map every time.
> 
> Which is obviously horribly inefficient, but a known gap in the dma_buf

Right, and also this has big overhead.

> interface. Hence why I proposed to add dma_buf_sync_attachment similar to
> dma_sync_single_for_device:
> 
> /**
>  * dma_buf_sync_sg_attachment - sync caches for dma access
>  * @attach: dma-buf attachment to sync
>  * @sgt: the sg table to sync (returned by dma_buf_map_attachement)
>  * @direction: dma direction to sync for
>  *
>  * This function synchronizes caches for device dma through the given
>  * dma-buf attachment when interleaving dma from different devices and the
>  * cpu. Other device dma needs to be synced also by calls to this
>  * function (or a pair of dma_buf_map/unmap_attachment calls), cpu access
>  * needs to be synced with dma_buf_begin/end_cpu_access.
>  */
> void dma_buf_sync_sg_attachment(struct dma_buf_attachment *attach,
> 				struct sg_table *sgt,
> 				enum dma_data_direction direction)
> 
> Note that "sync" here only means to synchronize caches, not wait for any
> outstanding fences. This is simply to be consistent with the established
> lingo of the dma api. How the dma-buf fences fit into this is imo a
> different topic, but my idea is that all the cache coherency barriers
> (i.e. dma_buf_map/unmap_attachment, dma_buf_sync_sg_attachment and
> dma_buf_begin/end_cpu_access) would automatically block for any attached
> fences (i.e. block for write fences when doing read-only access, block for
> all fences otherwise).

As I mentioned already, the kernel can't be aware of when the cpu accesses a
shared buffer: the user can access a shared buffer anytime after mmap, and
the shared buffer has to be synchronized between cpu and dma. Therefore, the
above cache coherency barriers should be called every time cpu or dma tries
to access a shared buffer, bracketing the cpu and dma access. And that is
exactly what the proposed way does. For this, you can refer to the
below link,
	
http://www.mail-archive.com/linux-media@vger.kernel.org/msg62124.html

My point is: how can the kernel be aware of when those cache coherency
barriers should be called, to synchronize caches and buffer access between
cpu and dma?

> 
> Then we could do a new dma_buf_attach_flags interface for special cases
> (might also be useful for other things, similar to the recently added
> flags in the dma api for wc/no-kernel-mapping/...). A new flag like
> DMA_BUF_ATTACH_NO_AUTOFENCE would then mean that the driver takes care of
> all
> fencing for this attachment and the dma-buf functions should not do the
> automagic fence blocking.
> 
> > In Linux kernel, especially embedded system, we had tried to implement
> > generic interfaces for buffer management; how to allocate a buffer and
> how
> > to share a buffer. As a result, now we have CMA (Contiguous Memory
> > Allocator) for buffer allocation, and we have DMA-BUF for buffer sharing
> > between cpu and dma. And then how to synchronize a buffer between cpu
> and
> > dma? I think now it's best time to discuss buffer synchronization
> mechanism,
> > and that is very natural.
> 
> I think it's important to differentiate between the two meanings of sync:
> - synchronize caches (i.e. cpu cache flushing in most cases)
> - and synchronize in-flight dma with fences.
> 

Agree, and I meant that the buffer synchronization mechanism includes the
above two things. And I think the two things should be addressed together.

Thanks,
Inki Dae

> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-21  9:22                                     ` Inki Dae
@ 2013-05-23 11:55                                       ` Daniel Vetter
  2013-05-23 13:37                                         ` Inki Dae
  2013-06-07  9:02                                         ` Introduce a dmabuf sync " Inki Dae
  0 siblings, 2 replies; 36+ messages in thread
From: Daniel Vetter @ 2013-05-23 11:55 UTC (permalink / raw)
  To: Inki Dae
  Cc: Rob Clark, linux-fbdev, DRI mailing list, Kyungmin Park,
	myungjoo.ham, YoungJun Cho, linux-arm-kernel, linux-media

On Tue, May 21, 2013 at 11:22 AM, Inki Dae <inki.dae@samsung.com> wrote:
>> -----Original Message-----
>> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
>> Vetter
>> Sent: Tuesday, May 21, 2013 4:45 PM
>> To: Inki Dae
>> Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'DRI mailing list';
>> 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
>> kernel@lists.infradead.org; linux-media@vger.kernel.org
>> Subject: Re: Introduce a new helper framework for buffer synchronization
>>
>> On Tue, May 21, 2013 at 04:03:06PM +0900, Inki Dae wrote:
>> > > - Integration of fence syncing into dma_buf. Imo we should have a
>> > >   per-attachment mode which decides whether map/unmap (and the new
>> sync)
>> > >   should wait for fences or whether the driver takes care of syncing
>> > >   through the new fence framework itself (for direct hw sync).
>> >
>> > I think it's a good idea to have per-attachment mode for buffer sync.
>> But
>> > I'd like to say we make sure what is the purpose of map
>> > (dma_buf_map_attachment)first. This interface is used to get a sgt;
>> > containing pages to physical memory region, and map that region with
>> > device's iommu table. The important thing is that this interface is
>> called
>> > just one time when user wants to share an allocated buffer with dma. But
>> cpu
>> > will try to access the buffer every time as long as it wants. Therefore,
>> we
>> > need cache control every time cpu and dma access the shared buffer:
>> cache
>> > clean when cpu goes to dma and cache invalidate when dma goes to cpu.
>> That
>> > is why I added new interfaces, DMA_BUF_GET_FENCE and DMA_BUF_PUT_FENCE,
>> to
>> > dma buf framework. Of course, Those are ugly so we need a better way: I
>> just
>> > wanted to show you that in such way, we can synchronize the shared
>> buffer
>> > between cpu and dma. By any chance, is there any way that kernel can be
>> > aware of when cpu accesses the shared buffer or is there any point I
>> didn't
>> > catch up?
>>
>> Well dma_buf_map/unmap_attachment should also flush/invalidate any caches,
>> and with the current dma_buf spec those two functions are the _only_ means
>
> I know that dma buf exporter should make sure that cache clean/invalidate
> are done when dma_buf_map/unmap_attachement is called. For this, already we
> do so. However, this function is called when dma buf import is requested by
> user to map a dmabuf fd with dma. This means that dma_buf_map_attachement()
> is called ONCE when user wants to share the dmabuf fd with dma. Actually, in
> case of exynos drm, dma_map_sg_attrs(), performing cache operation
> internally, is called when dmabuf import is requested by user.
>
>> you have to do so. Which strictly means that if you interleave device dma
>> and cpu access you need to unmap/map every time.
>>
>> Which is obviously horribly inefficient, but a known gap in the dma_buf
>
> Right, and also this has big overhead.
>
>> interface. Hence why I proposed to add dma_buf_sync_attachment similar to
>> dma_sync_single_for_device:
>>
>> /**
>>  * dma_buf_sync_sg_attachment - sync caches for dma access
>>  * @attach: dma-buf attachment to sync
>>  * @sgt: the sg table to sync (returned by dma_buf_map_attachement)
>>  * @direction: dma direction to sync for
>>  *
>>  * This function synchronizes caches for device dma through the given
>>  * dma-buf attachment when interleaving dma from different devices and the
>>  * cpu. Other device dma needs to be synced also by calls to this
>>  * function (or a pair of dma_buf_map/unmap_attachment calls), cpu access
>>  * needs to be synced with dma_buf_begin/end_cpu_access.
>>  */
>> void dma_buf_sync_sg_attachment(struct dma_buf_attachment *attach,
>>                               struct sg_table *sgt,
>>                               enum dma_data_direction direction)
>>
>> Note that "sync" here only means to synchronize caches, not wait for any
>> outstanding fences. This is simply to be consistent with the established
>> lingo of the dma api. How the dma-buf fences fit into this is imo a
>> different topic, but my idea is that all the cache coherency barriers
>> (i.e. dma_buf_map/unmap_attachment, dma_buf_sync_sg_attachment and
>> dma_buf_begin/end_cpu_access) would automatically block for any attached
>> fences (i.e. block for write fences when doing read-only access, block for
>> all fences otherwise).
>
> As I mentioned already, kernel can't aware of when cpu accesses a shared
> buffer: user can access a shared buffer after mmap anytime and the shared
> buffer should be synchronized between cpu and dma. Therefore, the above
> cache coherency barriers should be called every time cpu and dma tries to
> access a shared buffer, checking before and after of cpu and dma access. And
> exactly, the proposed way do such thing. For this, you can refer to the
> below link,
>
> http://www.mail-archive.com/linux-media@vger.kernel.org/msg62124.html
>
> My point is that how kernel can aware of when those cache coherency barriers
> should be called to synchronize caches and buffer access between cpu and
> dma.
>
>>
>> Then we could do a new dma_buf_attach_flags interface for special cases
>> (might also be useful for other things, similar to the recently added
>> flags in the dma api for wc/no-kernel-mapping/...). A new flag like
>> DMA_BUF_ATTACH_NO_AUTOFENCE would then mean that the driver takes care of
>> all
>> fencing for this attachment and the dma-buf functions should not do the
>> automagic fence blocking.
>>
>> > In Linux kernel, especially embedded system, we had tried to implement
>> > generic interfaces for buffer management; how to allocate a buffer and
>> how
>> > to share a buffer. As a result, now we have CMA (Contiguous Memory
>> > Allocator) for buffer allocation, and we have DMA-BUF for buffer sharing
>> > between cpu and dma. And then how to synchronize a buffer between cpu
>> and
>> > dma? I think now it's best time to discuss buffer synchronization
>> mechanism,
>> > and that is very natural.
>>
>> I think it's important to differentiate between the two meanings of sync:
>> - synchronize caches (i.e. cpu cache flushing in most cases)
>> - and synchronize in-flight dma with fences.
>>
>
> Agree, and I meant that the buffer synchronization mechanism includes the
> above two things. And I think the two things should be addressed together.

I'm still confused about what you're trying to achieve in the big
picture here exactly. I think the key point is your statement above
that the kernel can't know when the cpu access a dma-buf. That's not
true though:

For userspace mmaps the kernel can shoot down ptes and so prevent
userspace from accessing a buffer. Since that's pretty inefficient
gem/ttm have additional ioctls for cache coherency transitions.
dma_buf currently has no support for explicit coherency transitions
from userspace, so if pte shootdown is an issue we need to improve
that.
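
For example, here is what such a coherency-transition ioctl looks like from
userspace for i915's set_domain interface; error handling is omitted and the
helper name is made up, but the ioctl and structure are the existing ones.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* move a GEM buffer into the cpu domain before mapped cpu access */
static void move_to_cpu_domain(int drm_fd, uint32_t handle, int writing)
{
	struct drm_i915_gem_set_domain sd = {
		.handle = handle,
		.read_domains = I915_GEM_DOMAIN_CPU,
		.write_domain = writing ? I915_GEM_DOMAIN_CPU : 0,
	};

	/* waits for outstanding gpu access and flushes caches as needed */
	ioctl(drm_fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &sd);
}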

If I read your proposal correctly you instead suggest to _always_ flush
cache before/after each dma. That will _completely_ kill performance
(been there with i915, trust me).

For cpu access from kernel space we already have explicit mappings all
over the place (kmap&friends), so I don't see any issues with
inserting the relevant begin/end_cpu_access calls just around the cpu
access. The big exception is the framebuffer layer, but even that has
the deferred io support. KMS otoh has a nice frontbuffer dirty
interface to correctly support dumb clients, unfortunately fbcon does
not (yet) use that.
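
E.g. for a kernel-space consumer the pattern is simply the sketch below,
using the range-based begin/end_cpu_access signatures discussed in this
thread; read_first_page() and its caller are made-up names.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

/* bracket the cpu access with begin/end_cpu_access */
static int read_first_page(struct dma_buf *dmabuf, void *dst)
{
	void *vaddr;
	int ret;

	ret = dma_buf_begin_cpu_access(dmabuf, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	if (ret)
		return ret;

	vaddr = dma_buf_kmap(dmabuf, 0);		/* page 0 */
	if (vaddr) {
		memcpy(dst, vaddr, PAGE_SIZE);		/* the actual cpu access */
		dma_buf_kunmap(dmabuf, 0, vaddr);
	}

	dma_buf_end_cpu_access(dmabuf, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	return vaddr ? 0 : -ENOMEM;
}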

For these reasons I don't see a need to smash cache coherency and
fencing into one problem, and hence why I've proposed to tackle each
of them individually.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-23 11:55                                       ` Daniel Vetter
@ 2013-05-23 13:37                                         ` Inki Dae
  2013-05-27 10:38                                           ` Inki Dae
  2013-06-07  9:02                                         ` Introduce a dmabuf sync " Inki Dae
  1 sibling, 1 reply; 36+ messages in thread
From: Inki Dae @ 2013-05-23 13:37 UTC (permalink / raw)
  To: 'Daniel Vetter'
  Cc: 'Rob Clark', 'linux-fbdev',
	'DRI mailing list', 'Kyungmin Park',
	'myungjoo.ham', 'YoungJun Cho',
	linux-arm-kernel, linux-media

> -----Original Message-----
> From: daniel.vetter@ffwll.ch [mailto:daniel.vetter@ffwll.ch] On Behalf Of
> Daniel Vetter
> Sent: Thursday, May 23, 2013 8:56 PM
> To: Inki Dae
> Cc: Rob Clark; linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham;
> YoungJun Cho; linux-arm-kernel@lists.infradead.org; linux-
> media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Tue, May 21, 2013 at 11:22 AM, Inki Dae <inki.dae@samsung.com> wrote:
> >> -----Original Message-----
> >> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> >> Vetter
> >> Sent: Tuesday, May 21, 2013 4:45 PM
> >> To: Inki Dae
> >> Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'DRI mailing list';
> >> 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
> >> kernel@lists.infradead.org; linux-media@vger.kernel.org
> >> Subject: Re: Introduce a new helper framework for buffer
> synchronization
> >>
> >> On Tue, May 21, 2013 at 04:03:06PM +0900, Inki Dae wrote:
> >> > > - Integration of fence syncing into dma_buf. Imo we should have a
> >> > >   per-attachment mode which decides whether map/unmap (and the new
> >> sync)
> >> > >   should wait for fences or whether the driver takes care of
> syncing
> >> > >   through the new fence framework itself (for direct hw sync).
> >> >
> >> > I think it's a good idea to have per-attachment mode for buffer sync.
> >> But
> >> > I'd like to say we make sure what is the purpose of map
> >> > (dma_buf_map_attachment)first. This interface is used to get a sgt;
> >> > containing pages to physical memory region, and map that region with
> >> > device's iommu table. The important thing is that this interface is
> >> called
> >> > just one time when user wants to share an allocated buffer with dma.
> But
> >> cpu
> >> > will try to access the buffer every time as long as it wants.
> Therefore,
> >> we
> >> > need cache control every time cpu and dma access the shared buffer:
> >> cache
> >> > clean when cpu goes to dma and cache invalidate when dma goes to cpu.
> >> That
> >> > is why I added new interfaces, DMA_BUF_GET_FENCE and
> DMA_BUF_PUT_FENCE,
> >> to
> >> > dma buf framework. Of course, Those are ugly so we need a better way:
> I
> >> just
> >> > wanted to show you that in such way, we can synchronize the shared
> >> buffer
> >> > between cpu and dma. By any chance, is there any way that kernel can
> be
> >> > aware of when cpu accesses the shared buffer or is there any point I
> >> didn't
> >> > catch up?
> >>
> >> Well dma_buf_map/unmap_attachment should also flush/invalidate any
> caches,
> >> and with the current dma_buf spec those two functions are the _only_
> means
> >
> > I know that dma buf exporter should make sure that cache
> clean/invalidate
> > are done when dma_buf_map/unmap_attachement is called. For this, already
> we
> > do so. However, this function is called when dma buf import is requested
> by
> > user to map a dmabuf fd with dma. This means that
> dma_buf_map_attachement()
> > is called ONCE when user wants to share the dmabuf fd with dma.
Actually,
> in
> > case of exynos drm, dma_map_sg_attrs(), performing cache operation
> > internally, is called when dmabuf import is requested by user.
> >
> >> you have to do so. Which strictly means that if you interleave device
> dma
> >> and cpu access you need to unmap/map every time.
> >>
> >> Which is obviously horribly inefficient, but a known gap in the dma_buf
> >
> > Right, and also this has big overhead.
> >
> >> interface. Hence why I proposed to add dma_buf_sync_attachment similar
> to
> >> dma_sync_single_for_device:
> >>
> >> /**
> >>  * dma_buf_sync_sg_attachment - sync caches for dma access
> >>  * @attach: dma-buf attachment to sync
> >>  * @sgt: the sg table to sync (returned by dma_buf_map_attachement)
> >>  * @direction: dma direction to sync for
> >>  *
> >>  * This function synchronizes caches for device dma through the given
> >>  * dma-buf attachment when interleaving dma from different devices and
> the
> >>  * cpu. Other device dma needs to be synced also by calls to this
> >>  * function (or a pair of dma_buf_map/unmap_attachment calls), cpu
> access
> >>  * needs to be synced with dma_buf_begin/end_cpu_access.
> >>  */
> >> void dma_buf_sync_sg_attachment(struct dma_buf_attachment *attach,
> >>                               struct sg_table *sgt,
> >>                               enum dma_data_direction direction)
> >>
> >> Note that "sync" here only means to synchronize caches, not wait for
> any
> >> outstanding fences. This is simply to be consistent with the
> established
> >> lingo of the dma api. How the dma-buf fences fit into this is imo a
> >> different topic, but my idea is that all the cache coherency barriers
> >> (i.e. dma_buf_map/unmap_attachment, dma_buf_sync_sg_attachment and
> >> dma_buf_begin/end_cpu_access) would automatically block for any
> attached
> >> fences (i.e. block for write fences when doing read-only access, block
> for
> >> all fences otherwise).
> >
> > As I mentioned already, kernel can't aware of when cpu accesses a shared
> > buffer: user can access a shared buffer after mmap anytime and the
> shared
> > buffer should be synchronized between cpu and dma. Therefore, the above
> > cache coherency barriers should be called every time cpu and dma tries
> to
> > access a shared buffer, checking before and after of cpu and dma access.
> And
> > exactly, the proposed way do such thing. For this, you can refer to the
> > below link,
> >
> > http://www.mail-archive.com/linux-media@vger.kernel.org/msg62124.html
> >
> > My point is that how kernel can aware of when those cache coherency
> barriers
> > should be called to synchronize caches and buffer access between cpu and
> > dma.
> >
> >>
> >> Then we could do a new dma_buf_attach_flags interface for special cases
> >> (might also be useful for other things, similar to the recently added
> >> flags in the dma api for wc/no-kernel-mapping/...). A new flag like
> >> DMA_BUF_ATTACH_NO_AUTOFENCE would then mean that the driver takes care
> of
> >> all
> >> fencing for this attachment and the dma-buf functions should not do the
> >> automagic fence blocking.
> >>
> >> > In Linux kernel, especially embedded system, we had tried to
> implement
> >> > generic interfaces for buffer management; how to allocate a buffer
> and
> >> how
> >> > to share a buffer. As a result, now we have CMA (Contiguous Memory
> >> > Allocator) for buffer allocation, and we have DMA-BUF for buffer
> sharing
> >> > between cpu and dma. And then how to synchronize a buffer between cpu
> >> and
> >> > dma? I think now it's best time to discuss buffer synchronization
> >> mechanism,
> >> > and that is very natural.
> >>
> >> I think it's important to differentiate between the two meanings of
> sync:
> >> - synchronize caches (i.e. cpu cache flushing in most cases)
> >> - and synchronize in-flight dma with fences.
> >>
> >
> > Agree, and I meant that the buffer synchronization mechanism includes
> the
> > above two things. And I think the two things should be addressed
> together.
> 
> I'm still confused about what you're trying to achive in the big
> picture here exactly. I think the key point is your statement above
> that the kernel can't know when the cpu access a dma-buf. That's not
> true though:
> 
> For userspace mmaps the kernel can shoot down ptes and so prevent
> userspace from accessing a buffer.

In this case, a cpu access to the user space mapping will incur a page
fault. However, the cpu accesses the mmapped region directly in user mode,
whenever the user wants, and the user needs to do something before passing a
buffer it has accessed on to the device driver, and also vice versa.

> Since that's pretty inefficient
> gem/ttm have additional ioctls for cache coherency transitions.

Yes, that's the kind of thing I mean, and that's the way we are using now,
if I understood correctly. The user side calls those ioctls for cache
coherency between cpu and dma: cache clean after cpu access and before dma
access, and cache invalidate after dma access and before cpu access. I
thought it is better to control the caches on the kernel side than in user
space, because I have found some Linux based platforms that overuse cache
operations. So I thought the best place for it is the proposed fence helper
or something similar: we would need interfaces to notify the kernel side of
the moments 'before and after cpu access'. My proposal is to couple those
two things, synchronizing caches and in-flight dma with fences, on the
kernel side. With this approach, the user no longer needs to care when dma
and cpu access starts and when it completes, whenever it wants to access a
buffer and share it with dma. Everything, cache coherency and buffer fencing
between dma and cpu, would be done on the kernel side.

> dma_buf currently has no support for explicit coherency transitions
> from userpsace, so if pte shootdown is an issue we need to improve
> that.
> 
> If I read your proposal correctly you instead suggest to _alway_ flush
> cache before/after each dma. That will _completely_ kill performance
> (been there with i915, trust me).
> 

In the case of a cacheable mapped buffer shared with dma, we need to clean
the cache after cpu access and before dma access, and we need to invalidate
the cache after dma access and before cpu access, every time. Isn't that
what we are doing now? I have little knowledge of the desktop world, so I
may be misunderstanding something.
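
For a cacheable mapping this is essentially the streaming DMA API pattern
sketched below, for a buffer already mapped with dma_map_sg(). This is only
to illustrate the clean/invalidate pairing, not a proposal for where these
calls should live; sync_example() is a made-up name.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static void sync_example(struct device *dev, struct sg_table *sgt)
{
	/* cpu wrote the buffer: clean caches before the device reads it */
	dma_sync_sg_for_device(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE);
	/* ... device dma runs here ... */

	/* device wrote the buffer: invalidate before the cpu reads it */
	dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
	/* ... cpu reads here ... */
}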

Could you please give me advice and more comments if I am missing something
or misunderstanding?

> For cpu access from kernel space we already have explicit mappings all
> over the place (kmap&friends), so I don't see any issues with
> inserting the relevant begin/end_cpu_access calls just around the cpu
> access. The big exception is the framebuffer layer, but even that has
> the deferred io support. KMS otoh has a nice frontbuffer dirty
> interface to correctly support dumb clients, unfortunately fbcon does
> not (yet) use that.
> 
> For these reasons I don't see a need to smash cache coherency and
> fencing into one problem, and hence why I've proposed to tackle each
> of them individually.

I might be really missing some points :)

Thanks,
Inki Dae

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-23 13:37                                         ` Inki Dae
@ 2013-05-27 10:38                                           ` Inki Dae
  2013-05-27 15:23                                             ` Maarten Lankhorst
  2013-05-27 15:47                                             ` Rob Clark
  0 siblings, 2 replies; 36+ messages in thread
From: Inki Dae @ 2013-05-27 10:38 UTC (permalink / raw)
  To: 'Maarten Lankhorst', 'Daniel Vetter',
	'Rob Clark'
  Cc: 'linux-fbdev', 'YoungJun Cho',
	'Kyungmin Park', 'myungjoo.ham',
	'DRI mailing list',
	linux-arm-kernel, linux-media

Hi all,

I have removed the previous branch and added a new one with more cleanup.
This time, the fence helper no longer includes user side interfaces or cache
operation related code, because we are not yet sure that coupling those two
things, synchronizing caches and buffer access between CPU and CPU, CPU and
DMA, and DMA and DMA with fences, on the kernel side is a good idea, and
also because the existing user side code has problems with badly behaved or
crashing userspace. So this could be discussed further later.

The below is a new branch,
	
https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/?h=dma-f
ence-helper

And fence helper codes,
	
https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005

And example codes for device driver,
	
https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae

I think the time is not yet ripe for an RFC posting: maybe the existing dma
fence and reservation code needs more review and additional work. So I'd be
glad for other opinions and advice in advance before the RFC posting.

Thanks,
Inki Dae


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-27 10:38                                           ` Inki Dae
@ 2013-05-27 15:23                                             ` Maarten Lankhorst
  2013-05-28  2:49                                               ` Inki Dae
  2013-05-27 15:47                                             ` Rob Clark
  1 sibling, 1 reply; 36+ messages in thread
From: Maarten Lankhorst @ 2013-05-27 15:23 UTC (permalink / raw)
  To: Inki Dae
  Cc: 'Daniel Vetter', 'Rob Clark',
	'linux-fbdev', 'YoungJun Cho',
	'Kyungmin Park', 'myungjoo.ham',
	'DRI mailing list',
	linux-arm-kernel, linux-media

Hey,

Op 27-05-13 12:38, Inki Dae schreef:
> Hi all,
>
> I have been removed previous branch and added new one with more cleanup.
> This time, the fence helper doesn't include user side interfaces and cache
> operation relevant codes anymore because not only we are not sure that
> coupling those two things, synchronizing caches and buffer access between
> CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is a
> good idea yet but also existing codes for user side have problems with badly
> behaved or crashing userspace. So this could be more discussed later.
>
> The below is a new branch,
> 	
> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/?h=dma-f
> ence-helper
>
> And fence helper codes,
> 	
> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
> h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
>
> And example codes for device driver,
> 	
> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
> h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
>
> I think the time is not yet ripe for RFC posting: maybe existing dma fence
> and reservation need more review and addition work. So I'd glad for somebody
> giving other opinions and advices in advance before RFC posting.
>
NAK.

For examples for how to handle locking properly, see Documentation/ww-mutex-design.txt in my recent tree.
I could list what I believe is wrong with your implementation, but the real problem is that the approach you're taking is wrong.

~Maarten

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-27 10:38                                           ` Inki Dae
  2013-05-27 15:23                                             ` Maarten Lankhorst
@ 2013-05-27 15:47                                             ` Rob Clark
  2013-05-27 16:02                                               ` Daniel Vetter
  2013-05-28  3:56                                               ` Inki Dae
  1 sibling, 2 replies; 36+ messages in thread
From: Rob Clark @ 2013-05-27 15:47 UTC (permalink / raw)
  To: Inki Dae
  Cc: Maarten Lankhorst, Daniel Vetter, linux-fbdev, YoungJun Cho,
	Kyungmin Park, myungjoo.ham, DRI mailing list, linux-arm-kernel,
	linux-media

On Mon, May 27, 2013 at 6:38 AM, Inki Dae <inki.dae@samsung.com> wrote:
> Hi all,
>
> I have been removed previous branch and added new one with more cleanup.
> This time, the fence helper doesn't include user side interfaces and cache
> operation relevant codes anymore because not only we are not sure that
> coupling those two things, synchronizing caches and buffer access between
> CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is a
> good idea yet but also existing codes for user side have problems with badly
> behaved or crashing userspace. So this could be more discussed later.
>
> The below is a new branch,
>
> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/?h=dma-f
> ence-helper
>
> And fence helper codes,
>
> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
> h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
>
> And example codes for device driver,
>
> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
> h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
>
> I think the time is not yet ripe for RFC posting: maybe existing dma fence
> and reservation need more review and addition work. So I'd glad for somebody
> giving other opinions and advices in advance before RFC posting.

thoughts from a *really* quick, pre-coffee, first look:
* any sort of helper to simplify single-buffer sort of use-cases (v4l)
probably wouldn't want to bake in the assumption that seqno_fence is used.
* I guess g2d is probably not actually a simple use case, since I
expect you can submit blits involving multiple buffers :-P
* otherwise, you probably don't want to depend on dmabuf, which is why
reservation/fence is split out the way it is..  you want to be able to
use a single reservation/fence mechanism within your driver without
having to care about which buffers are exported to dmabuf's and which
are not.  Creating a dmabuf for every GEM bo is too heavyweight.
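
(To illustrate the embedding point, a rough sketch only; the struct and
field names are made up and not taken from any real driver:)

    struct foo_gem_bo {                        /* hypothetical driver bo      */
            struct drm_gem_object base;
            struct reservation_object resv;    /* locking/fencing lives here  */
            /* ... */
    };

    /* only when the bo actually gets shared do you create a dmabuf, and
     * (with Maarten's patches) the dmabuf then just points at the bo's
     * embedded reservation instead of carrying its own state: */
    static void foo_export(struct dma_buf *dmabuf, struct foo_gem_bo *bo)
    {
            dmabuf->resv = &bo->resv;
    }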

I'm not entirely sure if reservation/fence could/should be made any
simpler for multi-buffer users.  Probably the best thing to do is just
get reservation/fence rolled out in a few drivers and see if some
common patterns emerge.

BR,
-R

>
> Thanks,
> Inki Dae
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-27 15:47                                             ` Rob Clark
@ 2013-05-27 16:02                                               ` Daniel Vetter
  2013-05-28  4:04                                                 ` Inki Dae
  2013-05-28  3:56                                               ` Inki Dae
  1 sibling, 1 reply; 36+ messages in thread
From: Daniel Vetter @ 2013-05-27 16:02 UTC (permalink / raw)
  To: Rob Clark
  Cc: Inki Dae, Maarten Lankhorst, linux-fbdev, YoungJun Cho,
	Kyungmin Park, myungjoo.ham, DRI mailing list, linux-arm-kernel,
	linux-media

On Mon, May 27, 2013 at 5:47 PM, Rob Clark <robdclark@gmail.com> wrote:
> On Mon, May 27, 2013 at 6:38 AM, Inki Dae <inki.dae@samsung.com> wrote:
>> Hi all,
>>
>> I have been removed previous branch and added new one with more cleanup.
>> This time, the fence helper doesn't include user side interfaces and cache
>> operation relevant codes anymore because not only we are not sure that
>> coupling those two things, synchronizing caches and buffer access between
>> CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is a
>> good idea yet but also existing codes for user side have problems with badly
>> behaved or crashing userspace. So this could be more discussed later.
>>
>> The below is a new branch,
>>
>> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/?h=dma-f
>> ence-helper
>>
>> And fence helper codes,
>>
>> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
>> h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
>>
>> And example codes for device driver,
>>
>> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
>> h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
>>
>> I think the time is not yet ripe for RFC posting: maybe existing dma fence
>> and reservation need more review and addition work. So I'd glad for somebody
>> giving other opinions and advices in advance before RFC posting.
>
> thoughts from a *really* quick, pre-coffee, first look:
> * any sort of helper to simplify single-buffer sort of use-cases (v4l)
> probably wouldn't want to bake in assumption that seqno_fence is used.

Yeah, which is why Maarten and I already discussed ideas for what needs to
be improved in the current dma-buf interface code to make this Just
Work. At least as long as a driver doesn't want to add new fences,
which would be especially useful for all kinds of gpu access.

> * I guess g2d is probably not actually a simple use case, since I
> expect you can submit blits involving multiple buffers :-P

Yeah, on a quick read the current fence helper code seems to be a bit
limited in scope.

> * otherwise, you probably don't want to depend on dmabuf, which is why
> reservation/fence is split out the way it is..  you want to be able to
> use a single reservation/fence mechanism within your driver without
> having to care about which buffers are exported to dmabuf's and which
> are not.  Creating a dmabuf for every GEM bo is too heavyweight.

That's pretty much the reason that reservations are free-standing from
dma_bufs. The idea is to embed them into the gem/ttm/v4l buffer
object. Maarten also has some helpers to keep track of multi-buffer
ww_mutex locking and fence attaching in his reservation helpers, but I
think we should wait with those until we have drivers using them.

For now I think the priority should be to get the basic stuff in and
ttm as the first user established. Then we can go nuts later on.

> I'm not entirely sure if reservation/fence could/should be made any
> simpler for multi-buffer users.  Probably the best thing to do is just
> get reservation/fence rolled out in a few drivers and see if some
> common patterns emerge.

I think we can make the 1 buffer per dma op (i.e. 1:1
dma_buf->reservation : fence mapping) work fairly simply in dma_buf
with maybe dma_buf_attachment_start_dma/end_dma helpers. But there's
also still the open issue that currently there's no way to flush cpu caches
for dma access without unmapping the attachment (or resorting to
tricks which might not work), so we have a few gaping holes in the
interface already anyway.
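
Handwaving sketch of what I mean by those helpers. Purely hypothetical,
neither the functions nor the signatures exist anywhere yet:

    /* hypothetical helpers, nothing like this is implemented */
    int dma_buf_attachment_start_dma(struct dma_buf_attachment *attach,
                                     bool write)
    {
            /* wait for (or collect) the fences currently attached to the
             * buffer's reservation object, flush cpu caches if needed */
            return 0;
    }

    void dma_buf_attachment_end_dma(struct dma_buf_attachment *attach,
                                    struct fence *fence)
    {
            /* publish the fence for this dma op on the reservation object
             * so the next importer/exporter knows when we're done */
    }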
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-27 15:23                                             ` Maarten Lankhorst
@ 2013-05-28  2:49                                               ` Inki Dae
  2013-05-28  7:23                                                 ` Maarten Lankhorst
  0 siblings, 1 reply; 36+ messages in thread
From: Inki Dae @ 2013-05-28  2:49 UTC (permalink / raw)
  To: 'Maarten Lankhorst'
  Cc: 'Daniel Vetter', 'Rob Clark',
	'linux-fbdev', 'YoungJun Cho',
	'Kyungmin Park', 'myungjoo.ham',
	'DRI mailing list',
	linux-arm-kernel, linux-media



> -----Original Message-----
> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
> Sent: Tuesday, May 28, 2013 12:23 AM
> To: Inki Dae
> Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'YoungJun Cho'; 'Kyungmin
> Park'; 'myungjoo.ham'; 'DRI mailing list'; linux-arm-
> kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> Hey,
> 
> Op 27-05-13 12:38, Inki Dae schreef:
> > Hi all,
> >
> > I have been removed previous branch and added new one with more cleanup.
> > This time, the fence helper doesn't include user side interfaces and
> cache
> > operation relevant codes anymore because not only we are not sure that
> > coupling those two things, synchronizing caches and buffer access
> between
> > CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is
> a
> > good idea yet but also existing codes for user side have problems with
> badly
> > behaved or crashing userspace. So this could be more discussed later.
> >
> > The below is a new branch,
> >
> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/?h=dma-f
> > ence-helper
> >
> > And fence helper codes,
> >
> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/commit/?
> > h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
> >
> > And example codes for device driver,
> >
> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/commit/?
> > h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
> >
> > I think the time is not yet ripe for RFC posting: maybe existing dma
> fence
> > and reservation need more review and addition work. So I'd glad for
> somebody
> > giving other opinions and advices in advance before RFC posting.
> >
> NAK.
> 
> For examples for how to handle locking properly, see Documentation/ww-
> mutex-design.txt in my recent tree.
> I could list what I believe is wrong with your implementation, but real
> problem is that the approach you're taking is wrong.

I just removed the ticket stubs to show you guys my approach as simply as
possible, and I just wanted to show that we could use a buffer synchronization
mechanism without ticket stubs.

Question: can ww-mutexes be used for all devices? I guess this depends on the
x86 GPU: that GPU has VRAM, which means a different memory domain.
And could you tell me why the shared fence should have only eight objects? I
think we could need more than eight objects for read access. Anyway, I don't
think I fully understand yet, so I may be missing something.

Thanks,
Inki Dae

> 

> ~Maarten


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-27 15:47                                             ` Rob Clark
  2013-05-27 16:02                                               ` Daniel Vetter
@ 2013-05-28  3:56                                               ` Inki Dae
  2013-05-28 10:32                                                 ` Daniel Vetter
  2013-05-28 13:48                                                 ` Rob Clark
  1 sibling, 2 replies; 36+ messages in thread
From: Inki Dae @ 2013-05-28  3:56 UTC (permalink / raw)
  To: 'Rob Clark'
  Cc: 'Maarten Lankhorst', 'Daniel Vetter',
	'linux-fbdev', 'YoungJun Cho',
	'Kyungmin Park', 'myungjoo.ham',
	'DRI mailing list',
	linux-arm-kernel, linux-media



> -----Original Message-----
> From: linux-fbdev-owner@vger.kernel.org [mailto:linux-fbdev-
> owner@vger.kernel.org] On Behalf Of Rob Clark
> Sent: Tuesday, May 28, 2013 12:48 AM
> To: Inki Dae
> Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho; Kyungmin
> Park; myungjoo.ham; DRI mailing list;
linux-arm-kernel@lists.infradead.org;
> linux-media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Mon, May 27, 2013 at 6:38 AM, Inki Dae <inki.dae@samsung.com> wrote:
> > Hi all,
> >
> > I have been removed previous branch and added new one with more cleanup.
> > This time, the fence helper doesn't include user side interfaces and
> cache
> > operation relevant codes anymore because not only we are not sure that
> > coupling those two things, synchronizing caches and buffer access
> between
> > CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is
> a
> > good idea yet but also existing codes for user side have problems with
> badly
> > behaved or crashing userspace. So this could be more discussed later.
> >
> > The below is a new branch,
> >
> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/?h=dma-f
> > ence-helper
> >
> > And fence helper codes,
> >
> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/commit/?
> > h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
> >
> > And example codes for device driver,
> >
> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/commit/?
> > h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
> >
> > I think the time is not yet ripe for RFC posting: maybe existing dma
> fence
> > and reservation need more review and addition work. So I'd glad for
> somebody
> > giving other opinions and advices in advance before RFC posting.
> 
> thoughts from a *really* quick, pre-coffee, first look:
> * any sort of helper to simplify single-buffer sort of use-cases (v4l)
> probably wouldn't want to bake in assumption that seqno_fence is used.
> * I guess g2d is probably not actually a simple use case, since I
> expect you can submit blits involving multiple buffers :-P

I don't think so. One or more buffers can be used: seqno_fence also has
only one buffer. Actually, we have already applied this approach to most
devices: multimedia, GPU and display controller. And this approach shows
better performance and reduced power consumption compared to the traditional
way. And the g2d example is just to show you how to apply my approach to a
device driver.

> * otherwise, you probably don't want to depend on dmabuf, which is why
> reservation/fence is split out the way it is..  you want to be able to
> use a single reservation/fence mechanism within your driver without
> having to care about which buffers are exported to dmabuf's and which
> are not.  Creating a dmabuf for every GEM bo is too heavyweight.

Right. But I think we should deal with this separately. Actually, we are
trying to use reservation for GPU pipeline synchronization, such as the SGX
sync object, and this approach is used without dmabuf. In other words, a
device can use only reservation for such pipeline synchronization and, at the
same time, the fence helper or something similar with dmabuf for buffer
synchronization.

> 
> I'm not entirely sure if reservation/fence could/should be made any
> simpler for multi-buffer users.  Probably the best thing to do is just
> get reservation/fence rolled out in a few drivers and see if some
> common patterns emerge.
> 
> BR,
> -R
> 
> >
> > Thanks,
> > Inki Dae
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-27 16:02                                               ` Daniel Vetter
@ 2013-05-28  4:04                                                 ` Inki Dae
  0 siblings, 0 replies; 36+ messages in thread
From: Inki Dae @ 2013-05-28  4:04 UTC (permalink / raw)
  To: 'Daniel Vetter', 'Rob Clark'
  Cc: 'Maarten Lankhorst', 'linux-fbdev',
	'YoungJun Cho', 'Kyungmin Park',
	'myungjoo.ham', 'DRI mailing list',
	linux-arm-kernel, linux-media



> -----Original Message-----
> From: daniel.vetter@ffwll.ch [mailto:daniel.vetter@ffwll.ch] On Behalf Of
> Daniel Vetter
> Sent: Tuesday, May 28, 2013 1:02 AM
> To: Rob Clark
> Cc: Inki Dae; Maarten Lankhorst; linux-fbdev; YoungJun Cho; Kyungmin Park;
> myungjoo.ham; DRI mailing list; linux-arm-kernel@lists.infradead.org;
> linux-media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Mon, May 27, 2013 at 5:47 PM, Rob Clark <robdclark@gmail.com> wrote:
> > On Mon, May 27, 2013 at 6:38 AM, Inki Dae <inki.dae@samsung.com> wrote:
> >> Hi all,
> >>
> >> I have been removed previous branch and added new one with more
cleanup.
> >> This time, the fence helper doesn't include user side interfaces and
> cache
> >> operation relevant codes anymore because not only we are not sure that
> >> coupling those two things, synchronizing caches and buffer access
> between
> >> CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side
> is a
> >> good idea yet but also existing codes for user side have problems with
> badly
> >> behaved or crashing userspace. So this could be more discussed later.
> >>
> >> The below is a new branch,
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/?h=dma-f
> >> ence-helper
> >>
> >> And fence helper codes,
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/commit/?
> >> h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
> >>
> >> And example codes for device driver,
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> exynos.git/commit/?
> >> h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
> >>
> >> I think the time is not yet ripe for RFC posting: maybe existing dma
> fence
> >> and reservation need more review and addition work. So I'd glad for
> somebody
> >> giving other opinions and advices in advance before RFC posting.
> >
> > thoughts from a *really* quick, pre-coffee, first look:
> > * any sort of helper to simplify single-buffer sort of use-cases (v4l)
> > probably wouldn't want to bake in assumption that seqno_fence is used.
> 
> Yeah, which is why Maarten&I discussed ideas already for what needs to
> be improved in the current dma-buf interface code to make this Just
> Work. At least as long as a driver doesn't want to add new fences,
> which would be especially useful for all kinds of gpu access.
> 
> > * I guess g2d is probably not actually a simple use case, since I
> > expect you can submit blits involving multiple buffers :-P
> 
> Yeah, on a quick read the current fence helper code seems to be a bit
> limited in scope.
> 
> > * otherwise, you probably don't want to depend on dmabuf, which is why
> > reservation/fence is split out the way it is..  you want to be able to
> > use a single reservation/fence mechanism within your driver without
> > having to care about which buffers are exported to dmabuf's and which
> > are not.  Creating a dmabuf for every GEM bo is too heavyweight.
> 
> That's pretty much the reason that reservations are free-standing from
> dma_bufs. The idea is to embed them into the gem/ttm/v4l buffer
> object. Maarten also has some helpers to keep track of multi-buffer
> ww_mutex locking and fence attaching in his reservation helpers, but I
> think we should wait with those until we have drivers using them.
> 
> For now I think the priority should be to get the basic stuff in and
> ttm as the first user established. Then we can go nuts later on.
> 
> > I'm not entirely sure if reservation/fence could/should be made any
> > simpler for multi-buffer users.  Probably the best thing to do is just
> > get reservation/fence rolled out in a few drivers and see if some
> > common patterns emerge.
> 
> I think we can make the 1 buffer per dma op (i.e. 1:1
> dma_buf->reservation : fence mapping) work fairly simple in dma_buf
> with maybe a dma_buf_attachment_start_dma/end_dma helpers. But there's
> also still the open that currently there's no way to flush cpu caches
> for dma access without unmapping the attachement (or resorting to


That is what I was trying to address by adding user interfaces to dmabuf:
coupling cache synchronization and buffer access between CPU and CPU, CPU and
DMA, and DMA and DMA with fences on the kernel side. We need something to do
this between mapping and unmapping the attachment.

> trick which might not work), so we have a few gaping holes in the
> interface already anyway.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-28  2:49                                               ` Inki Dae
@ 2013-05-28  7:23                                                 ` Maarten Lankhorst
  0 siblings, 0 replies; 36+ messages in thread
From: Maarten Lankhorst @ 2013-05-28  7:23 UTC (permalink / raw)
  To: Inki Dae
  Cc: 'Daniel Vetter', 'Rob Clark',
	'linux-fbdev', 'YoungJun Cho',
	'Kyungmin Park', 'myungjoo.ham',
	'DRI mailing list',
	linux-arm-kernel, linux-media

Hey,

Op 28-05-13 04:49, Inki Dae schreef:
>
>> -----Original Message-----
>> From: Maarten Lankhorst [mailto:maarten.lankhorst@canonical.com]
>> Sent: Tuesday, May 28, 2013 12:23 AM
>> To: Inki Dae
>> Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'YoungJun Cho'; 'Kyungmin
>> Park'; 'myungjoo.ham'; 'DRI mailing list'; linux-arm-
>> kernel@lists.infradead.org; linux-media@vger.kernel.org
>> Subject: Re: Introduce a new helper framework for buffer synchronization
>>
>> Hey,
>>
>> Op 27-05-13 12:38, Inki Dae schreef:
>>> Hi all,
>>>
>>> I have been removed previous branch and added new one with more cleanup.
>>> This time, the fence helper doesn't include user side interfaces and
>> cache
>>> operation relevant codes anymore because not only we are not sure that
>>> coupling those two things, synchronizing caches and buffer access
>> between
>>> CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is
>> a
>>> good idea yet but also existing codes for user side have problems with
>> badly
>>> behaved or crashing userspace. So this could be more discussed later.
>>>
>>> The below is a new branch,
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
>> exynos.git/?h=dma-f
>>> ence-helper
>>>
>>> And fence helper codes,
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
>> exynos.git/commit/?
>>> h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
>>>
>>> And example codes for device driver,
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
>> exynos.git/commit/?
>>> h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
>>>
>>> I think the time is not yet ripe for RFC posting: maybe existing dma
>> fence
>>> and reservation need more review and addition work. So I'd glad for
>> somebody
>>> giving other opinions and advices in advance before RFC posting.
>>>
>> NAK.
>>
>> For examples for how to handle locking properly, see Documentation/ww-
>> mutex-design.txt in my recent tree.
>> I could list what I believe is wrong with your implementation, but real
>> problem is that the approach you're taking is wrong.
> I just removed ticket stubs to show my approach you guys as simple as
> possible, and I just wanted to show that we could use buffer synchronization
> mechanism without ticket stubs.
The tickets have been removed in favor of a ww_context. Moving it in as a base primitive
allows more locking abuse to be detected, and makes some other things easier too.

> Question, WW-Mutexes could be used for all devices? I guess this has
> dependence on x86 gpu: gpu has VRAM and it means different memory domain.
> And could you tell my why shared fence should have only eight objects? I
> think we could need more than eight objects for read access. Anyway I think
> I don't surely understand yet so there might be my missing point.
Yes, ww mutexes are not limited in any way to x86. They're a locking mechanism.
When you have acquired the ww mutexes for all buffer objects, all that says is
that at that point in time you have exclusively acquired the locks of all bo's.

After locking everything you can read the fence pointers safely, queue waits, and set a
new fence pointer on all reservation_objects. You only need a single fence
on all those objects, so 8 is plenty. Nonetheless this was a limitation of my
earlier design, and I'll dynamically allocate fence_shared in the future.
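
Rough sketch of the sequence (pseudo-code; function names only roughly match
my patches, and the -EDEADLK backoff and error paths are left out):

    struct ww_acquire_ctx ctx;

    ww_acquire_init(&ctx, &reservation_ww_class);
    list_for_each_entry(bo, &job->bos, head)
            ww_mutex_lock(&bo->resv->lock, &ctx);   /* backoff not shown */
    ww_acquire_done(&ctx);

    /* safe now: inspect the old fences, queue waits on them, then
     * publish the new fence on every reservation_object */
    list_for_each_entry(bo, &job->bos, head)
            reservation_object_add_excl_fence(bo->resv, new_fence);

    list_for_each_entry(bo, &job->bos, head)
            ww_mutex_unlock(&bo->resv->lock);
    ww_acquire_fini(&ctx);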

~Maarten


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-28  3:56                                               ` Inki Dae
@ 2013-05-28 10:32                                                 ` Daniel Vetter
  2013-05-28 14:43                                                   ` Inki Dae
  2013-05-28 13:48                                                 ` Rob Clark
  1 sibling, 1 reply; 36+ messages in thread
From: Daniel Vetter @ 2013-05-28 10:32 UTC (permalink / raw)
  To: Inki Dae
  Cc: 'Rob Clark', 'Maarten Lankhorst',
	'Daniel Vetter', 'linux-fbdev',
	'YoungJun Cho', 'Kyungmin Park',
	'myungjoo.ham', 'DRI mailing list',
	linux-arm-kernel, linux-media

On Tue, May 28, 2013 at 12:56:57PM +0900, Inki Dae wrote:
> 
> 
> > -----Original Message-----
> > From: linux-fbdev-owner@vger.kernel.org [mailto:linux-fbdev-
> > owner@vger.kernel.org] On Behalf Of Rob Clark
> > Sent: Tuesday, May 28, 2013 12:48 AM
> > To: Inki Dae
> > Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho; Kyungmin
> > Park; myungjoo.ham; DRI mailing list;
> linux-arm-kernel@lists.infradead.org;
> > linux-media@vger.kernel.org
> > Subject: Re: Introduce a new helper framework for buffer synchronization
> > 
> > On Mon, May 27, 2013 at 6:38 AM, Inki Dae <inki.dae@samsung.com> wrote:
> > > Hi all,
> > >
> > > I have been removed previous branch and added new one with more cleanup.
> > > This time, the fence helper doesn't include user side interfaces and
> > cache
> > > operation relevant codes anymore because not only we are not sure that
> > > coupling those two things, synchronizing caches and buffer access
> > between
> > > CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is
> > a
> > > good idea yet but also existing codes for user side have problems with
> > badly
> > > behaved or crashing userspace. So this could be more discussed later.
> > >
> > > The below is a new branch,
> > >
> > > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> > exynos.git/?h=dma-f
> > > ence-helper
> > >
> > > And fence helper codes,
> > >
> > > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> > exynos.git/commit/?
> > > h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
> > >
> > > And example codes for device driver,
> > >
> > > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> > exynos.git/commit/?
> > > h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
> > >
> > > I think the time is not yet ripe for RFC posting: maybe existing dma
> > fence
> > > and reservation need more review and addition work. So I'd glad for
> > somebody
> > > giving other opinions and advices in advance before RFC posting.
> > 
> > thoughts from a *really* quick, pre-coffee, first look:
> > * any sort of helper to simplify single-buffer sort of use-cases (v4l)
> > probably wouldn't want to bake in assumption that seqno_fence is used.
> > * I guess g2d is probably not actually a simple use case, since I
> > expect you can submit blits involving multiple buffers :-P
> 
> I don't think so. One and more buffers can be used: seqno_fence also has
> only one buffer. Actually, we have already applied this approach to most
> devices; multimedia, gpu and display controller. And this approach shows
> more performance; reduced power consumption against traditional way. And g2d
> example is just to show you how to apply my approach to device driver.

Note that seqno_fence is an implementation pattern for a certain type of
direct hw->hw synchronization which uses a shared dma_buf to exchange the
sync cookie. The dma_buf attached to the seqno_fence has _nothing_ to do
with the dma_buf the fence actually coordinates access to.
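
Conceptually the seqno_fence is nothing more than this (field names from
memory, don't take them literally):

    struct seqno_fence {
            struct fence base;        /* the generic fence                  */
            struct dma_buf *sync_buf; /* shared cookie buffer the producing
                                       * hw writes its current seqno into,
                                       * NOT the buffer being protected     */
            u32 seqno_ofs;            /* offset of that seqno in sync_buf   */
    };
    /* signalled once the value at sync_buf+seqno_ofs has passed base.seqno,
     * which a second device can also check directly in hw */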

I think that confusion is a large reason why Maarten and I don't
understand what you want to achieve with your fence helpers. Currently
they're using the seqno_fence, but totally not in the way the seqno_fence
was meant to be used.

Note that with the current code there is only a pointer from dma_bufs to
the fence. The fence itself has _no_ pointer to the dma_buf it syncs. This
shouldn't be a problem since the fence fastpath for already signalled
fences is completely barrier&lock free (it's just a load+bit-test), and
fences are meant to be embedded into whatever dma tracking structure you
already have, so no overhead there. The only ugly part is the fence
refcounting, but I don't think we can drop that.

Note that you completely reinvent this part of Maarten's fence patches by
adding new r/w_complete completions to the reservation object, which
completely replaces the fence stuff.

Also note that a list of reservation entries is again meant to be used
only when submitting the dma to the gpu. With your patches you seem to
hang onto that list until dma completes. This has the ugly side-effect
that you need to allocate these reservation entries with kmalloc, whereas
if you just use them in the execbuf ioctl to construct the dma you can
usually embed it into something else you need already anyway. At least
i915 and ttm based drivers can work that way.

Furthermore fences are specifically constructed as frankenstein-monsters
between completion/waitqueues and callbacks. All the different use-cases
need the different aspects:
- busy/idle checks and bo retiring need the completion semantics
- callbacks (in interrupt context) are used for hybrid hw->irq handler->hw
  sync approaches
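
I.e. the same fence gets consumed both ways, roughly like this (api names as
in Maarten's series, treat it as a sketch):

    /* completion-style: block until the bo is idle */
    fence_wait(fence, true);

    /* callback-style: kick the next hw operation straight from the irq
     * handler of the previous one, no blocking involved */
    static void kick_next_batch(struct fence *f, struct fence_cb *cb)
    {
            /* poke the ring/doorbell that was waiting on f */
    }

    fence_add_callback(fence, &my_cb, kick_next_batch);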

> 
> > * otherwise, you probably don't want to depend on dmabuf, which is why
> > reservation/fence is split out the way it is..  you want to be able to
> > use a single reservation/fence mechanism within your driver without
> > having to care about which buffers are exported to dmabuf's and which
> > are not.  Creating a dmabuf for every GEM bo is too heavyweight.
> 
> Right. But I think we should dealt with this separately. Actually, we are
> trying to use reservation for gpu pipe line synchronization such as sgx sync
> object and this approach is used without dmabuf. In order words, some device
> can use only reservation for such pipe line synchronization and at the same
> time, fence helper or similar thing with dmabuf for buffer synchronization.

I think a quick overview of the different pieces would be in order.

- ww_mutex: This is just the locking primitive which allows you to grab
  multiple mutexes without the need to check for ordering (and so without
  fear of deadlocks).

- fence: This is just a fancy kind of completion with a bit of support for
  hw->hw completion events. The fences themselves have no tie-in with
  buffers, ww_mutexes or anything else.

- reservation: This ties together an object (doesn't need to be a buffer,
  could be any other gpu resource - see the drm/vmwgfx driver if you want
  your mind blown) with fences. Note that there's no connection the other
  way round, i.e. with the current patches you can't see which
  reservations are using which fences. Also note that other objects than
  reservations could point at fences, e.g. when the provided
  shared/exclusive semantics are not good enough for your needs.

  The ww_mutex in the reservation is simply the (fancy) lock which
  protects each reservation. The reason we need something fancy is that
  you need to lock all objects which are synced by the same fence
  together, otherwise you can race and construct deadlocks in the hw->hw
  sync part of fences.

- dma_buf integration: This is simply a pointer to the reservation object
  of the underlying buffer object. We need a pointer here since otherwise
  you can construct deadlocks between dma_buf objects and native buffer
  objects.

The crux is also that dma_buf integration comes last - before you can do
that you need to have your driver converted over to use
ww_mutexes/fences/reservations.
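
Putting the pieces next to each other, very roughly (simplified; field names
only approximately match Maarten's current patches):

    struct reservation_object {
            struct ww_mutex lock;             /* the "fancy" per-object lock */
            struct fence *fence_excl;         /* last exclusive (write) user */
            struct fence *fence_shared[8];    /* shared (read) users, to be
                                               * dynamically allocated later */
            u32 fence_shared_count;
    };

    struct dma_buf {
            /* ... */
            struct reservation_object *resv;  /* just points at the exporter's
                                               * embedded reservation        */
    };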

I hope that helps to unconfuse things a bit and help you understand the
different pieces of the fence/reservation/ww_mutex patches floating
around.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-28  3:56                                               ` Inki Dae
  2013-05-28 10:32                                                 ` Daniel Vetter
@ 2013-05-28 13:48                                                 ` Rob Clark
  2013-05-28 14:50                                                   ` Inki Dae
  1 sibling, 1 reply; 36+ messages in thread
From: Rob Clark @ 2013-05-28 13:48 UTC (permalink / raw)
  To: Inki Dae
  Cc: Maarten Lankhorst, Daniel Vetter, linux-fbdev, YoungJun Cho,
	Kyungmin Park, myungjoo.ham, DRI mailing list, linux-arm-kernel,
	linux-media

On Mon, May 27, 2013 at 11:56 PM, Inki Dae <inki.dae@samsung.com> wrote:
>
>
>> -----Original Message-----
>> From: linux-fbdev-owner@vger.kernel.org [mailto:linux-fbdev-
>> owner@vger.kernel.org] On Behalf Of Rob Clark
>> Sent: Tuesday, May 28, 2013 12:48 AM
>> To: Inki Dae
>> Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho; Kyungmin
>> Park; myungjoo.ham; DRI mailing list;
> linux-arm-kernel@lists.infradead.org;
>> linux-media@vger.kernel.org
>> Subject: Re: Introduce a new helper framework for buffer synchronization
>>
>> On Mon, May 27, 2013 at 6:38 AM, Inki Dae <inki.dae@samsung.com> wrote:
>> > Hi all,
>> >
>> > I have been removed previous branch and added new one with more cleanup.
>> > This time, the fence helper doesn't include user side interfaces and
>> cache
>> > operation relevant codes anymore because not only we are not sure that
>> > coupling those two things, synchronizing caches and buffer access
>> between
>> > CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is
>> a
>> > good idea yet but also existing codes for user side have problems with
>> badly
>> > behaved or crashing userspace. So this could be more discussed later.
>> >
>> > The below is a new branch,
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
>> exynos.git/?h=dma-f
>> > ence-helper
>> >
>> > And fence helper codes,
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
>> exynos.git/commit/?
>> > h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
>> >
>> > And example codes for device driver,
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
>> exynos.git/commit/?
>> > h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
>> >
>> > I think the time is not yet ripe for RFC posting: maybe existing dma
>> fence
>> > and reservation need more review and addition work. So I'd glad for
>> somebody
>> > giving other opinions and advices in advance before RFC posting.
>>
>> thoughts from a *really* quick, pre-coffee, first look:
>> * any sort of helper to simplify single-buffer sort of use-cases (v4l)
>> probably wouldn't want to bake in assumption that seqno_fence is used.
>> * I guess g2d is probably not actually a simple use case, since I
>> expect you can submit blits involving multiple buffers :-P
>
> I don't think so. One and more buffers can be used: seqno_fence also has
> only one buffer. Actually, we have already applied this approach to most
> devices; multimedia, gpu and display controller. And this approach shows
> more performance; reduced power consumption against traditional way. And g2d
> example is just to show you how to apply my approach to device driver.

no, you need the ww-mutex / reservation stuff any time you have
multiple independent devices (or rings/contexts for hw that can
support multiple contexts) which can do operations with multiple
buffers.  So you could conceivably hit this w/ gpu + g2d if multiple
buffers where shared between the two.  vram migration and such
'desktop stuff' might make the problem worse, but just because you
don't have vram doesn't mean you don't have a problem with multiple
buffers.

>> * otherwise, you probably don't want to depend on dmabuf, which is why
>> reservation/fence is split out the way it is..  you want to be able to
>> use a single reservation/fence mechanism within your driver without
>> having to care about which buffers are exported to dmabuf's and which
>> are not.  Creating a dmabuf for every GEM bo is too heavyweight.
>
> Right. But I think we should dealt with this separately. Actually, we are
> trying to use reservation for gpu pipe line synchronization such as sgx sync
> object and this approach is used without dmabuf. In order words, some device
> can use only reservation for such pipe line synchronization and at the same
> time, fence helper or similar thing with dmabuf for buffer synchronization.

it is probably easier to approach from the reverse direction.. ie, get
reservation/synchronization right first, and then dmabuf.  (Well, that
isn't really a problem because Maarten's reservation/fence patches
support dmabuf from the beginning.)

BR,
-R

>>
>> I'm not entirely sure if reservation/fence could/should be made any
>> simpler for multi-buffer users.  Probably the best thing to do is just
>> get reservation/fence rolled out in a few drivers and see if some
>> common patterns emerge.
>>
>> BR,
>> -R
>>
>> >
>> > Thanks,
>> > Inki Dae
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-28 10:32                                                 ` Daniel Vetter
@ 2013-05-28 14:43                                                   ` Inki Dae
  0 siblings, 0 replies; 36+ messages in thread
From: Inki Dae @ 2013-05-28 14:43 UTC (permalink / raw)
  To: 'Daniel Vetter'
  Cc: 'Rob Clark', 'Maarten Lankhorst',
	'linux-fbdev', 'YoungJun Cho',
	'Kyungmin Park', 'myungjoo.ham',
	'DRI mailing list',
	linux-arm-kernel, linux-media


Hi Daniel,

Thank you so much, that is very useful. :) Sorry, but could you give me more
comments on my comments below? There are still things that confuse me. :(


> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Tuesday, May 28, 2013 7:33 PM
> To: Inki Dae
> Cc: 'Rob Clark'; 'Maarten Lankhorst'; 'Daniel Vetter'; 'linux-fbdev';
> 'YoungJun Cho'; 'Kyungmin Park'; 'myungjoo.ham'; 'DRI mailing list';
> linux-arm-kernel@lists.infradead.org; linux-media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Tue, May 28, 2013 at 12:56:57PM +0900, Inki Dae wrote:
> >
> >
> > > -----Original Message-----
> > > From: linux-fbdev-owner@vger.kernel.org [mailto:linux-fbdev-
> > > owner@vger.kernel.org] On Behalf Of Rob Clark
> > > Sent: Tuesday, May 28, 2013 12:48 AM
> > > To: Inki Dae
> > > Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho;
> Kyungmin
> > > Park; myungjoo.ham; DRI mailing list;
> > linux-arm-kernel@lists.infradead.org;
> > > linux-media@vger.kernel.org
> > > Subject: Re: Introduce a new helper framework for buffer
> synchronization
> > >
> > > On Mon, May 27, 2013 at 6:38 AM, Inki Dae <inki.dae@samsung.com>
wrote:
> > > > Hi all,
> > > >
> > > > I have been removed previous branch and added new one with more
> cleanup.
> > > > This time, the fence helper doesn't include user side interfaces and
> > > cache
> > > > operation relevant codes anymore because not only we are not sure
> that
> > > > coupling those two things, synchronizing caches and buffer access
> > > between
> > > > CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel
> side is
> > > a
> > > > good idea yet but also existing codes for user side have problems
> with
> > > badly
> > > > behaved or crashing userspace. So this could be more discussed
later.
> > > >
> > > > The below is a new branch,
> > > >
> > > > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> > > exynos.git/?h=dma-f
> > > > ence-helper
> > > >
> > > > And fence helper codes,
> > > >
> > > > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> > > exynos.git/commit/?
> > > > h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
> > > >
> > > > And example codes for device driver,
> > > >
> > > > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> > > exynos.git/commit/?
> > > > h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
> > > >
> > > > I think the time is not yet ripe for RFC posting: maybe existing dma
> > > fence
> > > > and reservation need more review and addition work. So I'd glad for
> > > somebody
> > > > giving other opinions and advices in advance before RFC posting.
> > >
> > > thoughts from a *really* quick, pre-coffee, first look:
> > > * any sort of helper to simplify single-buffer sort of use-cases (v4l)
> > > probably wouldn't want to bake in assumption that seqno_fence is used.
> > > * I guess g2d is probably not actually a simple use case, since I
> > > expect you can submit blits involving multiple buffers :-P
> >
> > I don't think so. One and more buffers can be used: seqno_fence also has
> > only one buffer. Actually, we have already applied this approach to most
> > devices; multimedia, gpu and display controller. And this approach shows
> > more performance; reduced power consumption against traditional way. And
> g2d
> > example is just to show you how to apply my approach to device driver.
> 
> Note that seqno_fence is an implementation pattern for a certain type of
> direct hw->hw synchronization which uses a shared dma_buf to exchange the
> sync cookie.

I'm afraid I don't understand hw->hw synchronization. Does hw->hw
synchronization mean that the device has a hardware feature which supports
buffer synchronization internally in hardware? And what is the sync cookie?

> The dma_buf attached to the seqno_fence has _nothing_ to do
> with the dma_buf the fence actually coordinates access to.
> 
> I think that confusing is a large reason for why Maarten&I don't
> understand what you want to achieve with your fence helpers. Currently
> they're using the seqno_fence, but totally not in a way the seqno_fence
> was meant to be used.
> 
> Note that with the current code there is only a pointer from dma_bufs to
> the fence. The fence itself has _no_ pointer to the dma_buf it syncs. This
> shouldn't be a problem since the fence fastpath for already signalled
> fences is completely barrier&lock free (it's just a load+bit-test), and
> fences are meant to be embedded into whatever dma tracking structure you
> already have, so no overhead there. The only ugly part is the fence
> refcounting, but I don't think we can drop that.

The below is the proposed way:
a DMA device has to create a fence before accessing a shared buffer, and then
check whether other DMAs are accessing the shared buffer; if so, the DMA
device is blocked, and then it sets the fence on the reservation object of the
shared buffer. Then the DMA begins accessing the shared buffer. Finally, the
DMA signals its own fence so that other blocked DMAs can be woken up. However,
if another DMA was blocked before the signaling, it could access an invalid
fence object after waking up, because the fence object can be released after
signaling. So I made the other DMA take a reference before being blocked. And
I think the original version also takes a reference at ticket_commit(). Is
there anything wrong with the proposed way?
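
In pseudo-code, with made-up helper names just to illustrate the flow above:

    fence = fence_helper_create(dev, buf);            /* 1. create a fence  */

    if (buffer_is_busy(buf->resv))                    /* 2. someone else is */
            wait_for_completion(&buf->resv->r_or_w_complete); /* using it   */

    fence_helper_attach(buf->resv, fence);            /* 3. publish fence   */

    do_dma(dev, buf);                                 /* 4. access buffer   */

    fence_helper_signal(fence);                       /* 5. wake up waiters;
                                                       *    waiters took a ref
                                                       *    before blocking so
                                                       *    the fence can't be
                                                       *    freed under them */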

> 
> Note that you completely reinvent this part of Maarten's fence patches by
> adding new r/w_complete completions to the reservation object, which
> completely replaces the fence stuff.
> 
> Also note that a list of reservation entries is again meant to be used
> only when submitting the dma to the gpu. With your patches you seem to
> hang onto that list until dma completes.

Yeah, that is to prevent one DMA from accessing a shared buffer while
another is accessing it. Maybe this way amounts to software
synchronization. You seem to be saying there is another mechanism for
dealing with this. Could you give me more comments on it? Is there
really another mechanism?

> This has the ugly side-effect
> that you need to allocate these reservation entries with kmalloc, whereas
> if you just use them in the execbuf ioctl to construct the dma you can
> usually embed it into something else you need already anyway. At least
> i915 and ttm based drivers can work that way.
> 

Maybe I am misunderstanding, but I thought the current dma fence and relevant
stubs tend to depend on the desktop world, especially x86 GPUs. So I thought
we needed to customize the dma fence a little for embedded systems.

> Furthermore fences are specifically constructed as frankenstein-monsters
> between completion/waitqueues and callbacks. All the different use-cases
> need the different aspects:
> - busy/idle checks and bo retiring need the completion semantics
> - callbacks (in interrupt context) are used for hybrid hw->irq handler->hw
>   sync approaches
> 
> >
> > > * otherwise, you probably don't want to depend on dmabuf, which is why
> > > reservation/fence is split out the way it is..  you want to be able to
> > > use a single reservation/fence mechanism within your driver without
> > > having to care about which buffers are exported to dmabuf's and which
> > > are not.  Creating a dmabuf for every GEM bo is too heavyweight.
> >
> > Right. But I think we should dealt with this separately. Actually, we
> are
> > trying to use reservation for gpu pipe line synchronization such as sgx
> sync
> > object and this approach is used without dmabuf. In order words, some
> device
> > can use only reservation for such pipe line synchronization and at the
> same
> > time, fence helper or similar thing with dmabuf for buffer
> synchronization.
> 
> I think a quick overview of the different pieces would be in order.
> 
> - ww_mutex: This is just the locking primite which allows you to grab
>   multiple mutexes without the need to check for ordering (and so fear
>   deadlocks).

Does ww_mutex necessarily have to be used on embedded systems? I'm not sure
that there are any cases where we have to use this feature the way an x86 GPU
does. That is also why I removed the ticket stubs from the existing
reservation framework.

> 
> - fence: This is just a fancy kind of completion with a bit of support for
>   hw->hw completion events.

Does a hw->hw completion event mean completion of a DMA operation, or a
hardware buffer synchronization feature? If the latter, how can a fence be
used on an embedded system which has no hardware feature for buffer
synchronization? Actually, that confused me during my initial work.

> The fences themselve have no tie-in with
>   buffers, ww_mutexes or anything else.
> 
> - reservation: This ties together an object (doesn't need to be a buffer,
>   could be any other gpu resource - see the drm/vmwgfx driver if you want
>   your mind blown) with fences. Note that there's no connection the other
>   way round, i.e. with the current patches you can't see which
>   reservations are using which fences. Also note that other objects than
>   reservations could point at fences, e.g. when the provide
>   shared/exclusive semantics are not good enough for your needs.
> 
>   The ww_mutex in the reservation is simply the (fancy) lock which
>   protects each reservation. The reason we need something fancy is that
>   you need to lock all objects which are synced by the same fence
>   toghether, otherwise you can race and construct deadlocks in the hw->hw
>   sync part of fences.
> 
> - dma_buf integration: This is simply a pointer to the reservation object
>   of the underlying buffer object. We need a pointer here since otherwise
>   you can construct deadlocks between dma_buf objects and native buffer
>   objects.
> 
> The crux is also that dma_buf integration comes last - before you can do
> that you need to have your driver converted over to use
> ww_mutexes/fences/reservations.
> 
> I hope that helps to unconfuse things a bit and help you understand the
> different pieces of the fence/reservation/ww_mutex patches floating
> around.

Thank you again.:) 

Thanks,
Inki Dae

> 
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-28 13:48                                                 ` Rob Clark
@ 2013-05-28 14:50                                                   ` Inki Dae
  2013-05-28 16:49                                                     ` Daniel Vetter
  0 siblings, 1 reply; 36+ messages in thread
From: Inki Dae @ 2013-05-28 14:50 UTC (permalink / raw)
  To: 'Rob Clark'
  Cc: 'Maarten Lankhorst', 'Daniel Vetter',
	'linux-fbdev', 'YoungJun Cho',
	'Kyungmin Park', 'myungjoo.ham',
	'DRI mailing list',
	linux-arm-kernel, linux-media



> -----Original Message-----
> From: linux-fbdev-owner@vger.kernel.org [mailto:linux-fbdev-
> owner@vger.kernel.org] On Behalf Of Rob Clark
> Sent: Tuesday, May 28, 2013 10:49 PM
> To: Inki Dae
> Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho; Kyungmin
> Park; myungjoo.ham; DRI mailing list;
linux-arm-kernel@lists.infradead.org;
> linux-media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Mon, May 27, 2013 at 11:56 PM, Inki Dae <inki.dae@samsung.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: linux-fbdev-owner@vger.kernel.org [mailto:linux-fbdev-
> >> owner@vger.kernel.org] On Behalf Of Rob Clark
> >> Sent: Tuesday, May 28, 2013 12:48 AM
> >> To: Inki Dae
> >> Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho;
> Kyungmin
> >> Park; myungjoo.ham; DRI mailing list;
> > linux-arm-kernel@lists.infradead.org;
> >> linux-media@vger.kernel.org
> >> Subject: Re: Introduce a new helper framework for buffer
> synchronization
> >>
> >> On Mon, May 27, 2013 at 6:38 AM, Inki Dae <inki.dae@samsung.com> wrote:
> >> > Hi all,
> >> >
> >> > I have been removed previous branch and added new one with more
> cleanup.
> >> > This time, the fence helper doesn't include user side interfaces and
> >> cache
> >> > operation relevant codes anymore because not only we are not sure
> that
> >> > coupling those two things, synchronizing caches and buffer access
> >> between
> >> > CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side
> is
> >> a
> >> > good idea yet but also existing codes for user side have problems
> with
> >> badly
> >> > behaved or crashing userspace. So this could be more discussed later.
> >> >
> >> > The below is a new branch,
> >> >
> >> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> >> exynos.git/?h=dma-f
> >> > ence-helper
> >> >
> >> > And fence helper codes,
> >> >
> >> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> >> exynos.git/commit/?
> >> > h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005
> >> >
> >> > And example codes for device driver,
> >> >
> >> > https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
> >> exynos.git/commit/?
> >> > h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae
> >> >
> >> > I think the time is not yet ripe for RFC posting: maybe existing dma
> >> fence
> >> > and reservation need more review and addition work. So I'd glad for
> >> somebody
> >> > giving other opinions and advices in advance before RFC posting.
> >>
> >> thoughts from a *really* quick, pre-coffee, first look:
> >> * any sort of helper to simplify single-buffer sort of use-cases (v4l)
> >> probably wouldn't want to bake in assumption that seqno_fence is used.
> >> * I guess g2d is probably not actually a simple use case, since I
> >> expect you can submit blits involving multiple buffers :-P
> >
> > I don't think so. One and more buffers can be used: seqno_fence also has
> > only one buffer. Actually, we have already applied this approach to most
> > devices; multimedia, gpu and display controller. And this approach shows
> > more performance; reduced power consumption against traditional way. And
> g2d
> > example is just to show you how to apply my approach to device driver.
> 
> no, you need the ww-mutex / reservation stuff any time you have
> multiple independent devices (or rings/contexts for hw that can
> support multiple contexts) which can do operations with multiple
> buffers.

I think I already use the reservation stuff that way everywhere, except for
the ww-mutex. And I'm not sure that an embedded system really needs the
ww-mutex. If there is any such case,
could you tell me about it? I really need more advice and understanding :)

Thanks,
Inki Dae

> So you could conceivably hit this w/ gpu + g2d if multiple
> buffers where shared between the two.  vram migration and such
> 'desktop stuff' might make the problem worse, but just because you
> don't have vram doesn't mean you don't have a problem with multiple
> buffers.
> 
> >> * otherwise, you probably don't want to depend on dmabuf, which is why
> >> reservation/fence is split out the way it is..  you want to be able to
> >> use a single reservation/fence mechanism within your driver without
> >> having to care about which buffers are exported to dmabuf's and which
> >> are not.  Creating a dmabuf for every GEM bo is too heavyweight.
> >
> > Right. But I think we should dealt with this separately. Actually, we
> are
> > trying to use reservation for gpu pipe line synchronization such as sgx
> sync
> > object and this approach is used without dmabuf. In order words, some
> device
> > can use only reservation for such pipe line synchronization and at the
> same
> > time, fence helper or similar thing with dmabuf for buffer
> synchronization.
> 
> it is probably easier to approach from the reverse direction.. ie, get
> reservation/synchronization right first, and then dmabuf.  (Well, that
> isn't really a problem because Maarten's reservation/fence patches
> support dmabuf from the beginning.)
> 
> BR,
> -R
> 
> >>
> >> I'm not entirely sure if reservation/fence could/should be made any
> >> simpler for multi-buffer users.  Probably the best thing to do is just
> >> get reservation/fence rolled out in a few drivers and see if some
> >> common patterns emerge.
> >>
> >> BR,
> >> -R
> >>
> >> >
> >> > Thanks,
> >> > Inki Dae
> >> >
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-fbdev"
> in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Introduce a new helper framework for buffer synchronization
  2013-05-28 14:50                                                   ` Inki Dae
@ 2013-05-28 16:49                                                     ` Daniel Vetter
  2013-05-29  2:21                                                       ` Inki Dae
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Vetter @ 2013-05-28 16:49 UTC (permalink / raw)
  To: Inki Dae
  Cc: Rob Clark, Maarten Lankhorst, linux-fbdev, YoungJun Cho,
	Kyungmin Park, myungjoo.ham, DRI mailing list, linux-arm-kernel,
	linux-media

On Tue, May 28, 2013 at 4:50 PM, Inki Dae <inki.dae@samsung.com> wrote:
> I think I already used reservation stuff any time in that way except
> ww-mutex. And I'm not sure that embedded system really needs ww-mutex. If
> there is any case,
> could you tell me the case? I really need more advice and understanding :)

If you have only one driver, you can get away without ww_mutex.
drm/i915 does it, all buffer state is protected by dev->struct_mutex.
But as soon as you have multiple drivers sharing buffers with dma_buf
things will blow up.

Yep, current prime is broken and can lead to deadlocks.

In practice it doesn't (yet) matter since only the X server does the
sharing dance, and that one's single-threaded. Now you can claim that
since you have all buffers pinned in embedded gfx anyway, you don't
care. But both in desktop gfx and embedded gfx the real fun starts
once you put fences into the mix and link them up with buffers, then
every command submission risks that deadlock. Furthermore you can get
unlucky and construct a circle of fences waiting on each another (only
though if the fence singalling fires off the next batchbuffer
asynchronously).

To prevent such deadlocks you _absolutely_ need to lock _all_ buffers
that take part in a command submission at once. To do that you either
need a global lock (ugh) or ww_mutexes.
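
The "lock everything at once" part then looks roughly like this. Heavily
simplified sketch: a real implementation keeps the contended lock held across
the retry with ww_mutex_lock_slow, and the two unlock helpers here are made
up:

    struct ww_acquire_ctx ctx;
    int ret;

    ww_acquire_init(&ctx, &reservation_ww_class);
retry:
    list_for_each_entry(bo, &job->bos, head) {
            ret = ww_mutex_lock(&bo->resv->lock, &ctx);
            if (ret == -EDEADLK) {
                    /* we'd deadlock against an older submission: back off
                     * completely and start over */
                    unlock_everything_locked_so_far(job, bo);
                    goto retry;
            }
    }
    ww_acquire_done(&ctx);

    /* ... wait on/attach fences, kick off the dma ... */

    unlock_everything(job);
    ww_acquire_fini(&ctx);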

So ww_mutexes are the fundamental ingredient of all this, if you don't
see why you need them then everything piled on top is broken. I think
until you've understood why exactly we need ww_mutexes there's not
much point in discussing the finer issues of fences, reservation
objects and how to integrate it with dma_bufs exactly.

I'll try to clarify the motivating example in the ww_mutex
documentation a bit, but I dunno how else I could explain this ...

Yours, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Introduce a new helper framework for buffer synchronization
  2013-05-28 16:49                                                     ` Daniel Vetter
@ 2013-05-29  2:21                                                       ` Inki Dae
  0 siblings, 0 replies; 36+ messages in thread
From: Inki Dae @ 2013-05-29  2:21 UTC (permalink / raw)
  To: 'Daniel Vetter'
  Cc: 'Rob Clark', 'Maarten Lankhorst',
	'linux-fbdev', 'YoungJun Cho',
	'Kyungmin Park', 'myungjoo.ham',
	'DRI mailing list',
	linux-arm-kernel, linux-media



> -----Original Message-----
> From: daniel.vetter@ffwll.ch [mailto:daniel.vetter@ffwll.ch] On Behalf Of
> Daniel Vetter
> Sent: Wednesday, May 29, 2013 1:50 AM
> To: Inki Dae
> Cc: Rob Clark; Maarten Lankhorst; linux-fbdev; YoungJun Cho; Kyungmin Park;
> myungjoo.ham; DRI mailing list; linux-arm-kernel@lists.infradead.org;
> linux-media@vger.kernel.org
> Subject: Re: Introduce a new helper framework for buffer synchronization
> 
> On Tue, May 28, 2013 at 4:50 PM, Inki Dae <inki.dae@samsung.com> wrote:
> > I think I already used reservation stuff any time in that way except
> > ww-mutex. And I'm not sure that embedded system really needs ww-mutex. If
> > there is any case,
> > could you tell me the case? I really need more advice and understanding :)
> 
> If you have only one driver, you can get away without ww_mutex.
> drm/i915 does it, all buffer state is protected by dev->struct_mutex.
> But as soon as you have multiple drivers sharing buffers with dma_buf
> things will blow up.
> 
> Yep, current prime is broken and can lead to deadlocks.
> 
> In practice it doesn't (yet) matter since only the X server does the
> sharing dance, and that one's single-threaded. Now you can claim that
> since you have all buffers pinned in embedded gfx anyway, you don't
> care. But both in desktop gfx and embedded gfx the real fun starts
> once you put fences into the mix and link them up with buffers, then
> every command submission risks that deadlock. Furthermore you can get
> unlucky and construct a circle of fences waiting on one another (though
> only if the fence signalling fires off the next batchbuffer
> asynchronously).

In our case, we haven't experienced a deadlock yet, but it is still possible
to run into one when a process shares two buffers with another process, like
below:
	Process A committed buffer A and waits for buffer B,
	Process B committed buffer B and waits for buffer A

That is a deadlock, and as I understand it, you are saying we can resolve the
deadlock issue with ww-mutexes. It also seems that we can replace our
block-wakeup mechanism with a mutex lock for better performance.

> 
> To prevent such deadlocks you _absolutely_ need to lock _all_ buffers
> that take part in a command submission at once. To do that you either
> need a global lock (ugh) or ww_mutexes.
> 
> So ww_mutexes are the fundamental ingredient of all this, if you don't
> see why you need them then everything piled on top is broken. I think
> until you've understood why exactly we need ww_mutexes there's not
> much point in discussing the finer issues of fences, reservation
> objects and how to integrate it with dma_bufs exactly.
> 
> I'll try to clarify the motivating example in the ww_mutex
> documentation a bit, but I dunno how else I could explain this ...
> 

I really don't want you to waste your time on me. I will try to apply
ww-mutexes (v4) to the proposed framework to understand them better.

Thanks for your advice. :)
Inki Dae

> Yours, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Introduce a dmabuf sync framework for buffer synchronization
  2013-05-23 11:55                                       ` Daniel Vetter
  2013-05-23 13:37                                         ` Inki Dae
@ 2013-06-07  9:02                                         ` Inki Dae
  1 sibling, 0 replies; 36+ messages in thread
From: Inki Dae @ 2013-06-07  9:02 UTC (permalink / raw)
  To: 'Maarten Lankhorst', 'Daniel Vetter',
	'Rob Clark'
  Cc: 'linux-fbdev', 'DRI mailing list',
	'Kyungmin Park', 'myungjoo.ham',
	'YoungJun Cho',
	linux-arm-kernel, linux-media

Hi all,

Came back :)

Please ignore the previous fence helper. I have re-implemented the buffer
synchronization mechanism, dmabuf sync, based on DMA BUF and the wound/wait
style lock v5[1], and tested it with the 2d gpu and drm kms drivers.

Below is the dmabuf sync framework code,
	https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?h=dmabuf-sync&id=ae6c5a0146ab72ea64d9fc91af4595aacf9a5139

And the g2d driver using the dmabuf sync framework,
	https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?h=dmabuf-sync&id=4030bdee9bab5841ad32faade528d04cc0c5fc94

And also the exynos drm kms framework,
	https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?h=dmabuf-sync&id=6ca548e9ea9e865592719ef6b1cde58366af9f5c

[1] https://patchwork-mail1.kernel.org/patch/2625321/

The purpose of this framework is not only to synchronize caches and buffer
access between CPU and CPU, CPU and DMA, and DMA and DMA, but also to provide
easy-to-use interfaces for device drivers. We may also need user-space
interfaces, but there is no good way to expose them to user space yet.

Basically, the mechanism of this framework has the following steps:

Initialize a dmabuf sync object - A task gets a new sync object allocated and
initialized by dmabuf_sync_init(). This should be done when a device driver
context (or something similar) is created, or before cpu access.

Add a dmabuf to the sync object - A task adds one or more dmabufs that it
wants to access to its own sync object. This should be done just before
setting up the dma buffer relevant registers or before buffer cpu access.

Lock a sync object - A task tries to lock all dmabufs added to the sync
object. For this, ww-mutexes are used to resolve deadlocks. This should be
done before dma start or cpu access. The lock means the DMA or CPU of the
current task can access the dmabufs, so others cannot.

Unlock a sync object - A task unlocks all dmabufs added to the sync object.
This should be done after the completion of the dma operation or cpu access.
The unlock means the DMA or CPU access of the current task has completed, so
others can access the dmabufs.

In addition, this framework includes the following two features.

Read and write locks - this feature is for better performance: we don't need
to take a lock when a buffer was previously accessed for read and the current
task also wants to access it for read. So when the current task tries to take
a lock, this feature checks the previous access type and the desired access
type for the given buffer and then decides whether a lock has to be taken on
the buffer or not.
    
Lockup and resource management - this feature handles the case where a task
never unlocks a buffer after taking a lock on it. In that case, a timer
handler for the task fires some time later and the buffer is then unlocked by
a workqueue handler, to avoid lockup and to release the buffer's resources.
    
Tutorial.
	when allocating and initializing device context
		struct dmabuf_sync *sync;

		sync = dmabuf_sync_init(NULL, NULL);

	when setting up dma buffer relevant registers
		/* this can be called repeatedly for multiple buffers. */
		dmabuf_sync_get(sync, dmabuf);

	just before dma start or cpu access
		dmabuf_sync_lock(sync);

	after the completion of dma operation or cpu access
		dmabuf_sync_unlock(sync);
		dmabuf_sync_put_all(sync);
		dmabuf_sync_fini(sync);
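
For completeness, a rough sketch of how one submission path could use this,
assuming dmabuf_sync_lock() returns 0 or a negative error code (the return
convention and the src/dst dmabuf names are only my assumption here, not
taken from the patches):

	struct dmabuf_sync *sync;
	int ret;

	sync = dmabuf_sync_init(NULL, NULL);

	dmabuf_sync_get(sync, src_dmabuf);
	dmabuf_sync_get(sync, dst_dmabuf);

	ret = dmabuf_sync_lock(sync);	/* may block until other users unlock */
	if (ret) {
		dmabuf_sync_put_all(sync);
		dmabuf_sync_fini(sync);
		return ret;
	}

	/* ... program the registers and run the dma operation ... */

	dmabuf_sync_unlock(sync);
	dmabuf_sync_put_all(sync);
	dmabuf_sync_fini(sync);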


Deadlock reproduction with dmabuf sync.
For the deadlock test, I tried to reproduce a deadlock situation as shown
below, and this approach worked well.
(Please assume that the two tasks share dmabufs.)

	[Task A]
	struct dmabuf_sync *sync;

	sync = dmabuf_sync_init(NULL, NULL);
	
	dmabuf_sync_get(sync, dmabuf A);
	dmabuf_sync_get(sync, dmabuf B);

	while(1) {
		dmabuf_sync_lock(sync);
		sleep(1);
		dmabuf_sync_unlock(sync);
		sleep(1);
	}

	[Task B]
	struct dmabuf_sync *sync;

	sync = dmabuf_sync_init(NULL, NULL);
	
	dmabuf_sync_get(sync, dmabuf C);
	dmabuf_sync_get(sync, dmabuf A);

	while(1) {
		dmabuf_sync_lock(sync); <-- deadlock
		sleep(1);
		dmabuf_sync_unlock(sync);
		sleep(1);
	}

With the above example code, the deadlock situation occurs when Task B calls
dmabuf_sync_lock(): internally, Task B takes a lock on dmabuf C and then
tries to take a lock on dmabuf A. At this point, ww_mutex_lock() returns
-EDEADLK because ctx->acquired became 1 once the lock on dmabuf C was taken.
Task B then unlocks dmabuf C and takes a slowpath lock on dmabuf A. Once
Task A unlocks dmabuf A, Task B tries to take the lock on dmabuf C again.

And below are my concerns and opinions.

A dma buf has a reservation object when a buffer is exported as a dma buf -
I'm not sure, but it seems that the reservation object is used for x86 gpus
(which have vram and a separate domain) or similar devices. In an embedded
system, most dma devices and the cpu share system memory, so I think the
reservation object should be considered for them as well, because the buffer
synchronization mechanism should basically work on top of dmabuf. For this,
I have added four members to reservation_object: shared_cnt and shared for
the read lock, accessed_type for cache operations, and locked for the timeout
case. However, some devices might need something specific to themselves, so
how about keeping only the common parts in the reservation_object structure?
It seems that fence_excl, fence_shared, and so on are not common parts.
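
Concretely, the additions would look roughly like below; only the member
names come from the description above, the types and comments are
illustrative:

struct reservation_object {
	/* existing members (lock, fence_excl, fence_shared, ...) omitted */

	unsigned int	shared_cnt;	/* number of shared (read) users */
	unsigned int	shared;		/* a shared (read) lock is held */
	unsigned int	accessed_type;	/* previous access, for cache ops */
	unsigned int	locked;		/* lock held, for the timeout case */
};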

The wound/wait mutex doesn't currently distinguish read and write locks - In
the case of using ww-mutexes for buffer synchronization, it seems that we
need read and write locks for better performance; a read access followed by
another read access doesn't need to be locked. For this, I have added the
shared_cnt and shared members to reservation_object, just to show how a read
lock could be used. However, I'm sure there is a better/more generic way.
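
Just to illustrate the idea (the helper name and the access-type flag are
made up; only accessed_type is one of the members mentioned above):

/* illustrative only: decide whether a new access really needs the lock */
static bool dmabuf_sync_need_lock(struct reservation_object *resv,
				  unsigned int new_type)
{
	/* a read access following a read access needs no exclusive lock */
	if (resv->accessed_type == DMA_BUF_ACCESS_READ &&
	    new_type == DMA_BUF_ACCESS_READ)
		return false;

	return true;
}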

All of the above is just a quick implementation of buffer synchronization,
so it should be cleaned up further, and I might be missing something.
Please give me your advice and opinions.

Thanks,
Inki Dae


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2013-06-07  9:02 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAAQKjZNNw4qddo6bE5OY_CahrqDtqkxdO7Pm9RCguXyj9F4cMQ@mail.gmail.com>
2013-05-13  8:00 ` Introduce a new helper framework for buffer synchronization Maarten Lankhorst
2013-05-13  9:21   ` Inki Dae
2013-05-13  9:52     ` Maarten Lankhorst
2013-05-13 11:24       ` Inki Dae
2013-05-13 11:40         ` Maarten Lankhorst
2013-05-13 12:21           ` Inki Dae
2013-05-13 13:48             ` Rob Clark
     [not found]               ` <CAAQKjZP=iOmHRpHZCbZD3v_RKUFSn0eM_WVZZvhe7F9g3eTmPA@mail.gmail.com>
2013-05-13 17:58                 ` Rob Clark
2013-05-14  2:52                   ` Inki Dae
2013-05-14 13:38                     ` Rob Clark
2013-05-15  5:19                       ` Inki Dae
2013-05-15 14:06                         ` Rob Clark
2013-05-17  4:19                           ` Daniel Vetter
     [not found]                           ` <CAAQKjZP37koEPob6yqpn-WxxTh3+O=twyvRzDiEhVJTD8BxQzw@mail.gmail.com>
2013-05-20 21:13                             ` Daniel Vetter
2013-05-20 21:30                               ` Daniel Vetter
2013-05-21  7:03                                 ` Inki Dae
2013-05-21  7:44                                   ` Daniel Vetter
2013-05-21  9:22                                     ` Inki Dae
2013-05-23 11:55                                       ` Daniel Vetter
2013-05-23 13:37                                         ` Inki Dae
2013-05-27 10:38                                           ` Inki Dae
2013-05-27 15:23                                             ` Maarten Lankhorst
2013-05-28  2:49                                               ` Inki Dae
2013-05-28  7:23                                                 ` Maarten Lankhorst
2013-05-27 15:47                                             ` Rob Clark
2013-05-27 16:02                                               ` Daniel Vetter
2013-05-28  4:04                                                 ` Inki Dae
2013-05-28  3:56                                               ` Inki Dae
2013-05-28 10:32                                                 ` Daniel Vetter
2013-05-28 14:43                                                   ` Inki Dae
2013-05-28 13:48                                                 ` Rob Clark
2013-05-28 14:50                                                   ` Inki Dae
2013-05-28 16:49                                                     ` Daniel Vetter
2013-05-29  2:21                                                       ` Inki Dae
2013-06-07  9:02                                         ` Introduce a dmabuf sync " Inki Dae
2013-05-13 19:29         ` Introduce a new helper " Tomasz Figa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).