* Try to address the DMA-buf coherency problem
@ 2022-10-20 12:13 Christian König
  2022-10-20 12:13 ` [PATCH 1/3] dma-buf: add dma_coherent flag Christian König
                   ` (4 more replies)
  0 siblings, 5 replies; 110+ messages in thread
From: Christian König @ 2022-10-20 12:13 UTC (permalink / raw)
  To: l.stach, nicolas, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

Hi guys,

after finding that we essentially have two separate worlds for coherent sharing
of buffers through DMA-buf, I thought I would tackle that problem a bit and at
least allow the framework to reject attachments which won't work.

So those patches here add a new dma_coherent flag to each DMA-buf object
telling the framework that dev_is_dma_coherent() needs to return true for an
importing device to be able to attach. Since we should always have a fallback
path this should give userspace the chance to still keep the use case working,
either by doing a CPU copy instead or reversing the roles of exporter and
importer.
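
The rejection shows up on the importing side roughly like in the following
sketch (illustration only; the foo_ function is made up and error handling
is trimmed):

    #include <linux/dma-buf.h>
    #include <linux/err.h>

    /*
     * Sketch: an importing driver attaching to a foreign DMA-buf. With
     * these patches a non-coherent importer gets -EINVAL back from
     * dma_buf_attach(); the error propagates up to userspace, which can
     * then fall back to a CPU copy or swap exporter and importer.
     */
    static int foo_import_dmabuf(struct device *dev, struct dma_buf *dmabuf)
    {
            struct dma_buf_attachment *attach;

            attach = dma_buf_attach(dmabuf, dev);
            if (IS_ERR(attach))
                    return PTR_ERR(attach);

            /* ... map and use the attachment ... */

            dma_buf_detach(dmabuf, attach);
            return 0;
    }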

For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
the flag for their buffers if special handling like the USWC flag in amdgpu or
the uncached allocations for radeon/nouveau is in use.

In addition to that, importers can also check the flag if they have some
non-snooping operations like the special scanout case for amdgpu for example.
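
As an importer-side illustration only (the helper is made up and not part of
the patches):

    /*
     * Sketch: a non-snooping path (e.g. a scanout engine) can only be used
     * directly when the exporter did not require cache snooping, i.e. when
     * dma_coherent is cleared (uncached/USWC style allocation). Otherwise
     * the importer should fall back to a snooped copy.
     */
    static bool foo_can_use_nosnoop_path(struct dma_buf_attachment *attach)
    {
            return !attach->dmabuf->dma_coherent;
    }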

The patches are only smoke tested and the solution isn't ideal, but as far as
I can see it should at least keep things working.

Please review and/or comment,
Christian.




* [PATCH 1/3] dma-buf: add dma_coherent flag
  2022-10-20 12:13 Try to address the DMA-buf coherency problem Christian König
@ 2022-10-20 12:13 ` Christian König
  2022-10-20 12:13 ` [PATCH 2/3] drm/prime: set the dma_coherent flag for export Christian König
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-10-20 12:13 UTC (permalink / raw)
  To: l.stach, nicolas, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media
  Cc: Christian König

Start to fix the coherency issues for non-PCI devices by adding a dma_coherent
flag to the DMA-buf.

The flag should be set by the exporter if only devices which can do coherent
DMA access with respect to the CPU cache are allowed to access the buffer.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-buf.c |  5 +++++
 include/linux/dma-buf.h   | 16 ++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index c40d72d318fd..7509807bf485 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-fence.h>
+#include <linux/dma-map-ops.h>
 #include <linux/anon_inodes.h>
 #include <linux/export.h>
 #include <linux/debugfs.h>
@@ -635,6 +636,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 
 	dmabuf->priv = exp_info->priv;
 	dmabuf->ops = exp_info->ops;
+	dmabuf->dma_coherent = exp_info->coherent;
 	dmabuf->size = exp_info->size;
 	dmabuf->exp_name = exp_info->exp_name;
 	dmabuf->owner = exp_info->owner;
@@ -894,6 +896,9 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 	if (WARN_ON(importer_ops && !importer_ops->move_notify))
 		return ERR_PTR(-EINVAL);
 
+	if (dmabuf->dma_coherent && !dev_is_dma_coherent(dev))
+		return ERR_PTR(-EINVAL);
+
 	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
 	if (!attach)
 		return ERR_PTR(-ENOMEM);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 6fa8d4e29719..f2083b94b116 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -326,6 +326,20 @@ struct dma_buf {
 	/** @ops: dma_buf_ops associated with this buffer object. */
 	const struct dma_buf_ops *ops;
 
+	/**
+	 * @dma_coherent:
+	 *
+	 * True if the buffer is backed by DMA coherent memory with respect to
+	 * the CPU cache even if the architecture can support incoherent
+	 * devices.
+	 *
+	 * Usually mirrors the result of dev_is_dma_coherent() of the exporter,
+	 * but can be cleared by the exporter to allow incoherent devices
+	 * access to the buffer if the exporter takes care of coherency for CPU
+	 * accesses.
+	 */
+	bool dma_coherent;
+
 	/**
 	 * @vmapping_counter:
 	 *
@@ -524,6 +538,7 @@ struct dma_buf_attachment {
  * @ops:	Attach allocator-defined dma buf ops to the new buffer
  * @size:	Size of the buffer - invariant over the lifetime of the buffer
  * @flags:	mode flags for the file
+ * @coherent:	If DMA accesses must be coherent to the CPU cache.
  * @resv:	reservation-object, NULL to allocate default one
  * @priv:	Attach private data of allocator to this buffer
  *
@@ -536,6 +551,7 @@ struct dma_buf_export_info {
 	const struct dma_buf_ops *ops;
 	size_t size;
 	int flags;
+	bool coherent;
 	struct dma_resv *resv;
 	void *priv;
 };
-- 
2.25.1



* [PATCH 2/3] drm/prime: set the dma_coherent flag for export
  2022-10-20 12:13 Try to address the DMA-buf coherency problem Christian König
  2022-10-20 12:13 ` [PATCH 1/3] dma-buf: add dma_coherent flag Christian König
@ 2022-10-20 12:13 ` Christian König
  2022-10-20 14:43     ` Rob Clark
  2022-10-20 12:13 ` [PATCH 3/3] media: videobuf2: set dma_coherent flag for DMA-buf Christian König
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 110+ messages in thread
From: Christian König @ 2022-10-20 12:13 UTC (permalink / raw)
  To: l.stach, nicolas, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media
  Cc: Christian König

When a device driver is snooping the CPU cache during access we assume
that all importers need to be able to snoop the CPU cache as well.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/drm_prime.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 20e109a802ae..d5c70b6fe8a4 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -28,6 +28,7 @@
 
 #include <linux/export.h>
 #include <linux/dma-buf.h>
+#include <linux/dma-map-ops.h>
 #include <linux/rbtree.h>
 #include <linux/module.h>
 
@@ -889,6 +890,7 @@ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
 		.size = obj->size,
 		.flags = flags,
 		.priv = obj,
+		.coherent = dev_is_dma_coherent(dev->dev),
 		.resv = obj->resv,
 	};
 
-- 
2.25.1



* [PATCH 3/3] media: videobuf2: set dma_coherent flag for DMA-buf
  2022-10-20 12:13 Try to address the DMA-buf coherency problem Christian König
  2022-10-20 12:13 ` [PATCH 1/3] dma-buf: add dma_coherent flag Christian König
  2022-10-20 12:13 ` [PATCH 2/3] drm/prime: set the dma_coherent flag for export Christian König
@ 2022-10-20 12:13 ` Christian König
  2022-11-04 10:42   ` Hans Verkuil
  2022-10-27 12:14 ` Try to address the DMA-buf coherency problem Christian König
  2022-10-28  8:09 ` Lucas Stach
  4 siblings, 1 reply; 110+ messages in thread
From: Christian König @ 2022-10-20 12:13 UTC (permalink / raw)
  To: l.stach, nicolas, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media
  Cc: Christian König

When a device is snooping the CPU cache, we assume that all importers
must snoop the CPU cache as well.

vmalloc allocations are the exception: since those implement mmap() with
cached CPU mappings, importers must always snoop the cache or we will run
into coherency problems.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/media/common/videobuf2/videobuf2-dma-contig.c | 2 ++
 drivers/media/common/videobuf2/videobuf2-dma-sg.c     | 2 ++
 drivers/media/common/videobuf2/videobuf2-vmalloc.c    | 1 +
 3 files changed, 5 insertions(+)

diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c
index 555bd40fa472..57433310c55c 100644
--- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c
+++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c
@@ -11,6 +11,7 @@
  */
 
 #include <linux/dma-buf.h>
+#include <linux/dma-map-ops.h>
 #include <linux/module.h>
 #include <linux/refcount.h>
 #include <linux/scatterlist.h>
@@ -507,6 +508,7 @@ static struct dma_buf *vb2_dc_get_dmabuf(struct vb2_buffer *vb,
 	exp_info.size = buf->size;
 	exp_info.flags = flags;
 	exp_info.priv = buf;
+	exp_info.coherent = dev_is_dma_coherent(vb->vb2_queue->dev);
 
 	if (!buf->sgt_base)
 		buf->sgt_base = vb2_dc_get_base_sgt(buf);
diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
index 36981a5b5c53..dbdd753e4a39 100644
--- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
+++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
@@ -10,6 +10,7 @@
  * the Free Software Foundation.
  */
 
+#include <linux/dma-map-ops.h>
 #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/refcount.h>
@@ -522,6 +523,7 @@ static struct dma_buf *vb2_dma_sg_get_dmabuf(struct vb2_buffer *vb,
 	exp_info.size = buf->size;
 	exp_info.flags = flags;
 	exp_info.priv = buf;
+	exp_info.coherent = dev_is_dma_coherent(vb->vb2_queue->dev);
 
 	if (WARN_ON(!buf->dma_sgt))
 		return NULL;
diff --git a/drivers/media/common/videobuf2/videobuf2-vmalloc.c b/drivers/media/common/videobuf2/videobuf2-vmalloc.c
index 41db707e43a4..0b6874733e86 100644
--- a/drivers/media/common/videobuf2/videobuf2-vmalloc.c
+++ b/drivers/media/common/videobuf2/videobuf2-vmalloc.c
@@ -341,6 +341,7 @@ static struct dma_buf *vb2_vmalloc_get_dmabuf(struct vb2_buffer *vb,
 	exp_info.size = buf->size;
 	exp_info.flags = flags;
 	exp_info.priv = buf;
+	exp_info.coherent = true;
 
 	if (WARN_ON(!buf->vaddr))
 		return NULL;
-- 
2.25.1



* Re: [PATCH 2/3] drm/prime: set the dma_coherent flag for export
  2022-10-20 12:13 ` [PATCH 2/3] drm/prime: set the dma_coherent flag for export Christian König
@ 2022-10-20 14:43     ` Rob Clark
  0 siblings, 0 replies; 110+ messages in thread
From: Rob Clark @ 2022-10-20 14:43 UTC (permalink / raw)
  To: Christian König
  Cc: l.stach, nicolas, ppaalanen, sumit.semwal, daniel, dri-devel,
	linaro-mm-sig, linux-media, Christian König

On Thu, Oct 20, 2022 at 5:13 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> When a device driver is snooping the CPU cache during access we assume
> that all importers need to be able to snoop the CPU cache as well.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/drm_prime.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> index 20e109a802ae..d5c70b6fe8a4 100644
> --- a/drivers/gpu/drm/drm_prime.c
> +++ b/drivers/gpu/drm/drm_prime.c
> @@ -28,6 +28,7 @@
>
>  #include <linux/export.h>
>  #include <linux/dma-buf.h>
> +#include <linux/dma-map-ops.h>
>  #include <linux/rbtree.h>
>  #include <linux/module.h>
>
> @@ -889,6 +890,7 @@ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
>                 .size = obj->size,
>                 .flags = flags,
>                 .priv = obj,
> +               .coherent = dev_is_dma_coherent(dev->dev),

To set the coherent flag correctly, I think I'd need a way to override it
on a per-buffer basis, since coherency is a property of the gpu
pgtables (which in the msm case is an immutable property of the gem
object).  We also have some awkwardness in that drm->dev isn't actually
the GPU, thanks to the kernel's device model seeing a collection of
other small devices shoehorned into a single drm device to fit
userspace's view of the world.  So relying on drm->dev isn't really
going to give sensible results.

I guess msm could just bury our heads in the sand and continue to do
things the way we have been (buffers that are mapped cached-coherent
are only self-shared), but it would be nice to catch it if userspace
tried to import one into (for example) v4l2.

BR,
-R

>                 .resv = obj->resv,
>         };
>
> --
> 2.25.1
>


* Re: [PATCH 2/3] drm/prime: set the dma_coherent flag for export
  2022-10-20 14:43     ` Rob Clark
@ 2022-10-20 14:56       ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-10-20 14:56 UTC (permalink / raw)
  To: Rob Clark, Christian König
  Cc: l.stach, nicolas, ppaalanen, sumit.semwal, daniel, dri-devel,
	linaro-mm-sig, linux-media

Am 20.10.22 um 16:43 schrieb Rob Clark:
> On Thu, Oct 20, 2022 at 5:13 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> When a device driver is snooping the CPU cache during access we assume
>> that all importers need to be able to snoop the CPU cache as well.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/drm_prime.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
>> index 20e109a802ae..d5c70b6fe8a4 100644
>> --- a/drivers/gpu/drm/drm_prime.c
>> +++ b/drivers/gpu/drm/drm_prime.c
>> @@ -28,6 +28,7 @@
>>
>>   #include <linux/export.h>
>>   #include <linux/dma-buf.h>
>> +#include <linux/dma-map-ops.h>
>>   #include <linux/rbtree.h>
>>   #include <linux/module.h>
>>
>> @@ -889,6 +890,7 @@ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
>>                  .size = obj->size,
>>                  .flags = flags,
>>                  .priv = obj,
>> +               .coherent = dev_is_dma_coherent(dev->dev),
> To set the coherent flag correctly, I think I'd need a way to override
> on a per buffer basis, since coherency is a property of the gpu
> pgtables (which in the msm case is an immutable property of the gem
> object).  We also have some awkwardness that drm->dev isn't actually
> the GPU, thanks to the kernels device model seeing a collection of
> other small devices shoehorned into a single drm device to fit
> userspace's view of the world.  So relying on drm->dev isn't really
> going to give sensible results.

Yeah, I have the same problem with amdgpu, where some buffers are snooped
while others aren't.

But this should be unproblematic since the flag can always be cleared by 
the driver later on (it just can't be set).
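
For the amdgpu-style case (device coherent, but some BOs allocated
uncached/USWC) that could look roughly like this (sketch only;
foo_bo_is_uncached() is a made-up stand-in for the driver's per-buffer
knowledge):

    struct dma_buf *foo_gem_prime_export(struct drm_gem_object *obj, int flags)
    {
            struct dma_buf *dmabuf = drm_gem_prime_export(obj, flags);

            /* Only ever clear the flag the common export code has set. */
            if (!IS_ERR(dmabuf) && foo_bo_is_uncached(obj))
                    dmabuf->dma_coherent = false;

            return dmabuf;
    }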

In addition to that, I've just noticed that armada, i915, omap and tegra use
their own DMA-buf export functions. MSM could do the same as well if the
device itself is marked as not coherent while some buffers are mapped
cache coherent.

Regards,
Christian.

> I guess msm could just bury our heads in the sand and continue to do
> things the way we have been (buffers that are mapped cached-coherent
> are only self-shared) but would be nice to catch if userspace tried to
> import one into (for ex) v4l2..
>
> BR,
> -R
>
>>                  .resv = obj->resv,
>>          };
>>
>> --
>> 2.25.1
>>



* Re: Try to address the DMA-buf coherency problem
  2022-10-20 12:13 Try to address the DMA-buf coherency problem Christian König
                   ` (2 preceding siblings ...)
  2022-10-20 12:13 ` [PATCH 3/3] media: videobuf2: set dma_coherent flag for DMA-buf Christian König
@ 2022-10-27 12:14 ` Christian König
  2022-10-28  8:09 ` Lucas Stach
  4 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-10-27 12:14 UTC (permalink / raw)
  To: l.stach, nicolas, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

Am 20.10.22 um 14:13 schrieb Christian König:
> Hi guys,
>
> after finding that we essentially have two separate worlds for coherent sharing
> of buffer through DMA-buf I thought I will tackle that problem a bit and at
> least allow the framework to reject attachments which won't work.
>
> So those patches here add a new dma_coherent flag to each DMA-buf object
> telling the framework that dev_is_dma_coherent() needs to return true for an
> importing device to be able to attach. Since we should always have a fallback
> path this should give userspace the chance to still keep the use case working,
> either by doing a CPU copy instead or reversing the roles of exporter and
> importer.
>
> For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
> return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
> the flag for their buffers if special handling like the USWC flag in amdgpu or
> the uncached allocations for radeon/nouveau are in use.
>
> Additional to that importers can also check the flag if they have some
> non-snooping operations like the special scanout case for amdgpu for example.
>
> The patches are only smoke tested and the solution isn't ideal, but as far as
> I can see should at least keep things working.

Gentle ping on this. Lucas, Daniel and Nicolas you have been rather 
active in the last discussion. Do you mind taking a look?

Thanks,
Christian.

>
> Please review and/or comment,
> Christian.
>
>



* Re: [PATCH 2/3] drm/prime: set the dma_coherent flag for export
  2022-10-20 14:56       ` Christian König
@ 2022-10-28  1:11         ` Rob Clark
  -1 siblings, 0 replies; 110+ messages in thread
From: Rob Clark @ 2022-10-28  1:11 UTC (permalink / raw)
  To: Christian König
  Cc: Christian König, l.stach, nicolas, ppaalanen, sumit.semwal,
	daniel, dri-devel, linaro-mm-sig, linux-media

On Thu, Oct 20, 2022 at 7:57 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 20.10.22 um 16:43 schrieb Rob Clark:
> > On Thu, Oct 20, 2022 at 5:13 AM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> When a device driver is snooping the CPU cache during access we assume
> >> that all importers need to be able to snoop the CPU cache as well.
> >>
> >> Signed-off-by: Christian König <christian.koenig@amd.com>
> >> ---
> >>   drivers/gpu/drm/drm_prime.c | 2 ++
> >>   1 file changed, 2 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> >> index 20e109a802ae..d5c70b6fe8a4 100644
> >> --- a/drivers/gpu/drm/drm_prime.c
> >> +++ b/drivers/gpu/drm/drm_prime.c
> >> @@ -28,6 +28,7 @@
> >>
> >>   #include <linux/export.h>
> >>   #include <linux/dma-buf.h>
> >> +#include <linux/dma-map-ops.h>
> >>   #include <linux/rbtree.h>
> >>   #include <linux/module.h>
> >>
> >> @@ -889,6 +890,7 @@ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj,
> >>                  .size = obj->size,
> >>                  .flags = flags,
> >>                  .priv = obj,
> >> +               .coherent = dev_is_dma_coherent(dev->dev),
> > To set the coherent flag correctly, I think I'd need a way to override
> > on a per buffer basis, since coherency is a property of the gpu
> > pgtables (which in the msm case is an immutable property of the gem
> > object).  We also have some awkwardness that drm->dev isn't actually
> > the GPU, thanks to the kernels device model seeing a collection of
> > other small devices shoehorned into a single drm device to fit
> > userspace's view of the world.  So relying on drm->dev isn't really
> > going to give sensible results.
>
> Yeah, I've the same problem for amdgpu where some buffers are snooped
> while others aren't.
>
> But this should be unproblematic since the flag can always be cleared by
> the driver later on (it just can't be set).
>
> Additional to that I've just noted that armada, i915, omap and tegra use
> their own DMA-buf export function. MSM could do the same as well if the
> device itself is marked as not coherent while some buffers are mapped
> cache coherent.

Yeah, I guess that would work. It would be a bit unfortunate to need
to use our own export function, but I guess it is a small price to pay,
and I like the overall idea, so a-b for the series.

For the VMM case it would be nice to expose this to userspace. I've
sent a patch to do this in an msm-specific way, and I guess at least
solving that problem for one driver is better than the current state
of "iff driver == "i915" { it's mmap'd cached } else { it's
writecombine }" in crosvm.

Admittedly the VMM case is rather special compared to your average
userspace dealing with dma-bufs, but it would be nice to get out of
the current situation where it has to make assumptions which are
quite possibly wrong. So giving the VMM some information, even if it
is "the cachability isn't static, you should bail now if your arch
can't cope", would be an improvement.  (For background, this case is
also somewhat specific to android/gralloc: for driver-allocated
buffers in a VM, with the native usermode driver (UMD) in the guest,
you still have some control within the UMD.)

BR,
-R


> Regards,
> Christian.
>
> > I guess msm could just bury our heads in the sand and continue to do
> > things the way we have been (buffers that are mapped cached-coherent
> > are only self-shared) but would be nice to catch if userspace tried to
> > import one into (for ex) v4l2..
> >
> > BR,
> > -R
> >
> >>                  .resv = obj->resv,
> >>          };
> >>
> >> --
> >> 2.25.1
> >>
>


* Re: Try to address the DMA-buf coherency problem
  2022-10-20 12:13 Try to address the DMA-buf coherency problem Christian König
                   ` (3 preceding siblings ...)
  2022-10-27 12:14 ` Try to address the DMA-buf coherency problem Christian König
@ 2022-10-28  8:09 ` Lucas Stach
  2022-10-28  8:40   ` Christian König
  4 siblings, 1 reply; 110+ messages in thread
From: Lucas Stach @ 2022-10-28  8:09 UTC (permalink / raw)
  To: Christian König, nicolas, ppaalanen, sumit.semwal, daniel,
	robdclark, dri-devel, linaro-mm-sig, linux-media

Hi Christian,

Am Donnerstag, dem 20.10.2022 um 14:13 +0200 schrieb Christian König:
> Hi guys,
> 
> after finding that we essentially have two separate worlds for coherent sharing
> of buffer through DMA-buf I thought I will tackle that problem a bit and at
> least allow the framework to reject attachments which won't work.
> 
> So those patches here add a new dma_coherent flag to each DMA-buf object
> telling the framework that dev_is_dma_coherent() needs to return true for an
> importing device to be able to attach. Since we should always have a fallback
> path this should give userspace the chance to still keep the use case working,
> either by doing a CPU copy instead or reversing the roles of exporter and
> importer.
> 
The fallback would likely be a CPU copy with the appropriate cache
flushes done by the device driver, which would be a massive performance
issue.
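
Just to make it concrete, such a fallback in userspace would look roughly
like the sketch below, bracketing the CPU access with the existing
DMA_BUF_IOCTL_SYNC ioctl so the kernel can do the required cache
maintenance (function name made up, error handling omitted):

    #include <linux/dma-buf.h>   /* DMA_BUF_IOCTL_SYNC, struct dma_buf_sync */
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    /* Copy the contents of a dma-buf into a plain CPU buffer. */
    static int copy_from_dmabuf(int fd, void *dst, size_t size)
    {
            struct dma_buf_sync sync = {
                    .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ,
            };
            void *src = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);

            if (src == MAP_FAILED)
                    return -1;

            ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);   /* begin CPU access */
            memcpy(dst, src, size);
            sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
            ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);   /* end CPU access */

            munmap(src, size);
            return 0;
    }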

> For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
> return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
> the flag for their buffers if special handling like the USWC flag in amdgpu or
> the uncached allocations for radeon/nouveau are in use.
> 
I don't think the V4L2 part works for most ARM systems. The default
there is for devices to be noncoherent unless explicitly marked
otherwise. I don't think any of the "devices" writing the video buffers
in cached memory with the CPU do this. While we could probably mark
them as coherent, I don't think this is moving in the right direction.

> Additional to that importers can also check the flag if they have some
> non-snooping operations like the special scanout case for amdgpu for example.
> 
> The patches are only smoke tested and the solution isn't ideal, but as far as
> I can see should at least keep things working.
> 
I would like to see this solved properly, where properly means
we make things work on systems with mixed coherent/noncoherent masters
in the same way the DMA API does on such systems: by inserting the
proper cache maintenance operations.
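
In kernel terms I mean the usual bracketed access, something like the
following sketch (assuming the importer already has a mapped sg_table for
the attachment; the foo_ helpers are made up):

    #include <linux/dma-mapping.h>

    /* Hand the buffer over to a (potentially non-snooping) device and
     * back, letting the DMA API insert the required cache maintenance. */
    static void foo_begin_device_access(struct device *dev, struct sg_table *sgt)
    {
            dma_sync_sgtable_for_device(dev, sgt, DMA_BIDIRECTIONAL);
    }

    static void foo_end_device_access(struct device *dev, struct sg_table *sgt)
    {
            dma_sync_sgtable_for_cpu(dev, sgt, DMA_BIDIRECTIONAL);
    }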

I also think that we should keep in mind that the world is moving in
a direction where DMA masters may not only snoop the CPU caches (which
is the definition of cache coherent on x86), but actually take part in
the system coherence and are able to have writeback caches of their own
for shared data. I can only speculate, as I haven't seen the amdgpu
side yet, but I think this proposal is moving in the other direction by
assuming a central system cache, where the importer has some magic way
to clean this central cache.

Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU
dma-buf sharing work on ARM systems and seem to hold some strong
opinions on how this should work, I guess I need to make some time
available to type it up, so we can discuss over code rather than
abstract ideas. If I come up with something that works for my use-cases,
would you be up for taking a shot at an amdgpu implementation?

Regards,
Lucas



* Re: Try to address the DMA-buf coherency problem
  2022-10-28  8:09 ` Lucas Stach
@ 2022-10-28  8:40   ` Christian König
  2022-10-28 11:42     ` Lucas Stach
  0 siblings, 1 reply; 110+ messages in thread
From: Christian König @ 2022-10-28  8:40 UTC (permalink / raw)
  To: Lucas Stach, nicolas, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

Hi Lucas,

Am 28.10.22 um 10:09 schrieb Lucas Stach:
> Hi Christian,
>
> Am Donnerstag, dem 20.10.2022 um 14:13 +0200 schrieb Christian König:
>> Hi guys,
>>
>> after finding that we essentially have two separate worlds for coherent sharing
>> of buffer through DMA-buf I thought I will tackle that problem a bit and at
>> least allow the framework to reject attachments which won't work.
>>
>> So those patches here add a new dma_coherent flag to each DMA-buf object
>> telling the framework that dev_is_dma_coherent() needs to return true for an
>> importing device to be able to attach. Since we should always have a fallback
>> path this should give userspace the chance to still keep the use case working,
>> either by doing a CPU copy instead or reversing the roles of exporter and
>> importer.
>>
> The fallback would likely be a CPU copy with the appropriate cache
> flushes done by the device driver, which would be a massiv performance
> issue.

But that is essentially the right thing to do. The only alternative I can see
is to reverse the roles of exporter and importer.

>> For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
>> return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
>> the flag for their buffers if special handling like the USWC flag in amdgpu or
>> the uncached allocations for radeon/nouveau are in use.
>>
> I don't think the V4L2 part works for most ARM systems. The default
> there is for devices to be noncoherent unless explicitly marked
> otherwise. I don't think any of the "devices" writing the video buffers
> in cached memory with the CPU do this. While we could probably mark
> them as coherent, I don't think this is moving in the right direction.

Well why not? Those devices are coherent in the sense of the DMA API in
that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.

We could come up with a better name for coherency, e.g. snooping for
example. But that is just a documentation detail.

>> Additional to that importers can also check the flag if they have some
>> non-snooping operations like the special scanout case for amdgpu for example.
>>
>> The patches are only smoke tested and the solution isn't ideal, but as far as
>> I can see should at least keep things working.
>>
> I would like to see this solved properly. Where I think properly means
> we make things work on systems with mixed coherent/noncoherent masters
> in the same way the DMA API does on such systems: by inserting the
> proper cache maintenance operations.

And this is exactly the wrong approach as far as I can see. As Daniel noted
as well, we absolutely need some kind of coherency between exporter and
importer.

This can either be provided by the PCI specification, which makes it
mandatory for devices to snoop the caches, or by platform devices agreeing
that they only work on uncached memory.

Explicit cache flush operations are simply not part of the design of
DMA-buf because they are not part of the design of the higher level APIs
like Vulkan and OpenGL.

Adding this to DMA-buf for the rather special use case would completely
break that and make life much more complicated for everybody.

> I also think that we should keep in mind that the world is moving into
> a direction where DMA masters may not only snoop the CPU caches (what
> is the definition of cache coherent on x86), but actually take part in
> the system coherence and are able to have writeback caches for shared
> data on their own. I can only speculate, as I haven't seen the amdgpu
> side yet, but I think this proposal is moving in the other direction by
> assuming a central system cache, where the importer has some magic way
> to clean this central cache.

What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link

And yes, we support that in a bunch of configurations and have also
worked with that and amdgpu with DMA-buf based sharing.

This should not be a problem with this approach.

> Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU
> dma-buf sharing work on ARM systems and seem to hold some strong
> opinions on how this should work, I guess I need to make some time
> available to type it up, so we can discuss over coder rather than
> abstract ideas. If I come up with something that works for my use-cases
> would you be up for taking a shot at a amdgpu implementation?

Well, not really. As I said, I see what you have in mind here as a
completely wrong approach which we will certainly not support in any GPU driver.

What we can do is to request the use case which won't work and this is 
exactly what the patch here does.

Regards,
Christian.

>
> Regards,
> Lucas
>



* Re: Try to address the DMA-buf coherency problem
  2022-10-28  8:40   ` Christian König
@ 2022-10-28 11:42     ` Lucas Stach
  2022-10-28 14:26       ` Christian König
  0 siblings, 1 reply; 110+ messages in thread
From: Lucas Stach @ 2022-10-28 11:42 UTC (permalink / raw)
  To: Christian König, nicolas, ppaalanen, sumit.semwal, daniel,
	robdclark, dri-devel, linaro-mm-sig, linux-media

Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:
> Hi Lucas,
> 
> Am 28.10.22 um 10:09 schrieb Lucas Stach:
> > Hi Christian,
> > 
> > Am Donnerstag, dem 20.10.2022 um 14:13 +0200 schrieb Christian König:
> > > Hi guys,
> > > 
> > > after finding that we essentially have two separate worlds for coherent sharing
> > > of buffer through DMA-buf I thought I will tackle that problem a bit and at
> > > least allow the framework to reject attachments which won't work.
> > > 
> > > So those patches here add a new dma_coherent flag to each DMA-buf object
> > > telling the framework that dev_is_dma_coherent() needs to return true for an
> > > importing device to be able to attach. Since we should always have a fallback
> > > path this should give userspace the chance to still keep the use case working,
> > > either by doing a CPU copy instead or reversing the roles of exporter and
> > > importer.
> > > 
> > The fallback would likely be a CPU copy with the appropriate cache
> > flushes done by the device driver, which would be a massiv performance
> > issue.
> 
> But essentially the right thing to do. The only alternative I can see is 
> to reverse the role of exporter and importer.
> 
I don't think that would work generally either, as buffer exporter and
importer isn't always a 1:1 thing. As soon as any attached importer has
a different coherency behavior than the others, things fall apart.

> > > For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
> > > return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
> > > the flag for their buffers if special handling like the USWC flag in amdgpu or
> > > the uncached allocations for radeon/nouveau are in use.
> > > 
> > I don't think the V4L2 part works for most ARM systems. The default
> > there is for devices to be noncoherent unless explicitly marked
> > otherwise. I don't think any of the "devices" writing the video buffers
> > in cached memory with the CPU do this. While we could probably mark
> > them as coherent, I don't think this is moving in the right direction.
> 
> Well why not? Those devices are coherent in the sense of the DMA API 
> that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.
> 
> We could come up with a better name for coherency, e.g. snooping for 
> example. But that is just an documentation detail.
> 
I agree that those devices copying data into a CPU cacheable buffer
should be marked as coherent; I'm just not sure right now whether other
things like DMA mappings are done on that device, which would require
the cache maintenance.

> > > Additional to that importers can also check the flag if they have some
> > > non-snooping operations like the special scanout case for amdgpu for example.
> > > 
> > > The patches are only smoke tested and the solution isn't ideal, but as far as
> > > I can see should at least keep things working.
> > > 
> > I would like to see this solved properly. Where I think properly means
> > we make things work on systems with mixed coherent/noncoherent masters
> > in the same way the DMA API does on such systems: by inserting the
> > proper cache maintenance operations.
> 
> And this the exact wrong approach as far as I can see. As Daniel noted 
> as well we absolutely need some kind of coherency between exporter and 
> importer.
> 
I think it's important that we are very specific about the thing we are
talking about here: I guess when you say coherency you mean hardware
enforced coherency on cacheable memory, which is the default on
x86/PCI.

The other way to enforce coherency is to either insert cache
maintenance operations, or make sure that the buffer is not cacheable
by any device taking part in the sharing, including the CPU.

> This can either be provided by the PCI specification which makes it 
> mandatory for device to snoop the caches or by platform devices agreeing 
> that they only work on uncached memory.

What you disregard here is the fact that there are many systems out
there with mixed topologies, where some masters are part of the
coherency domain and some are not.

We have two options here: either mandate that coherency for dma-bufs
needs to be established by hardware, which is the option that you
strongly prefer and which means forcing all buffers to be uncacheable in
a system with masters that are not coherent, or allow some form of
bracketed DMA access with cache maintenance ops.

> 
> Explicit cache flush operations are simple not part of the design of 
> DMA-buf because they are not part of the design of the higher level APIs 
> like Vulkan and OpenGL.

I'm aware that some graphics APIs have been living in a universe
blissfully unaware of systems without hardware enforced coherency. But
that isn't the only use for dma-bufs.

I totally agree that some higher level API primitives aren't possible
without coherency at the hardware level and for those uses we should
require either HW enforced coherency or uncachable memory. But I don't
think we should make things slow deliberately on systems that allow to
optimize things with the help of bracketed access.

If I understood things right your amdgpu use-case even falls into this
category: normally you would want to use cacheable memory for
everything, but just make sure to clean the caches before using the
buffer with the non-coherent display engine.

> 
> Adding this to DMA-buf for the rather special use case would completely 
> break that and make live much more complicated for everybody.
> 
> > I also think that we should keep in mind that the world is moving into
> > a direction where DMA masters may not only snoop the CPU caches (what
> > is the definition of cache coherent on x86), but actually take part in
> > the system coherence and are able to have writeback caches for shared
> > data on their own. I can only speculate, as I haven't seen the amdgpu
> > side yet, but I think this proposal is moving in the other direction by
> > assuming a central system cache, where the importer has some magic way
> > to clean this central cache.
> 
> What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link

Or ARM AMBA CHI.
> 
> And yes we support that in a bunch of configurations and also have 
> worked with that and amdgpu with DMA-buf based shared.
> 
> This should not be a problem with this approach.

It works as long as all masters sharing the buffer are accessing the
buffer through the HW cache coherence facilities provided by CXL. As
soon as a master wants to bypass it (like your nosnoop scanout) you
need some way to force all other masters sharing access to the buffer
to clean their caches.

> 
> > Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU
> > dma-buf sharing work on ARM systems and seem to hold some strong
> > opinions on how this should work, I guess I need to make some time
> > available to type it up, so we can discuss over coder rather than
> > abstract ideas. If I come up with something that works for my use-cases
> > would you be up for taking a shot at a amdgpu implementation?
> 
> Well, not really. As I said I see what you have in mind here as 
> completely wrong approach we will certainly not support in any GPU driver.
> 
> What we can do is to request the use case which won't work and this is 
> exactly what the patch here does.
> 
Did you mean to write "prevent the use case which won't work" here?

Regards,
Lucas



* Re: Try to address the DMA-buf coherency problem
  2022-10-28 11:42     ` Lucas Stach
@ 2022-10-28 14:26       ` Christian König
  2022-10-28 15:46         ` Nicolas Dufresne
  0 siblings, 1 reply; 110+ messages in thread
From: Christian König @ 2022-10-28 14:26 UTC (permalink / raw)
  To: Lucas Stach, nicolas, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

Am 28.10.22 um 13:42 schrieb Lucas Stach:
> Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:
>> But essentially the right thing to do. The only alternative I can see is
>> to reverse the role of exporter and importer.
>>
> I don't think that would work generally either, as buffer exporter and
> importer isn't always a 1:1 thing. As soon as any attached importer has
> a different coherency behavior than the others, things fall apart.

I've just mentioned it because somebody noted that when you reverse the 
roles of exporter and importer with the V4L driver and i915 then the use 
case suddenly starts working.

>>>> For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
>>>> return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
>>>> the flag for their buffers if special handling like the USWC flag in amdgpu or
>>>> the uncached allocations for radeon/nouveau are in use.
>>>>
>>> I don't think the V4L2 part works for most ARM systems. The default
>>> there is for devices to be noncoherent unless explicitly marked
>>> otherwise. I don't think any of the "devices" writing the video buffers
>>> in cached memory with the CPU do this. While we could probably mark
>>> them as coherent, I don't think this is moving in the right direction.
>> Well why not? Those devices are coherent in the sense of the DMA API
>> that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.
>>
>> We could come up with a better name for coherency, e.g. snooping for
>> example. But that is just an documentation detail.
>>
> I agree that those devices copying data into a CPU cacheable buffer
> should be marked as coherent, just not sure right now if other things
> like DMA mappings are done on that device, which would require the
> cache maintenance.

Yeah, good point.

>> And this the exact wrong approach as far as I can see. As Daniel noted
>> as well we absolutely need some kind of coherency between exporter and
>> importer.
>>
> I think it's important that we are very specific about the thing we are
> talking about here: I guess when you say coherency you mean hardware
> enforced coherency on cacheable memory, which is the default on
> x86/PCI.

Well, no. What I mean with coherency is that the devices don't need to
insert special operations to access each other's data.

This can be achieved by multiple approaches, e.g. by the PCI coherency
requirements, device internal connections (XGMI, NVLink, CXL etc...) as
well as by using uncached system memory.

The key point is that what we certainly don't want is special operations
which say: OK, now device A can access the data, now device B...
because this breaks tons of use cases.

> The other way to enforce coherency is to either insert cache
> maintenance operations, or make sure that the buffer is not cacheable
> by any device taking part in the sharing, including the CPU.

Yes and no. When we want the devices to interact with each other without 
the CPU then we need some negotiated coherency between the two.

>> This can either be provided by the PCI specification which makes it
>> mandatory for device to snoop the caches or by platform devices agreeing
>> that they only work on uncached memory.
> What you disregard here is the fact that there are many systems out
> there with mixed topologies, where some masters are part of the
> coherency domain and some are not.
>
> We have two options here: either mandate that coherency for dma-bufs
> need to be established by hardware, which is the option that you
> strongly prefer, which means forcing all buffers to be uncacheable in a
> system with masters that are not coherent, or allowing some form of
> bracketed DMA access with cache maintenance ops.

Well, I don't prefer that option, it's just the only one I can see. One
of the main goals of DMA-buf is to allow devices to share data without
the need for CPU interaction.

In other words, we negotiate the high-level properties and then the
devices can talk to each other without explicitly noting who is accessing
what.

And this concept is completely incompatible with maintenance ops. We 
made that mistake with SWIOTLB for the DMA API and I don't really want 
to repeat that stunt.

>> Explicit cache flush operations are simple not part of the design of
>> DMA-buf because they are not part of the design of the higher level APIs
>> like Vulkan and OpenGL.
> I'm aware that some graphics APIs have been living in a universe
> blissfully unaware of systems without hardware enforced coherency. But
> that isn't the only use for dma-bufs.

Yeah, but as far as I can see the other use cases are extremely limited.

>
> I totally agree that some higher level API primitives aren't possible
> without coherency at the hardware level and for those uses we should
> require either HW enforced coherency or uncachable memory. But I don't
> think we should make things slow deliberately on systems that allow to
> optimize things with the help of bracketed access.
>
> If I understood things right your amdgpu use-case even falls into this
> category: normally you would want to use cacheable memory for
> everything, but just make sure to clean the caches before using the
> buffer with the non-coherent display engine.

No, that won't work like this. The caching attributes must be coherent 
for the display engine to work correctly.

>> Adding this to DMA-buf for the rather special use case would completely
>> break that and make live much more complicated for everybody.
>>
>>> I also think that we should keep in mind that the world is moving into
>>> a direction where DMA masters may not only snoop the CPU caches (what
>>> is the definition of cache coherent on x86), but actually take part in
>>> the system coherence and are able to have writeback caches for shared
>>> data on their own. I can only speculate, as I haven't seen the amdgpu
>>> side yet, but I think this proposal is moving in the other direction by
>>> assuming a central system cache, where the importer has some magic way
>>> to clean this central cache.
>> What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link
> Or ARM AMBA CHI.
>> And yes we support that in a bunch of configurations and also have
>> worked with that and amdgpu with DMA-buf based shared.
>>
>> This should not be a problem with this approach.
> It works as long as all masters sharing the buffer are accessing the
> buffer through the HW cache coherence facilities provided by CXL. As
> soon as a master wants to bypass it (like your nosnoop scanout) you
> need some way to force all other masters sharing access to the buffer
> to clean their caches.

That won't work like this. The problem is that this is an APU, so the
display is part of the CPU. When either the MTRR or PAT says that the
physical address is cacheable, the engine might just hang on access.

>>> Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU
>>> dma-buf sharing work on ARM systems and seem to hold some strong
>>> opinions on how this should work, I guess I need to make some time
>>> available to type it up, so we can discuss over coder rather than
>>> abstract ideas. If I come up with something that works for my use-cases
>>> would you be up for taking a shot at a amdgpu implementation?
>> Well, not really. As I said I see what you have in mind here as
>> completely wrong approach we will certainly not support in any GPU driver.
>>
>> What we can do is to request the use case which won't work and this is
>> exactly what the patch here does.
>>
> Did you mean to write "prevent the use case which won't work" here?

Oh, yes. Too fast typing.

Regards,
Christian.

>
> Regards,
> Lucas
>



* Re: Try to address the DMA-buf coherency problem
  2022-10-28 14:26       ` Christian König
@ 2022-10-28 15:46         ` Nicolas Dufresne
  2022-10-28 17:50           ` Christian König
  0 siblings, 1 reply; 110+ messages in thread
From: Nicolas Dufresne @ 2022-10-28 15:46 UTC (permalink / raw)
  To: Christian König, Lucas Stach, ppaalanen, sumit.semwal,
	daniel, robdclark, dri-devel, linaro-mm-sig, linux-media

Hi,

just dropping some real-life use cases; sorry, I'm not really proposing
solutions, I believe you are much more knowledgeable in this regard.

Le vendredi 28 octobre 2022 à 16:26 +0200, Christian König a écrit :
> Am 28.10.22 um 13:42 schrieb Lucas Stach:
> > Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:
> > > But essentially the right thing to do. The only alternative I can see is
> > > to reverse the role of exporter and importer.
> > > 
> > I don't think that would work generally either, as buffer exporter and
> > importer isn't always a 1:1 thing. As soon as any attached importer has
> > a different coherency behavior than the others, things fall apart.
> 
> I've just mentioned it because somebody noted that when you reverse the 
> roles of exporter and importer with the V4L driver and i915 then the use 
> case suddenly starts working.

Though, it's not generically possible to reverse these roles. If you want to do
so, you end up having to do what Android (gralloc) and ChromeOS (minigbm) do,
because you will have to allocate DRM buffers that know about importer-specific
requirements. See link [1] for what it looks like for RK3399, with the Motion
Vector size calculation copied from the kernel driver into a userspace lib
(arguably that was available from V4L2 sizeimage, but this is technically
difficult to communicate within the software layers). If you could let the
decoder export (with proper cache management) the non-generic code would not be
needed.

Another case where reversing the roles is difficult is the case where you need
to multiplex the streams (let's use a camera to illustrate) and share that with
multiple processes. In these use cases, the DRM importers are volatile; which
one do you abuse to do the allocation from? In a multimedia server like
PipeWire, you are not really aware if the camera will be used by DRM or not, and
if something "special" is needed in terms of role inversion. It is relatively
easy to deal with matching modifiers, but using the downstream side
(display/GPU) as an exporter is always difficult (and requires some level of
abuse and guessing).

[1]
https://android.googlesource.com/platform/external/minigbm/+/refs/heads/master/rockchip.c#140

> 
> > > > > For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
> > > > > return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
> > > > > the flag for their buffers if special handling like the USWC flag in amdgpu or
> > > > > the uncached allocations for radeon/nouveau are in use.
> > > > > 
> > > > I don't think the V4L2 part works for most ARM systems. The default
> > > > there is for devices to be noncoherent unless explicitly marked
> > > > otherwise. I don't think any of the "devices" writing the video buffers
> > > > in cached memory with the CPU do this. While we could probably mark
> > > > them as coherent, I don't think this is moving in the right direction.
> > > Well why not? Those devices are coherent in the sense of the DMA API
> > > that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.
> > > 
> > > We could come up with a better name for coherency, e.g. snooping for
> > > example. But that is just an documentation detail.
> > > 
> > I agree that those devices copying data into a CPU cacheable buffer
> > should be marked as coherent, just not sure right now if other things
> > like DMA mappings are done on that device, which would require the
> > cache maintenance.
> 
> Yeah, good point.
> 
> > > And this the exact wrong approach as far as I can see. As Daniel noted
> > > as well we absolutely need some kind of coherency between exporter and
> > > importer.
> > > 
> > I think it's important that we are very specific about the thing we are
> > talking about here: I guess when you say coherency you mean hardware
> > enforced coherency on cacheable memory, which is the default on
> > x86/PCI.
> 
> Well, no. What I mean with coherency is that the devices don't need 
> insert special operation to access each others data.
> 
> This can be archived by multiple approaches, e.g. by the PCI coherency 
> requirements, device internal connections (XGMI, NVLink, CXL etc...) as 
> well as using uncached system memory.
> 
> The key point is what we certainly don't want is special operations 
> which say: Ok, now device A can access the data, now device B..... 
> because this breaks tons of use cases.

I'm coming back again with the multiplexing case. We keep having mixed use
cases with multiple receivers. In some cases, data may end up on the CPU while
also being encoded in HW. The current approach of disabling the cache does
work, but CPU algorithms truly suffer in performance. Doing a full memcpy to a
cached buffer helps, but remains slower than if the cache had been snooped by
the importing (encoder here) driver.
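
As a point of reference (and not as part of the proposal): userspace can
already bracket its own CPU accesses to a dma-buf with the existing
DMA_BUF_IOCTL_SYNC ioctl. A minimal sketch, assuming buf_fd is a dma-buf fd
obtained from some exporter, error handling omitted:

#include <sys/ioctl.h>
#include <linux/dma-buf.h>

static void cpu_read_bracketed(int buf_fd)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ,
	};

	ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);	/* begin CPU access */
	/* ... read from the mmap()ed buffer here ... */
	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
	ioctl(buf_fd, DMA_BUF_IOCTL_SYNC, &sync);	/* end CPU access */
}

The debate here is essentially whether this kind of explicit bracketing can be
generalized to device-to-device access, or whether it has to stay a CPU-only
concept.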

> 
> > The other way to enforce coherency is to either insert cache
> > maintenance operations, or make sure that the buffer is not cacheable
> > by any device taking part in the sharing, including the CPU.
> 
> Yes and no. When we want the devices to interact with each other without 
> the CPU then we need some negotiated coherency between the two.
> 
> > > This can either be provided by the PCI specification which makes it
> > > mandatory for device to snoop the caches or by platform devices agreeing
> > > that they only work on uncached memory.
> > What you disregard here is the fact that there are many systems out
> > there with mixed topologies, where some masters are part of the
> > coherency domain and some are not.
> > 
> > We have two options here: either mandate that coherency for dma-bufs
> > need to be established by hardware, which is the option that you
> > strongly prefer, which means forcing all buffers to be uncacheable in a
> > system with masters that are not coherent, or allowing some form of
> > bracketed DMA access with cache maintenance ops.
> 
> Well I don't prefer that option, it's just the only one I can see. One 
> of the main goals of DMA-buf is to allow device to share data without 
> the need for CPU interactions.
> 
> In other words we negotiate the high level properties and then the 
> device can talk to each other without explicitly noting who is accessing 
> what.
> 
> And this concept is completely incompatible with maintenance ops. We 
> made that mistake with SWIOTLB for the DMA API and I don't really want 
> to repeat that stunt.
> 
> > > Explicit cache flush operations are simple not part of the design of
> > > DMA-buf because they are not part of the design of the higher level APIs
> > > like Vulkan and OpenGL.
> > I'm aware that some graphics APIs have been living in a universe
> > blissfully unaware of systems without hardware enforced coherency. But
> > that isn't the only use for dma-bufs.
> 
> Yeah, but the other use cases are extremely limited. As far as I can see
> 
> > 
> > I totally agree that some higher level API primitives aren't possible
> > without coherency at the hardware level and for those uses we should
> > require either HW enforced coherency or uncachable memory. But I don't
> > think we should make things slow deliberately on systems that allow to
> > optimize things with the help of bracketed access.
> > 
> > If I understood things right your amdgpu use-case even falls into this
> > category: normally you would want to use cacheable memory for
> > everything, but just make sure to clean the caches before using the
> > buffer with the non-coherent display engine.
> 
> No, that won't work like this. The caching attributes must be coherent 
> for the display engine to work correctly.
> 
> > > Adding this to DMA-buf for the rather special use case would completely
> > > break that and make live much more complicated for everybody.
> > > 
> > > > I also think that we should keep in mind that the world is moving into
> > > > a direction where DMA masters may not only snoop the CPU caches (what
> > > > is the definition of cache coherent on x86), but actually take part in
> > > > the system coherence and are able to have writeback caches for shared
> > > > data on their own. I can only speculate, as I haven't seen the amdgpu
> > > > side yet, but I think this proposal is moving in the other direction by
> > > > assuming a central system cache, where the importer has some magic way
> > > > to clean this central cache.
> > > What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link
> > Or ARM AMBA CHI.
> > > And yes we support that in a bunch of configurations and also have
> > > worked with that and amdgpu with DMA-buf based shared.
> > > 
> > > This should not be a problem with this approach.
> > It works as long as all masters sharing the buffer are accessing the
> > buffer through the HW cache coherence facilities provided by CXL. As
> > soon as a master wants to bypass it (like your nosnoop scanout) you
> > need some way to force all other masters sharing access to the buffer
> > to clean their caches.
> 
> That won't work like this. The problem is that this is an APU and so the 
> display is part of the CPU. When either the MTRR or PAT says that the 
> physical address is cacheable the engine might just hang on access.
> 
> > > > Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU
> > > > dma-buf sharing work on ARM systems and seem to hold some strong
> > > > opinions on how this should work, I guess I need to make some time
> > > > available to type it up, so we can discuss over coder rather than
> > > > abstract ideas. If I come up with something that works for my use-cases
> > > > would you be up for taking a shot at a amdgpu implementation?
> > > Well, not really. As I said I see what you have in mind here as
> > > completely wrong approach we will certainly not support in any GPU driver.
> > > 
> > > What we can do is to request the use case which won't work and this is
> > > exactly what the patch here does.
> > > 
> > Did you mean to write "prevent the use case which won't work" here?
> 
> Oh, yes. To fast typing.
> 
> Regards,
> Christian.
> 
> > 
> > Regards,
> > Lucas
> > 
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-10-28 15:46         ` Nicolas Dufresne
@ 2022-10-28 17:50           ` Christian König
  2022-10-28 18:47               ` Daniel Stone
  0 siblings, 1 reply; 110+ messages in thread
From: Christian König @ 2022-10-28 17:50 UTC (permalink / raw)
  To: Nicolas Dufresne, Lucas Stach, ppaalanen, sumit.semwal, daniel,
	robdclark, dri-devel, linaro-mm-sig, linux-media

Hi Nicolas,

Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:
> Hi,
>
> just dropping some real live use case, sorry I'm not really proposing solutions,
> I believe you are much more knowledgeable in this regard.

Well, I think each of us has a lot of specialized knowledge. For example,
I don't know too much about gralloc/minigbm. So this input is very valuable.

> Le vendredi 28 octobre 2022 à 16:26 +0200, Christian König a écrit :
>> Am 28.10.22 um 13:42 schrieb Lucas Stach:
>>> Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:
>>>> But essentially the right thing to do. The only alternative I can see is
>>>> to reverse the role of exporter and importer.
>>>>
>>> I don't think that would work generally either, as buffer exporter and
>>> importer isn't always a 1:1 thing. As soon as any attached importer has
>>> a different coherency behavior than the others, things fall apart.
>> I've just mentioned it because somebody noted that when you reverse the
>> roles of exporter and importer with the V4L driver and i915 then the use
>> case suddenly starts working.
> Though, its not generically possible to reverse these roles. If you want to do
> so, you endup having to do like Android (gralloc) and ChromeOS (minigbm),
> because you will have to allocate DRM buffers that knows about importer specific
> requirements. See link [1] for what it looks like for RK3399, with Motion Vector
> size calculation copied from the kernel driver into a userspace lib (arguably
> that was available from V4L2 sizeimage, but this is technically difficult to
> communicate within the software layers). If you could let the decoder export
> (with proper cache management) the non-generic code would not be needed.

Yeah, but I can also reverse the argument:

Getting the parameters for V4L right so that we can share the image is 
tricky, but getting the parameters so that the stuff is actually 
directly displayable by GPUs is even trickier.

Essentially you need to look at both sides and how they interact to get to a
common ground, e.g. alignment, pitch, width/height, padding, etc.

Deciding which side to allocate from is just one step in this process.
For example, most dGPUs can't display directly from system memory at
all, but it is possible to allocate the DMA-buf through the GPU driver
and then write into device memory with P2P PCI transfers.

So as far as I can see switching importer and exporter roles and even 
having performant extra fallbacks should be a standard feature of userspace.
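
A rough userspace sketch of that role reversal, assuming a display-capable DRM
node at /dev/dri/card0 and a V4L2 capture device at /dev/video0 (placeholder
paths and format, no error handling, rest of the V4L2 setup omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>
#include <xf86drm.h>

int main(void)
{
	int drm_fd = open("/dev/dri/card0", O_RDWR);
	int v4l_fd = open("/dev/video0", O_RDWR);

	/* Allocate the buffer on the DRM/display side ... */
	struct drm_mode_create_dumb create = {
		.width = 1920, .height = 1080, .bpp = 32,
	};
	drmIoctl(drm_fd, DRM_IOCTL_MODE_CREATE_DUMB, &create);

	/* ... export it as a dma-buf ... */
	int prime_fd;
	drmPrimeHandleToFD(drm_fd, create.handle, DRM_CLOEXEC | DRM_RDWR,
			   &prime_fd);

	/* ... and queue it on the V4L2 side as an imported DMABUF. */
	struct v4l2_requestbuffers req = {
		.count = 1,
		.type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
		.memory = V4L2_MEMORY_DMABUF,
	};
	ioctl(v4l_fd, VIDIOC_REQBUFS, &req);

	struct v4l2_buffer buf = {
		.index = 0,
		.type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
		.memory = V4L2_MEMORY_DMABUF,
	};
	buf.m.fd = prime_fd;
	ioctl(v4l_fd, VIDIOC_QBUF, &buf);

	return 0;
}

This is roughly the kind of fallback path userspace would need when the
"natural" exporter cannot satisfy the importer's placement or coherency
requirements.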

> Another case where reversing the role is difficult is for case where you need to
> multiplex the streams (let's use a camera to illustrate) and share that with
> multiple processes. In these uses case, the DRM importers are volatile, which
> one do you abuse to do allocation from ? In multimedia server like PipeWire, you
> are not really aware if the camera will be used by DRM or not, and if something
> "special" is needed in term of role inversion. It is relatively easy to deal
> with matching modifiers, but using downstream (display/gpu) as an exporter is
> always difficult (and require some level of abuse and guessing).

Oh, very good point! Yeah we do have use cases for this where an input 
buffer is both displayed as well as encoded.

>
>> Well, no. What I mean with coherency is that the devices don't need
>> insert special operation to access each others data.
>>
>> This can be archived by multiple approaches, e.g. by the PCI coherency
>> requirements, device internal connections (XGMI, NVLink, CXL etc...) as
>> well as using uncached system memory.
>>
>> The key point is what we certainly don't want is special operations
>> which say: Ok, now device A can access the data, now device B.....
>> because this breaks tons of use cases.
> I'm coming back again with the multiplexing case. We keep having mixed uses case
> with multiple receiver. In some case, data may endup on CPU while being encoded
> in HW. Current approach of disabling cache does work, but CPU algorithm truly
> suffer in performance. Doing a full memcpy to a cached buffer helps, but remains
> slower then if the cache had been snooped by the importer (encoder here) driver.

Yeah, that was another reason why we ended up having an extra copy
rather than working with uncached buffers for display as well.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-10-28 17:50           ` Christian König
@ 2022-10-28 18:47               ` Daniel Stone
  0 siblings, 0 replies; 110+ messages in thread
From: Daniel Stone @ 2022-10-28 18:47 UTC (permalink / raw)
  To: Christian König
  Cc: Nicolas Dufresne, Lucas Stach, ppaalanen, sumit.semwal, daniel,
	robdclark, dri-devel, linaro-mm-sig, linux-media

Hi Christian,

On Fri, 28 Oct 2022 at 18:50, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
> Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:
> > Though, its not generically possible to reverse these roles. If you want to do
> > so, you endup having to do like Android (gralloc) and ChromeOS (minigbm),
> > because you will have to allocate DRM buffers that knows about importer specific
> > requirements. See link [1] for what it looks like for RK3399, with Motion Vector
> > size calculation copied from the kernel driver into a userspace lib (arguably
> > that was available from V4L2 sizeimage, but this is technically difficult to
> > communicate within the software layers). If you could let the decoder export
> > (with proper cache management) the non-generic code would not be needed.
>
> Yeah, but I can also reverse the argument:
>
> Getting the parameters for V4L right so that we can share the image is
> tricky, but getting the parameters so that the stuff is actually
> directly displayable by GPUs is even trickier.
>
> Essentially you need to look at both sides and interference to get to a
> common ground, e.g. alignment, pitch, width/height, padding, etc.....
>
> Deciding from which side to allocate from is just one step in this
> process. For example most dGPUs can't display directly from system
> memory altogether, but it is possible to allocate the DMA-buf through
> the GPU driver and then write into device memory with P2P PCI transfers.
>
> So as far as I can see switching importer and exporter roles and even
> having performant extra fallbacks should be a standard feature of userspace.
>
> > Another case where reversing the role is difficult is for case where you need to
> > multiplex the streams (let's use a camera to illustrate) and share that with
> > multiple processes. In these uses case, the DRM importers are volatile, which
> > one do you abuse to do allocation from ? In multimedia server like PipeWire, you
> > are not really aware if the camera will be used by DRM or not, and if something
> > "special" is needed in term of role inversion. It is relatively easy to deal
> > with matching modifiers, but using downstream (display/gpu) as an exporter is
> > always difficult (and require some level of abuse and guessing).
>
> Oh, very good point! Yeah we do have use cases for this where an input
> buffer is both displayed as well as encoded.

This is the main issue, yeah.

For a standard media player, they would try to allocate through V4L2
and decode through that into locally-allocated buffers. All they know
is that there's a Wayland server at the other end of a socket
somewhere which will want to import the FD. The server does give you
some hints along the way: it will tell you that importing into a
particular GPU target device is necessary as the ultimate fallback,
and importing into a particular KMS device is preferable as the
optimal path to hit an overlay.
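
In concrete terms the default path looks roughly like the sketch below: the
V4L2 side owns the allocation, exports each buffer as a dma-buf, and the fd is
then offered to the compositor or imported into KMS. This is only an
illustration (placeholder buffer index, no error handling):

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>
#include <xf86drm.h>

/* Export a previously set-up V4L2_MEMORY_MMAP capture buffer as a dma-buf. */
static int export_capture_buffer(int v4l_fd, unsigned int index)
{
	struct v4l2_exportbuffer exp = {
		.type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
		.index = index,
		.flags = O_CLOEXEC,
	};

	ioctl(v4l_fd, VIDIOC_EXPBUF, &exp);
	return exp.fd;	/* hand this fd to the Wayland server / KMS device */
}

/* On the display side the server would then try something like this,
 * followed by drmModeAddFB2WithModifiers() to make it scanout-able. */
static int import_into_kms(int drm_fd, int dmabuf_fd, uint32_t *handle)
{
	return drmPrimeFDToHandle(drm_fd, dmabuf_fd, handle);
}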

So let's say that the V4L2 client does what you're proposing: it
allocates a buffer chain, schedules a decode into that buffer, and
passes it along to the server to import. The server fails to import
the buffer into the GPU, and tells the client this. The client then
... well, it doesn't know that it needs to allocate within the GPU
instead, but it knows that doing so might be one thing which would
make the request succeed.

But the client is just a video player. It doesn't understand how to
allocate BOs for Panfrost or AMD or etnaviv. So without a universal
allocator (again ...), 'just allocate on the GPU' isn't a useful
response to the client.

I fully understand your point about APIs like Vulkan not sensibly
allowing bracketing, and that's fine. On the other hand, a lot of
extant usecases (camera/codec -> GPU/display, GPU -> codec, etc) on
Arm just cannot fulfill complete coherency. On a lot of these
platforms, despite what you might think about the CPU/GPU
capabilities, the bottleneck is _always_ memory bandwidth, so
mandating extra copies is an absolute non-starter, and would instantly
cripple billions of devices. Lucas has been pretty gentle, but to be
more clear, this is not an option and won't be for at least the next
decade.

So we obviously need a third way at this point, because 'all devices
must always be coherent' vs. 'cache must be an unknown' can't work.
How about this as a suggestion: we have some unused flags in the PRIME
ioctls. Can we add a flag for 'import must be coherent'?

That flag wouldn't be set for the existing ecosystem
Lucas/Nicolas/myself are talking about, where we have explicit
handover points and users are fully able to perform cache maintenance.
For newer APIs where it's not possible to properly express that
bracketing, they would always set that flag (unless we add an API
carve-out where the client promises to do whatever is required to
maintain that).
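
To be clear about the shape of the suggestion: the PRIME ioctls already carry a
flags field (struct drm_prime_handle), so a purely hypothetical import could
look like the sketch below. The flag name and value are invented here just for
illustration; nothing like it exists in the uapi today.

#include <stdint.h>
#include <xf86drm.h>

/* HYPOTHETICAL flag, not part of the uapi; for illustration only. */
#define DRM_PRIME_IMPORT_COHERENT (1 << 2)

static int import_requiring_coherency(int drm_fd, int dmabuf_fd,
				      uint32_t *handle)
{
	struct drm_prime_handle args = {
		.fd = dmabuf_fd,
		.flags = DRM_PRIME_IMPORT_COHERENT,	/* "import must be coherent" */
	};
	int ret = drmIoctl(drm_fd, DRM_IOCTL_PRIME_FD_TO_HANDLE, &args);

	if (!ret)
		*handle = args.handle;
	return ret;
}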

Would that be viable?

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-10-28 18:47               ` Daniel Stone
@ 2022-11-01 17:40                 ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-01 17:40 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Nicolas Dufresne, Lucas Stach, ppaalanen, sumit.semwal, daniel,
	robdclark, dri-devel, linaro-mm-sig, linux-media

Am 28.10.22 um 20:47 schrieb Daniel Stone:
> Hi Christian,
>
> On Fri, 28 Oct 2022 at 18:50, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:
>>> Though, its not generically possible to reverse these roles. If you want to do
>>> so, you endup having to do like Android (gralloc) and ChromeOS (minigbm),
>>> because you will have to allocate DRM buffers that knows about importer specific
>>> requirements. See link [1] for what it looks like for RK3399, with Motion Vector
>>> size calculation copied from the kernel driver into a userspace lib (arguably
>>> that was available from V4L2 sizeimage, but this is technically difficult to
>>> communicate within the software layers). If you could let the decoder export
>>> (with proper cache management) the non-generic code would not be needed.
>> Yeah, but I can also reverse the argument:
>>
>> Getting the parameters for V4L right so that we can share the image is
>> tricky, but getting the parameters so that the stuff is actually
>> directly displayable by GPUs is even trickier.
>>
>> Essentially you need to look at both sides and interference to get to a
>> common ground, e.g. alignment, pitch, width/height, padding, etc.....
>>
>> Deciding from which side to allocate from is just one step in this
>> process. For example most dGPUs can't display directly from system
>> memory altogether, but it is possible to allocate the DMA-buf through
>> the GPU driver and then write into device memory with P2P PCI transfers.
>>
>> So as far as I can see switching importer and exporter roles and even
>> having performant extra fallbacks should be a standard feature of userspace.
>>
>>> Another case where reversing the role is difficult is for case where you need to
>>> multiplex the streams (let's use a camera to illustrate) and share that with
>>> multiple processes. In these uses case, the DRM importers are volatile, which
>>> one do you abuse to do allocation from ? In multimedia server like PipeWire, you
>>> are not really aware if the camera will be used by DRM or not, and if something
>>> "special" is needed in term of role inversion. It is relatively easy to deal
>>> with matching modifiers, but using downstream (display/gpu) as an exporter is
>>> always difficult (and require some level of abuse and guessing).
>> Oh, very good point! Yeah we do have use cases for this where an input
>> buffer is both displayed as well as encoded.
> This is the main issue, yeah.
>
> For a standard media player, they would try to allocate through V4L2
> and decode through that into locally-allocated buffers. All they know
> is that there's a Wayland server at the other end of a socket
> somewhere which will want to import the FD. The server does give you
> some hints along the way: it will tell you that importing into a
> particular GPU target device is necessary as the ultimate fallback,
> and importing into a particular KMS device is preferable as the
> optimal path to hit an overlay.
>
> So let's say that the V4L2 client does what you're proposing: it
> allocates a buffer chain, schedules a decode into that buffer, and
> passes it along to the server to import. The server fails to import
> the buffer into the GPU, and tells the client this. The client then
> ... well, it doesn't know that it needs to allocate within the GPU
> instead, but it knows that doing so might be one thing which would
> make the request succeed.
>
> But the client is just a video player. It doesn't understand how to
> allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> allocator (again ...), 'just allocate on the GPU' isn't a useful
> response to the client.

Well exactly that's the point I'm raising: The client *must* understand 
that!

See, we need to be able to handle all restrictions here; coherency of the 
data is just one of them.

For example, the much more important question is the location of the
data, and for that, allocating from the V4L2 device is in most cases
just not going to fly.

The more common case is that you need to allocate from the GPU and then 
import that into the V4L2 device. The background is that all dGPUs I 
know of need the data inside local memory (VRAM) to be able to scan out 
from it.

> I fully understand your point about APIs like Vulkan not sensibly
> allowing bracketing, and that's fine. On the other hand, a lot of
> extant usecases (camera/codec -> GPU/display, GPU -> codec, etc) on
> Arm just cannot fulfill complete coherency. On a lot of these
> platforms, despite what you might think about the CPU/GPU
> capabilities, the bottleneck is _always_ memory bandwidth, so
> mandating extra copies is an absolute non-starter, and would instantly
> cripple billions of devices. Lucas has been pretty gentle, but to be
> more clear, this is not an option and won't be for at least the next
> decade.

Well x86 pretty much has the same restrictions.

For example, the scanout buffer is almost always in local memory because
you often scan out at up to 120Hz while your recording is only 30fps and
most of the time at a lower resolution.

Pumping all that data 120 times a second over the PCIe bus would just not
be doable in a lot of use cases.
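
(For scale, assuming a 3840x2160 ARGB8888 scanout as an example: 3840 * 2160 *
4 bytes is roughly 33 MB per frame, so refreshing it 120 times a second means
close to 4 GB/s of sustained reads for scanout alone, before any other traffic
on the bus.)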

> So we obviously need a third way at this point, because 'all devices
> must always be coherent' vs. 'cache must be an unknown' can't work.
> How about this as a suggestion: we have some unused flags in the PRIME
> ioctls. Can we add a flag for 'import must be coherent'?

That's pretty much exactly what my patch set does. It just keeps 
userspace out of the way and says that creating the initial connection 
between the devices fails if they can't talk directly with each other.

Maybe we should move that into userspace so that the involved components
know beforehand that a certain approach won't work?
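
For reference, the kernel-side check the series adds at attach time boils down
to something like the sketch below (paraphrased from the cover letter, not the
exact patch; the helper name is made up):

/* Sketch only: reject importers whose device is not DMA coherent when
 * the exporter has declared the buffer as requiring coherent access.
 * Assumes dma-buf.c context, i.e. <linux/dma-map-ops.h> is available.
 */
static int dma_buf_attach_check_coherency(struct dma_buf *dmabuf,
					  struct device *dev)
{
	if (dmabuf->dma_coherent && !dev_is_dma_coherent(dev))
		return -EINVAL;	/* userspace falls back: CPU copy or swapped roles */

	return 0;
}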

> That flag wouldn't be set for the existing ecosystem
> Lucas/Nicolas/myself are talking about, where we have explicit
> handover points and users are fully able to perform cache maintenance.
> For newer APIs where it's not possible to properly express that
> bracketing, they would always set that flag (unless we add an API
> carve-out where the client promises to do whatever is required to
> maintain that).
>
> Would that be viable?

No, as I said. Explicit handover points are just an absolute no-go. We
just have way too many use cases which don't work with that idea.

As I said, we made the same mistake with the DMA API, and even more than
20 years later we are still running into problems because of that.

Just try to run any dGPU under a Xen hypervisor with memory
fragmentation for a very good example of why this is such a bad idea.

Regards,
Christian.

>
> Cheers,
> Daniel


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-01 17:40                 ` Christian König
@ 2022-11-01 21:09                   ` Nicolas Dufresne
  -1 siblings, 0 replies; 110+ messages in thread
From: Nicolas Dufresne @ 2022-11-01 21:09 UTC (permalink / raw)
  To: Christian König, Daniel Stone
  Cc: Lucas Stach, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

Le mardi 01 novembre 2022 à 18:40 +0100, Christian König a écrit :
> Am 28.10.22 um 20:47 schrieb Daniel Stone:
> > Hi Christian,
> > 
> > On Fri, 28 Oct 2022 at 18:50, Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> > > Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:
> > > > Though, its not generically possible to reverse these roles. If you want to do
> > > > so, you endup having to do like Android (gralloc) and ChromeOS (minigbm),
> > > > because you will have to allocate DRM buffers that knows about importer specific
> > > > requirements. See link [1] for what it looks like for RK3399, with Motion Vector
> > > > size calculation copied from the kernel driver into a userspace lib (arguably
> > > > that was available from V4L2 sizeimage, but this is technically difficult to
> > > > communicate within the software layers). If you could let the decoder export
> > > > (with proper cache management) the non-generic code would not be needed.
> > > Yeah, but I can also reverse the argument:
> > > 
> > > Getting the parameters for V4L right so that we can share the image is
> > > tricky, but getting the parameters so that the stuff is actually
> > > directly displayable by GPUs is even trickier.
> > > 
> > > Essentially you need to look at both sides and interference to get to a
> > > common ground, e.g. alignment, pitch, width/height, padding, etc.....
> > > 
> > > Deciding from which side to allocate from is just one step in this
> > > process. For example most dGPUs can't display directly from system
> > > memory altogether, but it is possible to allocate the DMA-buf through
> > > the GPU driver and then write into device memory with P2P PCI transfers.
> > > 
> > > So as far as I can see switching importer and exporter roles and even
> > > having performant extra fallbacks should be a standard feature of userspace.
> > > 
> > > > Another case where reversing the role is difficult is for case where you need to
> > > > multiplex the streams (let's use a camera to illustrate) and share that with
> > > > multiple processes. In these uses case, the DRM importers are volatile, which
> > > > one do you abuse to do allocation from ? In multimedia server like PipeWire, you
> > > > are not really aware if the camera will be used by DRM or not, and if something
> > > > "special" is needed in term of role inversion. It is relatively easy to deal
> > > > with matching modifiers, but using downstream (display/gpu) as an exporter is
> > > > always difficult (and require some level of abuse and guessing).
> > > Oh, very good point! Yeah we do have use cases for this where an input
> > > buffer is both displayed as well as encoded.
> > This is the main issue, yeah.
> > 
> > For a standard media player, they would try to allocate through V4L2
> > and decode through that into locally-allocated buffers. All they know
> > is that there's a Wayland server at the other end of a socket
> > somewhere which will want to import the FD. The server does give you
> > some hints along the way: it will tell you that importing into a
> > particular GPU target device is necessary as the ultimate fallback,
> > and importing into a particular KMS device is preferable as the
> > optimal path to hit an overlay.
> > 
> > So let's say that the V4L2 client does what you're proposing: it
> > allocates a buffer chain, schedules a decode into that buffer, and
> > passes it along to the server to import. The server fails to import
> > the buffer into the GPU, and tells the client this. The client then
> > ... well, it doesn't know that it needs to allocate within the GPU
> > instead, but it knows that doing so might be one thing which would
> > make the request succeed.
> > 
> > But the client is just a video player. It doesn't understand how to
> > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > response to the client.
> 
> Well exactly that's the point I'm raising: The client *must* understand 
> that!
> 
> See we need to be able to handle all restrictions here, coherency of the 
> data is just one of them.
> 
> For example the much more important question is the location of the data 
> and for this allocating from the V4L2 device is in most cases just not 
> going to fly.

It feels like this is a generic statement and there is no reason it could not
be the other way around. I have a colleague who integrated a PCIe CODEC (Blaize
Xplorer X1600P PCIe Accelerator) hosting its own RAM. There was a large number
of ways to use it. Of course, in the current state of DMA-buf, you have to be
the exporter to do anything fancy, but it did not have to be like this; it's a
design choice. I'm not sure what the final method used was in the end; the
driver isn't upstream yet, so maybe that is not even final. What I know is that
there are various conditions you may use the CODEC in, for which the optimal
location will vary. As an example, using the post processor or not; see my next
comment for more details.

> 
> The more common case is that you need to allocate from the GPU and then 
> import that into the V4L2 device. The background is that all dGPUs I 
> know of need the data inside local memory (VRAM) to be able to scan out 
> from it.

The reality is that what is common to you might not be to others. In my work,
most ARM SoCs have displays that just handle direct scanout from cameras and
codecs. The only case that commonly fails is whenever we try to display
UVC-created dmabufs, which have dirty CPU write caches, and this is the type of
thing we'd like to see solved. I think this series was addressing it in
principle, but by failing the import, and the point raised is that this wasn't
the optimal way.

There is a community project called LibreELEC, if you aren't aware; they run
Kodi with direct scanout of the video stream on a wide variety of SoCs, and
they use the CODEC as the exporter all the time. They simply don't have cases
where the opposite is needed (or any kind of remote RAM to deal with). In fact,
FFmpeg does not really offer you any API to reverse the allocation.

> 
> > I fully understand your point about APIs like Vulkan not sensibly
> > allowing bracketing, and that's fine. On the other hand, a lot of
> > extant usecases (camera/codec -> GPU/display, GPU -> codec, etc) on
> > Arm just cannot fulfill complete coherency. On a lot of these
> > platforms, despite what you might think about the CPU/GPU
> > capabilities, the bottleneck is _always_ memory bandwidth, so
> > mandating extra copies is an absolute non-starter, and would instantly
> > cripple billions of devices. Lucas has been pretty gentle, but to be
> > more clear, this is not an option and won't be for at least the next
> > decade.
> 
> Well x86 pretty much has the same restrictions.
> 
> For example the scanout buffer is usually always in local memory because 
> you often scan out at up to 120Hz while your recording is only 30fps and 
> most of the time lower resolution.
> 
> Pumping all that data 120 time a second over the PCIe bus would just not 
> be doable in a lot of use cases.

This is a good point for this case. Though, the effect of using remote RAM in a
CODEC can be very dramatic. In some cases, the buffers you are going to display
are what we call the reference frames. That means that while you are displaying
these, the CODEC needs to read from them in order to construct the following
frames. Most of the time, reads are massively slower with remote RAM, and
over-uploading, like you describe here, is going to be the most optimal path.

Note that in some other cases, the buffers are so-called secondary buffers,
which are the output of a post processor embedded in the CODEC. In that case,
remote RAM may be fine; it really depends on the write speed (which is usually
really good compared to reads). So yes, in the case of a high refresh rate, a
CODEC with a post processor may do a better job (if you have a single display
path for that buffer).

p.s. Note that the reason we support reference frame display even when a
secondary buffer is possible is often because we are limited on memory
bandwidth. For, let's say, 4K60, a secondary render will require an extra 18gb
of writes (on the best platform; it might require equivalent reads on other
platforms, like MediaTek, Samsung Exynos and some others).
> 
> > So we obviously need a third way at this point, because 'all devices
> > must always be coherent' vs. 'cache must be an unknown' can't work.
> > How about this as a suggestion: we have some unused flags in the PRIME
> > ioctls. Can we add a flag for 'import must be coherent'?
> 
> That's pretty much exactly what my patch set does. It just keeps 
> userspace out of the way and says that creating the initial connection 
> between the devices fails if they can't talk directly with each other.
> 
> Maybe we should move that into userspace so that the involved components 
> know of hand that a certain approach won't work?

Anything that can be predicted without trial and error is a great idea for
sure. Though, we have to be realistic: there is no guaranteed way to be sure
other than trying. So I would not be too worried. Imho, I lacked the time to
try it out, but the current implementation should in theory make software like
GStreamer fall back to memcpy for simple cases like UVC cameras. A very simple
GStreamer pipeline most folks can run (and that has usually produced artifacts
for years) is:

gst-launch-1.0 v4l2src ! queue ! kmssink


p.s. you need to turn off any DRM master; this may not work on a multi-display
setup. kmssink is very limited, but it does implement a memcpy fallback if the
dmabuf import fails (it's just that the import currently does not fail).

> 
> > That flag wouldn't be set for the existing ecosystem
> > Lucas/Nicolas/myself are talking about, where we have explicit
> > handover points and users are fully able to perform cache maintenance.
> > For newer APIs where it's not possible to properly express that
> > bracketing, they would always set that flag (unless we add an API
> > carve-out where the client promises to do whatever is required to
> > maintain that).
> > 
> > Would that be viable?
> 
> No, as I said. Explicit handover points are just an absolutely no-go. We 
> just have way to many use cases which don't work with that idea.
> 
> As I said we made the same mistake with the DMA-Api and even more 20 
> years later are still running into problems because of that.
> 
> Just try to run any dGPU under a XEN hypervisor with memory 
> fragmentation for a very good example why this is such a bad idea.
> 
> Regards,
> Christian.
> 
> > 
> > Cheers,
> > Daniel
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
@ 2022-11-01 21:09                   ` Nicolas Dufresne
  0 siblings, 0 replies; 110+ messages in thread
From: Nicolas Dufresne @ 2022-11-01 21:09 UTC (permalink / raw)
  To: Christian König, Daniel Stone
  Cc: ppaalanen, dri-devel, linaro-mm-sig, linux-media, sumit.semwal

Le mardi 01 novembre 2022 à 18:40 +0100, Christian König a écrit :
> Am 28.10.22 um 20:47 schrieb Daniel Stone:
> > Hi Christian,
> > 
> > On Fri, 28 Oct 2022 at 18:50, Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> > > Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:
> > > > Though, its not generically possible to reverse these roles. If you want to do
> > > > so, you endup having to do like Android (gralloc) and ChromeOS (minigbm),
> > > > because you will have to allocate DRM buffers that knows about importer specific
> > > > requirements. See link [1] for what it looks like for RK3399, with Motion Vector
> > > > size calculation copied from the kernel driver into a userspace lib (arguably
> > > > that was available from V4L2 sizeimage, but this is technically difficult to
> > > > communicate within the software layers). If you could let the decoder export
> > > > (with proper cache management) the non-generic code would not be needed.
> > > Yeah, but I can also reverse the argument:
> > > 
> > > Getting the parameters for V4L right so that we can share the image is
> > > tricky, but getting the parameters so that the stuff is actually
> > > directly displayable by GPUs is even trickier.
> > > 
> > > Essentially you need to look at both sides and interference to get to a
> > > common ground, e.g. alignment, pitch, width/height, padding, etc.....
> > > 
> > > Deciding from which side to allocate from is just one step in this
> > > process. For example most dGPUs can't display directly from system
> > > memory altogether, but it is possible to allocate the DMA-buf through
> > > the GPU driver and then write into device memory with P2P PCI transfers.
> > > 
> > > So as far as I can see switching importer and exporter roles and even
> > > having performant extra fallbacks should be a standard feature of userspace.
> > > 
> > > > Another case where reversing the role is difficult is for case where you need to
> > > > multiplex the streams (let's use a camera to illustrate) and share that with
> > > > multiple processes. In these uses case, the DRM importers are volatile, which
> > > > one do you abuse to do allocation from ? In multimedia server like PipeWire, you
> > > > are not really aware if the camera will be used by DRM or not, and if something
> > > > "special" is needed in term of role inversion. It is relatively easy to deal
> > > > with matching modifiers, but using downstream (display/gpu) as an exporter is
> > > > always difficult (and require some level of abuse and guessing).
> > > Oh, very good point! Yeah we do have use cases for this where an input
> > > buffer is both displayed as well as encoded.
> > This is the main issue, yeah.
> > 
> > For a standard media player, they would try to allocate through V4L2
> > and decode through that into locally-allocated buffers. All they know
> > is that there's a Wayland server at the other end of a socket
> > somewhere which will want to import the FD. The server does give you
> > some hints along the way: it will tell you that importing into a
> > particular GPU target device is necessary as the ultimate fallback,
> > and importing into a particular KMS device is preferable as the
> > optimal path to hit an overlay.
> > 
> > So let's say that the V4L2 client does what you're proposing: it
> > allocates a buffer chain, schedules a decode into that buffer, and
> > passes it along to the server to import. The server fails to import
> > the buffer into the GPU, and tells the client this. The client then
> > ... well, it doesn't know that it needs to allocate within the GPU
> > instead, but it knows that doing so might be one thing which would
> > make the request succeed.
> > 
> > But the client is just a video player. It doesn't understand how to
> > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > response to the client.
> 
> Well exactly that's the point I'm raising: The client *must* understand 
> that!
> 
> See we need to be able to handle all restrictions here, coherency of the 
> data is just one of them.
> 
> For example the much more important question is the location of the data 
> and for this allocating from the V4L2 device is in most cases just not 
> going to fly.

It feels like this is a generic statement and there is no reason it could not be
the other way around. I have colleague who integrated PCIe CODEC (Blaize Xplorer
X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways
to use it. Of course, in current state of DMABuf, you have to be an exporter to
do anything fancy, but it did not have to be like this, its a design choice. I'm
not sure in the end what was the final method used, the driver isn't yet
upstream, so maybe that is not even final. What I know is that there is various
condition you may use the CODEC for which the optimal location will vary. As an
example, using the post processor or not, see my next comment for more details.

> 
> The more common case is that you need to allocate from the GPU and then 
> import that into the V4L2 device. The background is that all dGPUs I 
> know of need the data inside local memory (VRAM) to be able to scan out 
> from it.

The reality is that what is common to you, might not be to others. In my work,
most ARM SoC have display that just handle direct scannout from cameras and
codecs. The only case the commonly fails is whenever we try to display UVC
created dmabuf, which have dirty CPU write cache and this is the type of thing
we'd like to see solved. I think this series was addressing it in principle, but
failing the import and the raised point is that this wasn't the optimal way.

There is a community project called LibreELEC, if you aren't aware, they run
Khodi with direct scanout of video stream on a wide variety of SoC and they use
the CODEC as exporter all the time. They simply don't have cases were the
opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG
does not really offer you any API to reverse the allocation.

> 
> > I fully understand your point about APIs like Vulkan not sensibly
> > allowing bracketing, and that's fine. On the other hand, a lot of
> > extant usecases (camera/codec -> GPU/display, GPU -> codec, etc) on
> > Arm just cannot fulfill complete coherency. On a lot of these
> > platforms, despite what you might think about the CPU/GPU
> > capabilities, the bottleneck is _always_ memory bandwidth, so
> > mandating extra copies is an absolute non-starter, and would instantly
> > cripple billions of devices. Lucas has been pretty gentle, but to be
> > more clear, this is not an option and won't be for at least the next
> > decade.
> 
> Well x86 pretty much has the same restrictions.
> 
> For example the scanout buffer is usually always in local memory because 
> you often scan out at up to 120Hz while your recording is only 30fps and 
> most of the time lower resolution.
> 
> Pumping all that data 120 time a second over the PCIe bus would just not 
> be doable in a lot of use cases.

This is a good point for this case, though the effect of using remote RAM in a
CODEC can be very dramatic. In some cases, the buffers you are going to display
are what we call the reference frames. That means that while you are displaying
them, the CODEC needs to read from them in order to construct the following
frames. Most of the time, reads are massively slower from remote RAM, and
over-uploading, like you describe here, is going to be the optimal path.

Note that in some other cases the buffers are so-called secondary buffers, the
output of a post processor embedded into the CODEC. In that case remote RAM may
be fine; it really depends on the write speed (which is usually quite good
compared to reads). So yes, in a high-refresh-rate case a CODEC with a post
processor may do a better job (if you have a single display path for that
buffer).

p.s. Note that the reason we support displaying the reference frames even when a
secondary buffer is possible is often that we are limited on memory bandwidth.
For, let's say, 4K60, a secondary render will require an extra 18Gb of writes (on
the best platforms; it might require equivalent reads on other platforms, such as
MediaTek, Samsung Exynos and some others).
> 
> > So we obviously need a third way at this point, because 'all devices
> > must always be coherent' vs. 'cache must be an unknown' can't work.
> > How about this as a suggestion: we have some unused flags in the PRIME
> > ioctls. Can we add a flag for 'import must be coherent'?
> 
> That's pretty much exactly what my patch set does. It just keeps 
> userspace out of the way and says that creating the initial connection 
> between the devices fails if they can't talk directly with each other.
> 
> Maybe we should move that into userspace so that the involved components 
> know of hand that a certain approach won't work?

Anything that can be predicted without trial and error is a great idea for sure.
Though we have to be realistic: there is no guaranteed way to know other than
trying, so I would not be too worried. Imho, I lacked the time to try it out, but
the current implementation should in theory make software like GStreamer fall
back to memcpy for simple cases like UVC cameras. A very simple GStreamer
pipeline most folks can run (and that has usually produced artifacts for years)
is:

gst-launch-1.0 v4l2src ! queue ! kmssink


p.s. You need to turn off any DRM master, and this may not work on a
multi-display setup; kmssink is very limited, but it does implement a memcpy
fallback if the dmabuf import fails (the problem is just that the import does not
fail).
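
For illustration only, here is a rough userspace sketch (an assumption, not code
from this series or from kmssink, with error/cleanup paths trimmed) of what such
a fallback looks like with plain libdrm: try a zero-copy PRIME import of the
dmabuf first, and memcpy into a dumb buffer owned by the display device when the
import or framebuffer creation is refused. drm_fd, dmabuf_fd, src, width, height
and pitch are assumed to be set up elsewhere, and the frame is assumed to be
32bpp XRGB.

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

/* Returns a KMS framebuffer id usable for scanout, or 0 on failure. */
static uint32_t show_frame(int drm_fd, int dmabuf_fd, const void *src,
			   uint32_t width, uint32_t height, uint32_t pitch)
{
	uint32_t handles[4] = {0}, pitches[4] = {0}, offsets[4] = {0};
	uint32_t fb_id = 0;

	/* Preferred path: zero-copy import of the dmabuf for scanout. */
	if (drmPrimeFDToHandle(drm_fd, dmabuf_fd, &handles[0]) == 0) {
		pitches[0] = pitch;
		if (drmModeAddFB2(drm_fd, width, height, DRM_FORMAT_XRGB8888,
				  handles, pitches, offsets, &fb_id, 0) == 0)
			return fb_id;
	}

	/* Fallback: copy the frame into a dumb buffer of the display device. */
	struct drm_mode_create_dumb create = {
		.width = width, .height = height, .bpp = 32,
	};
	if (drmIoctl(drm_fd, DRM_IOCTL_MODE_CREATE_DUMB, &create))
		return 0;

	struct drm_mode_map_dumb map = { .handle = create.handle };
	if (drmIoctl(drm_fd, DRM_IOCTL_MODE_MAP_DUMB, &map))
		return 0;

	void *dst = mmap(NULL, create.size, PROT_WRITE, MAP_SHARED,
			 drm_fd, map.offset);
	if (dst == MAP_FAILED)
		return 0;

	for (uint32_t y = 0; y < height; y++)
		memcpy((char *)dst + y * create.pitch,
		       (const char *)src + y * pitch, width * 4);
	munmap(dst, create.size);

	handles[0] = create.handle;
	pitches[0] = create.pitch;
	if (drmModeAddFB2(drm_fd, width, height, DRM_FORMAT_XRGB8888,
			  handles, pitches, offsets, &fb_id, 0))
		return 0;
	return fb_id;
}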

> 
> > That flag wouldn't be set for the existing ecosystem
> > Lucas/Nicolas/myself are talking about, where we have explicit
> > handover points and users are fully able to perform cache maintenance.
> > For newer APIs where it's not possible to properly express that
> > bracketing, they would always set that flag (unless we add an API
> > carve-out where the client promises to do whatever is required to
> > maintain that).
> > 
> > Would that be viable?
> 
> No, as I said. Explicit handover points are just an absolutely no-go. We 
> just have way to many use cases which don't work with that idea.
> 
> As I said we made the same mistake with the DMA-Api and even more 20 
> years later are still running into problems because of that.
> 
> Just try to run any dGPU under a XEN hypervisor with memory 
> fragmentation for a very good example why this is such a bad idea.
> 
> Regards,
> Christian.
> 
> > 
> > Cheers,
> > Daniel
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-01 21:09                   ` Nicolas Dufresne
@ 2022-11-02 11:18                     ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-02 11:18 UTC (permalink / raw)
  To: Nicolas Dufresne, Daniel Stone
  Cc: Lucas Stach, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> [SNIP]
>>> But the client is just a video player. It doesn't understand how to
>>> allocate BOs for Panfrost or AMD or etnaviv. So without a universal
>>> allocator (again ...), 'just allocate on the GPU' isn't a useful
>>> response to the client.
>> Well exactly that's the point I'm raising: The client *must* understand
>> that!
>>
>> See we need to be able to handle all restrictions here, coherency of the
>> data is just one of them.
>>
>> For example the much more important question is the location of the data
>> and for this allocating from the V4L2 device is in most cases just not
>> going to fly.
> It feels like this is a generic statement and there is no reason it could not be
> the other way around.

And exactly that's my point. You always need to look at both ways of sharing the 
buffer and can't assume that one will always work.

As far as I can see, you guys just allocate a buffer from a V4L2 
device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys haven't pushed 
back on this practice already.

This only works because the Wayland as well as the X display pipeline is 
smart enough to insert an extra copy when it finds that an imported 
buffer can't be used as a framebuffer directly.

>   I have colleague who integrated PCIe CODEC (Blaize Xplorer
> X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways
> to use it. Of course, in current state of DMABuf, you have to be an exporter to
> do anything fancy, but it did not have to be like this, its a design choice. I'm
> not sure in the end what was the final method used, the driver isn't yet
> upstream, so maybe that is not even final. What I know is that there is various
> condition you may use the CODEC for which the optimal location will vary. As an
> example, using the post processor or not, see my next comment for more details.

Yeah, and stuff like this has already been discussed multiple times. Local 
memory of a device can only be made available by the exporter, not the 
importer.

So in the case of a separate camera and encoder you run into exactly the 
same limitation: some devices need the allocation to happen on the 
camera while others need it on the encoder.

>> The more common case is that you need to allocate from the GPU and then
>> import that into the V4L2 device. The background is that all dGPUs I
>> know of need the data inside local memory (VRAM) to be able to scan out
>> from it.
> The reality is that what is common to you, might not be to others. In my work,
> most ARM SoC have display that just handle direct scannout from cameras and
> codecs.

> The only case the commonly fails is whenever we try to display UVC
> created dmabuf,

Well, exactly that is not correct! The whole x86 use case of direct 
display for dGPUs is broken because media players think they can do the 
simple thing and offload all the problematic cases to the display server.

This is absolutely *not* the common use case you describe here, but 
rather something completely special to ARM.

>   which have dirty CPU write cache and this is the type of thing
> we'd like to see solved. I think this series was addressing it in principle, but
> failing the import and the raised point is that this wasn't the optimal way.
>
> There is a community project called LibreELEC, if you aren't aware, they run
> Khodi with direct scanout of video stream on a wide variety of SoC and they use
> the CODEC as exporter all the time. They simply don't have cases were the
> opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG
> does not really offer you any API to reverse the allocation.

Ok, let me try to explain it once more. It sounds like I wasn't able to 
get my point through.

That we haven't heard anybody screaming that x86 doesn't work is just 
because we handle the case that a buffer isn't directly displayable in 
X/Wayland anyway, but this is absolutely not the optimal solution.

The argument that you want to keep the allocation on the codec side is 
completely false as far as I can see.

We already had numerous projects where we reported this practice as a bug 
to the GStreamer and FFmpeg projects because it won't work on x86 with dGPUs.

This is just a software solution which works because of coincidence and 
not because of engineering.

Regards,
Christian.



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-02 11:18                     ` Christian König
@ 2022-11-02 11:39                       ` Lucas Stach
  -1 siblings, 0 replies; 110+ messages in thread
From: Lucas Stach @ 2022-11-02 11:39 UTC (permalink / raw)
  To: Christian König, Nicolas Dufresne, Daniel Stone
  Cc: ppaalanen, sumit.semwal, daniel, robdclark, dri-devel,
	linaro-mm-sig, linux-media

Hi Christian,

going to reply in more detail when I have some more time, so just some
quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > [SNIP]
> > > > But the client is just a video player. It doesn't understand how to
> > > > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > > > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > > > response to the client.
> > > Well exactly that's the point I'm raising: The client *must* understand
> > > that!
> > > 
> > > See we need to be able to handle all restrictions here, coherency of the
> > > data is just one of them.
> > > 
> > > For example the much more important question is the location of the data
> > > and for this allocating from the V4L2 device is in most cases just not
> > > going to fly.
> > It feels like this is a generic statement and there is no reason it could not be
> > the other way around.
> 
> And exactly that's my point. You always need to look at both ways to 
> share the buffer and can't assume that one will always work.
> 
> As far as I can see it you guys just allocate a buffer from a V4L2 
> device, fill it with data and send it to Wayland for displaying.
> 
> To be honest I'm really surprised that the Wayland guys hasn't pushed 
> back on this practice already.
> 
> This only works because the Wayland as well as X display pipeline is 
> smart enough to insert an extra copy when it find that an imported 
> buffer can't be used as a framebuffer directly.
> 
With bracketed access you could even make this case work, as the dGPU
would be able to slurp a copy of the dma-buf into LMEM for scanout.
 
> >   I have colleague who integrated PCIe CODEC (Blaize Xplorer
> > X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways
> > to use it. Of course, in current state of DMABuf, you have to be an exporter to
> > do anything fancy, but it did not have to be like this, its a design choice. I'm
> > not sure in the end what was the final method used, the driver isn't yet
> > upstream, so maybe that is not even final. What I know is that there is various
> > condition you may use the CODEC for which the optimal location will vary. As an
> > example, using the post processor or not, see my next comment for more details.
> 
> Yeah, and stuff like this was already discussed multiple times. Local 
> memory of devices can only be made available by the exporter, not the 
> importer.
> 
> So in the case of separated camera and encoder you run into exactly the 
> same limitation that some device needs the allocation to happen on the 
> camera while others need it on the encoder.
> 
> > > The more common case is that you need to allocate from the GPU and then
> > > import that into the V4L2 device. The background is that all dGPUs I
> > > know of need the data inside local memory (VRAM) to be able to scan out
> > > from it.
> > The reality is that what is common to you, might not be to others. In my work,
> > most ARM SoC have display that just handle direct scannout from cameras and
> > codecs.
> 
> > The only case the commonly fails is whenever we try to display UVC
> > created dmabuf,
> 
> Well, exactly that's not correct! The whole x86 use cases of direct 
> display for dGPUs are broken because media players think they can do the 
> simple thing and offload all the problematic cases to the display server.
> 
> This is absolutely *not* the common use case you describe here, but 
> rather something completely special to ARM.

It is the normal case for a lot of ARM SoCs. That world is certainly not
any smaller than the x86 dGPU world; a huge number of devices are ARM-based
set-top boxes and other video players. Just because it is a special case
for you doesn't mean it's a global special case.

> 
> >   which have dirty CPU write cache and this is the type of thing
> > we'd like to see solved. I think this series was addressing it in principle, but
> > failing the import and the raised point is that this wasn't the optimal way.
> > 
> > There is a community project called LibreELEC, if you aren't aware, they run
> > Khodi with direct scanout of video stream on a wide variety of SoC and they use
> > the CODEC as exporter all the time. They simply don't have cases were the
> > opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG
> > does not really offer you any API to reverse the allocation.
> 
> Ok, let me try to explain it once more. It sounds like I wasn't able to 
> get my point through.
> 
> That we haven't heard anybody screaming that x86 doesn't work is just 
> because we handle the case that a buffer isn't directly displayable in 
> X/Wayland anyway, but this is absolutely not the optimal solution.
> 
> The argument that you want to keep the allocation on the codec side is 
> completely false as far as I can see.
> 
> We already had numerous projects where we reported this practice as bugs 
> to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.
> 
And on a lot of ARM SoCs it's exactly the right thing to do. Many
codecs need contiguous memory there, so importing a scatter-gather
buffer from the GPU via dma-buf will simply not work.

> This is just a software solution which works because of coincident and 
> not because of engineering.

By mandating a software fallback for the cases where you would need
bracketed access to the dma-buf, you simply shift the problem into
userspace. Userspace then creates the bracket by falling back to some
other import option that mostly does a copy and then the appropriate
cache maintenance.

While I understand your sentiment about the DMA-API design being
inconvenient when things are just coherent by system design, the DMA-
API design wasn't done this way due to bad engineering, but due to the
fact that performant DMA access on some systems simply requires this
kind of bracketing.
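
To make that concrete, here is a minimal sketch of such a bracket around a CPU
write, using the existing DMA_BUF_IOCTL_SYNC uapi. This is an assumption about
how userspace would do it (error handling omitted, and it presumes the exporter
supports CPU mmap), not code from any particular project:

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/dma-buf.h>

static void cpu_fill_dmabuf(int dmabuf_fd, const void *src, size_t size)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
	};
	void *dst = mmap(NULL, size, PROT_WRITE, MAP_SHARED, dmabuf_fd, 0);

	/* Begin the bracket: lets the exporter do cache maintenance first. */
	ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

	memcpy(dst, src, size);

	/* End the bracket: caches are cleaned before a device reads again. */
	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

	munmap(dst, size);
}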

Regards,
Lucas


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-02 11:18                     ` Christian König
@ 2022-11-02 12:19                       ` Pekka Paalanen
  -1 siblings, 0 replies; 110+ messages in thread
From: Pekka Paalanen @ 2022-11-02 12:19 UTC (permalink / raw)
  To: Christian König
  Cc: Nicolas Dufresne, Daniel Stone, Lucas Stach, sumit.semwal,
	daniel, robdclark, dri-devel, linaro-mm-sig, linux-media

[-- Attachment #1: Type: text/plain, Size: 3006 bytes --]

On Wed, 2 Nov 2022 12:18:01 +0100
Christian König <ckoenig.leichtzumerken@gmail.com> wrote:

> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > [SNIP]  
> >>> But the client is just a video player. It doesn't understand how to
> >>> allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> >>> allocator (again ...), 'just allocate on the GPU' isn't a useful
> >>> response to the client.  
> >> Well exactly that's the point I'm raising: The client *must* understand
> >> that!
> >>
> >> See we need to be able to handle all restrictions here, coherency of the
> >> data is just one of them.
> >>
> >> For example the much more important question is the location of the data
> >> and for this allocating from the V4L2 device is in most cases just not
> >> going to fly.  
> > It feels like this is a generic statement and there is no reason it could not be
> > the other way around.  
> 
> And exactly that's my point. You always need to look at both ways to 
> share the buffer and can't assume that one will always work.
> 
> As far as I can see it you guys just allocate a buffer from a V4L2 
> device, fill it with data and send it to Wayland for displaying.
> 
> To be honest I'm really surprised that the Wayland guys hasn't pushed 
> back on this practice already.

What should we Wayland people be pushing back on exactly? And where is
our authority and opportunity to do so?

The Wayland protocol dmabuf extension allows a graceful failure if the
Wayland compositor cannot use the given dmabuf at all, giving the
client an opportunity to try something else. The Wayland protocol also
tells clients which DRM rendering device, at minimum, the dmabuf needs to
be compatible with. It even tells which KMS device the dmabuf could be
put on for direct scanout, if the dmabuf were suitable for that and direct
scanout were otherwise possible.

What the client (application) does with all that information is up to
the client. That code is not part of Wayland.
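
Purely as an illustration of those hints (an assumption of how a client could
consume them, using the wayland-scanner generated linux-dmabuf-unstable-v1
version 4 headers; the events not needed for the sketch are left as no-ops):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <wayland-client.h>
#include "linux-dmabuf-unstable-v1-client-protocol.h"

static void fb_main_device(void *data,
			   struct zwp_linux_dmabuf_feedback_v1 *fb,
			   struct wl_array *device)
{
	dev_t dev;
	memcpy(&dev, device->data, sizeof(dev));
	/* Minimum requirement: allocate so this DRM device can import it. */
	printf("main device: %lx\n", (unsigned long)dev);
}

static void fb_tranche_target_device(void *data,
				     struct zwp_linux_dmabuf_feedback_v1 *fb,
				     struct wl_array *device)
{
	dev_t dev;
	memcpy(&dev, device->data, sizeof(dev));
	/* Preferred device for this tranche, e.g. the KMS scanout node. */
	printf("tranche target device: %lx\n", (unsigned long)dev);
}

static void fb_tranche_flags(void *data,
			     struct zwp_linux_dmabuf_feedback_v1 *fb,
			     uint32_t flags)
{
	if (flags & ZWP_LINUX_DMABUF_FEEDBACK_V1_TRANCHE_FLAGS_SCANOUT)
		printf("this tranche can hit direct scanout\n");
}

/* Remaining events are irrelevant for this sketch. */
static void fb_done(void *d, struct zwp_linux_dmabuf_feedback_v1 *fb) {}
static void fb_format_table(void *d, struct zwp_linux_dmabuf_feedback_v1 *fb,
			    int32_t fd, uint32_t size) {}
static void fb_tranche_done(void *d, struct zwp_linux_dmabuf_feedback_v1 *fb) {}
static void fb_tranche_formats(void *d, struct zwp_linux_dmabuf_feedback_v1 *fb,
			       struct wl_array *indices) {}

static const struct zwp_linux_dmabuf_feedback_v1_listener feedback_listener = {
	.done = fb_done,
	.format_table = fb_format_table,
	.main_device = fb_main_device,
	.tranche_done = fb_tranche_done,
	.tranche_target_device = fb_tranche_target_device,
	.tranche_formats = fb_tranche_formats,
	.tranche_flags = fb_tranche_flags,
};

/* Usage (dmabuf is the bound zwp_linux_dmabuf_v1 global, version >= 4):
 *	zwp_linux_dmabuf_feedback_v1_add_listener(
 *		zwp_linux_dmabuf_v1_get_default_feedback(dmabuf),
 *		&feedback_listener, NULL);
 * followed by a wl_display_roundtrip(). */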

I'm sure we would be happy to add protocol for anything that
https://github.com/cubanismo/allocator needs to become the universal
optimal buffer allocator library.


> This only works because the Wayland as well as X display pipeline is 
> smart enough to insert an extra copy when it find that an imported 
> buffer can't be used as a framebuffer directly.

The only fallback Wayland compositors tend to do is use OpenGL/Vulkan
rendering for composition if direct scanout on a KMS plane does not
work. There are many reasons why direct scanout may not be possible in
addition to hardware and drivers not agreeing to do it with the given
set of buffers.

A general purpose (read: desktop) Wayland compositor simply cannot live
without being able to fall back from KMS composition to software/GPU
composition.

But yes, there are use cases where that fallback is as good as failing
completely. Those are not desktops but more like set-top-boxes and TVs.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-02 11:39                       ` Lucas Stach
@ 2022-11-02 12:21                         ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-02 12:21 UTC (permalink / raw)
  To: Lucas Stach, Nicolas Dufresne, Daniel Stone
  Cc: ppaalanen, sumit.semwal, daniel, robdclark, dri-devel,
	linaro-mm-sig, linux-media

Hi Lucas,

Am 02.11.22 um 12:39 schrieb Lucas Stach:
> Hi Christian,
>
> going to reply in more detail when I have some more time, so just some
> quick thoughts for now.
>
> Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
>> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
>>> [SNIP]
>> As far as I can see it you guys just allocate a buffer from a V4L2
>> device, fill it with data and send it to Wayland for displaying.
>>
>> To be honest I'm really surprised that the Wayland guys hasn't pushed
>> back on this practice already.
>>
>> This only works because the Wayland as well as X display pipeline is
>> smart enough to insert an extra copy when it find that an imported
>> buffer can't be used as a framebuffer directly.
>>
> With bracketed access you could even make this case work, as the dGPU
> would be able to slurp a copy of the dma-buf into LMEM for scanout.

Well, this copy is what we are trying to avoid here. The codec should 
pump the data into LMEM in the first place.

>>> The only case the commonly fails is whenever we try to display UVC
>>> created dmabuf,
>> Well, exactly that's not correct! The whole x86 use cases of direct
>> display for dGPUs are broken because media players think they can do the
>> simple thing and offload all the problematic cases to the display server.
>>
>> This is absolutely *not* the common use case you describe here, but
>> rather something completely special to ARM.
> It the normal case for a lot of ARM SoCs.

Yeah, but it's not the normal case for everybody.

We had numerous projects where customers wanted to pump video data 
directly from a decoder into a GPU frame or from a GPU frame into an 
encoder.

The fact that media frameworks don't support that out of the box is 
simply a bug.

> That world is certainly not
> any less big than the x86 dGPU world. A huge number of devices are ARM
> based set-top boxes and other video players. Just because it is a
> special case for you doesn't mean it's a global special case.

Ok, let's stop with that. This isn't helpful in the technical discussion.

>
>> That we haven't heard anybody screaming that x86 doesn't work is just
>> because we handle the case that a buffer isn't directly displayable in
>> X/Wayland anyway, but this is absolutely not the optimal solution.
>>
>> The argument that you want to keep the allocation on the codec side is
>> completely false as far as I can see.
>>
>> We already had numerous projects where we reported this practice as bugs
>> to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.
>>
> And on a lot of ARM SoCs it's exactly the right thing to do.

Yeah, and that's fine, it just doesn't seem to work in all cases.

For both x86 and the case here, where the CPU cache might be dirty, the 
exporter needs to be the device with the requirements.

For x86 dGPUs the requirement is that the backing store is some local 
memory. For the non-coherent ARM devices it's that the CPU cache is not dirty.

For a device driver which solely works with cached system memory, 
inserting cache flush operations is something it would never do for 
itself. It would just be doing this for the importer, and exactly that 
would be bad design because we then have handling for the display driver 
outside of that driver.

>> This is just a software solution which works because of coincident and
>> not because of engineering.
> By mandating a software fallback for the cases where you would need
> bracketed access to the dma-buf, you simply shift the problem into
> userspace. Userspace then creates the bracket by falling back to some
> other import option that mostly do a copy and then the appropriate
> cache maintenance.
>
> While I understand your sentiment about the DMA-API design being
> inconvenient when things are just coherent by system design, the DMA-
> API design wasn't done this way due to bad engineering, but due to the
> fact that performant DMA access on some systems just require this kind
> of bracketing.

Well, this is exactly what I'm criticizing about the DMA-API. Instead of 
giving you a proper error code when something won't work in a specific 
way, it just tries to hide the requirements inside the DMA layer.

For example, when your device can only access 32-bit addresses the DMA-API 
transparently inserts bounce buffers instead of giving you a proper error 
code saying that the memory in question can't be accessed.

This just tries to hide the underlying problem instead of pushing it 
into the upper layer, where it can be handled much more gracefully.
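
As a hedged illustration of that point (a driver-side sketch, not code from this
series): with a 32-bit mask the mapping below still "succeeds"; the DMA layer may
silently bounce the data through SWIOTLB instead of reporting that the original
memory is out of reach for the device.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int example_map_for_device(struct device *dev, void *cpu_buf,
				  size_t size)
{
	dma_addr_t addr;

	if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32)))
		return -EIO;

	/* May transparently use a bounce buffer; the caller never knows. */
	addr = dma_map_single(dev, cpu_buf, size, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, addr))
		return -ENOMEM;

	/* ... program the device with addr ... */

	dma_unmap_single(dev, addr, size, DMA_TO_DEVICE);
	return 0;
}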

Regards,
Christian.

>
> Regards,
> Lucas
>


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-02 12:19                       ` Pekka Paalanen
@ 2022-11-02 12:27                         ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-02 12:27 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Nicolas Dufresne, Daniel Stone, Lucas Stach, sumit.semwal,
	daniel, robdclark, dri-devel, linaro-mm-sig, linux-media

Am 02.11.22 um 13:19 schrieb Pekka Paalanen:
> On Wed, 2 Nov 2022 12:18:01 +0100
> Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
>
>> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
>>> [SNIP]
>>>>> But the client is just a video player. It doesn't understand how to
>>>>> allocate BOs for Panfrost or AMD or etnaviv. So without a universal
>>>>> allocator (again ...), 'just allocate on the GPU' isn't a useful
>>>>> response to the client.
>>>> Well exactly that's the point I'm raising: The client *must* understand
>>>> that!
>>>>
>>>> See we need to be able to handle all restrictions here, coherency of the
>>>> data is just one of them.
>>>>
>>>> For example the much more important question is the location of the data
>>>> and for this allocating from the V4L2 device is in most cases just not
>>>> going to fly.
>>> It feels like this is a generic statement and there is no reason it could not be
>>> the other way around.
>> And exactly that's my point. You always need to look at both ways to
>> share the buffer and can't assume that one will always work.
>>
>> As far as I can see it you guys just allocate a buffer from a V4L2
>> device, fill it with data and send it to Wayland for displaying.
>>
>> To be honest I'm really surprised that the Wayland guys hasn't pushed
>> back on this practice already.
> What should we Wayland people be pushing back on exactly? And where is
> our authority and opportunity to do so?
>
> The Wayland protocol dmabuf extension allows a graceful failure if the
> Wayland compositor cannot use the given dmabuf at all, giving the
> client an opportunity to try something else.

That's exactly what I meant by pushing back :)

I wasn't aware that this handling is already implemented.

> The Wayland protocol also
> tells clients which DRM rendering device at minimum the dmabuf needs to
> be compatible with. It even tells which KMS device the dmabuf could be
> put on direct scanout if the dmabuf was suitable for that and direct
> scanout is otherwise possible.

Yeah, perfect. Exactly that's what's needed here.

> What the client (application) does with all that information is up to
> the client. That code is not part of Wayland.
>
> I'm sure we would be happy to add protocol for anything that
> https://github.com/cubanismo/allocator needs to become the universal
> optimal buffer allocator library.

From what you wrote it's already perfectly covered.

>> This only works because the Wayland as well as X display pipeline is
>> smart enough to insert an extra copy when it find that an imported
>> buffer can't be used as a framebuffer directly.
> The only fallback Wayland compositors tend to do is use OpenGL/Vulkan
> rendering for composition if direct scanout on a KMS plane does not
> work. There are many reasons why direct scanout may not be possible in
> addition to hardware and drivers not agreeing to do it with the given
> set of buffers.
>
> A general purpose (read: desktop) Wayland compositor simply cannot live
> without being able to fall back from KMS composition to software/GPU
> composition.
>
> But yes, there are use cases where that fallback is as good as failing
> completely. Those are not desktops but more like set-top-boxes and TVs.

Completely agree with this approach.

The only problem is that media players tend not to implement a way to 
allow direct scanout because of those fallback paths.

But, as you said, that's their decision.

Thanks,
Christian.

>
>
> Thanks,
> pq


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-02 12:27                         ` Christian König
@ 2022-11-02 12:50                           ` Pekka Paalanen
  -1 siblings, 0 replies; 110+ messages in thread
From: Pekka Paalanen @ 2022-11-02 12:50 UTC (permalink / raw)
  To: Christian König
  Cc: Nicolas Dufresne, Daniel Stone, Lucas Stach, sumit.semwal,
	daniel, robdclark, dri-devel, linaro-mm-sig, linux-media

[-- Attachment #1: Type: text/plain, Size: 4632 bytes --]

On Wed, 2 Nov 2022 13:27:18 +0100
Christian König <ckoenig.leichtzumerken@gmail.com> wrote:

> Am 02.11.22 um 13:19 schrieb Pekka Paalanen:
> > On Wed, 2 Nov 2022 12:18:01 +0100
> > Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
> >  
> >> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:  
> >>> [SNIP]  
> >>>>> But the client is just a video player. It doesn't understand how to
> >>>>> allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> >>>>> allocator (again ...), 'just allocate on the GPU' isn't a useful
> >>>>> response to the client.  
> >>>> Well exactly that's the point I'm raising: The client *must* understand
> >>>> that!
> >>>>
> >>>> See we need to be able to handle all restrictions here, coherency of the
> >>>> data is just one of them.
> >>>>
> >>>> For example the much more important question is the location of the data
> >>>> and for this allocating from the V4L2 device is in most cases just not
> >>>> going to fly.  
> >>> It feels like this is a generic statement and there is no reason it could not be
> >>> the other way around.  
> >> And exactly that's my point. You always need to look at both ways to
> >> share the buffer and can't assume that one will always work.
> >>
> >> As far as I can see it you guys just allocate a buffer from a V4L2
> >> device, fill it with data and send it to Wayland for displaying.
> >>
> >> To be honest I'm really surprised that the Wayland guys hasn't pushed
> >> back on this practice already.  
> > What should we Wayland people be pushing back on exactly? And where is
> > our authority and opportunity to do so?
> >
> > The Wayland protocol dmabuf extension allows a graceful failure if the
> > Wayland compositor cannot use the given dmabuf at all, giving the
> > client an opportunity to try something else.  
> 
> That's exactly what I meant with pushing back :)
> 
> I wasn't aware that this handling is already implemented.

Well... it's "optional". The Wayland dmabuf protocol has two cases for
a client/app:

a) you do the right thing and wait for the compositor to ack that the
   dmabuf works (the reply should come pretty much immediately, not
   needing the compositor to actually composite), or

b) you just send the dmabuf and continue as if it always worked. If it
   doesn't, you might get a protocol error later and be disconnected.

Guess which one Mesa uses?

I've been told Mesa has no way to handle a failure there, so it
doesn't. I would not be surprised if others just copy that behaviour.

I recall personally being against adding option b) to begin with, but
there it is, added for Mesa IIRC.
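
For reference, a rough client-side sketch of option a) (hedged: it assumes the
wayland-scanner generated linux-dmabuf-unstable-v1 headers and an already bound
zwp_linux_dmabuf_v1 global; option b) would just call create_immed instead and
never learn about the failure):

#include <stddef.h>
#include <stdint.h>
#include <wayland-client.h>
#include "linux-dmabuf-unstable-v1-client-protocol.h"

struct buffer_result {
	struct wl_buffer *buffer;
	int failed;
};

static void params_created(void *data,
			   struct zwp_linux_buffer_params_v1 *params,
			   struct wl_buffer *buffer)
{
	struct buffer_result *res = data;
	res->buffer = buffer;	/* compositor accepted the dmabuf */
}

static void params_failed(void *data,
			  struct zwp_linux_buffer_params_v1 *params)
{
	struct buffer_result *res = data;
	res->failed = 1;	/* graceful failure: try another allocation */
}

static const struct zwp_linux_buffer_params_v1_listener params_listener = {
	.created = params_created,
	.failed = params_failed,
};

static void try_dmabuf(struct wl_display *display,
		       struct zwp_linux_dmabuf_v1 *dmabuf, int fd,
		       uint32_t width, uint32_t height, uint32_t stride,
		       uint32_t fourcc, uint64_t modifier,
		       struct buffer_result *res)
{
	struct zwp_linux_buffer_params_v1 *params =
		zwp_linux_dmabuf_v1_create_params(dmabuf);

	zwp_linux_buffer_params_v1_add_listener(params, &params_listener, res);
	zwp_linux_buffer_params_v1_add(params, fd, 0, 0, stride,
				       modifier >> 32, modifier & 0xffffffff);
	zwp_linux_buffer_params_v1_create(params, width, height, fourcc, 0);

	/* Wait for either "created" or "failed" before committing a surface. */
	while (!res->buffer && !res->failed)
		wl_display_roundtrip(display);
}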


> > The Wayland protocol also
> > tells clients which DRM rendering device at minimum the dmabuf needs to
> > be compatible with. It even tells which KMS device the dmabuf could be
> > put on direct scanout if the dmabuf was suitable for that and direct
> > scanout is otherwise possible.  
> 
> Yeah, perfect. Exactly that's what's needed here.
> 
> > What the client (application) does with all that information is up to
> > the client. That code is not part of Wayland.
> >
> > I'm sure we would be happy to add protocol for anything that
> > https://github.com/cubanismo/allocator needs to become the universal
> > optimal buffer allocator library.  
> 
>  From what you wrote it's already perfectly covered.

Cool. Now if clients would just use it...


Thanks,
pq

> >> This only works because the Wayland as well as X display pipeline is
> >> smart enough to insert an extra copy when it find that an imported
> >> buffer can't be used as a framebuffer directly.  
> > The only fallback Wayland compositors tend to do is use OpenGL/Vulkan
> > rendering for composition if direct scanout on a KMS plane does not
> > work. There are many reasons why direct scanout may not be possible in
> > addition to hardware and drivers not agreeing to do it with the given
> > set of buffers.
> >
> > A general purpose (read: desktop) Wayland compositor simply cannot live
> > without being able to fall back from KMS composition to software/GPU
> > composition.
> >
> > But yes, there are use cases where that fallback is as good as failing
> > completely. Those are not desktops but more like set-top-boxes and TVs.  
> 
> Completely agree to this approach.
> 
> The only problem is that media players tend to not implement a way to 
> allow direct scanout because of those fallback paths.
> 
> But as you said that's their decision.
> 
> Thanks,
> Christian.
> 
> >
> >
> > Thanks,
> > pq  
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 110+ messages in thread


* Re: Try to address the DMA-buf coherency problem
  2022-11-02 12:50                           ` Pekka Paalanen
@ 2022-11-02 12:56                             ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-02 12:56 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Nicolas Dufresne, Daniel Stone, Lucas Stach, sumit.semwal,
	daniel, robdclark, dri-devel, linaro-mm-sig, linux-media

On 02.11.22 at 13:50, Pekka Paalanen wrote:
> On Wed, 2 Nov 2022 13:27:18 +0100
> Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
>
>> Am 02.11.22 um 13:19 schrieb Pekka Paalanen:
>>> On Wed, 2 Nov 2022 12:18:01 +0100
>>> Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
>>>   
>>>> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
>>>>> [SNIP]
>>>>>>> But the client is just a video player. It doesn't understand how to
>>>>>>> allocate BOs for Panfrost or AMD or etnaviv. So without a universal
>>>>>>> allocator (again ...), 'just allocate on the GPU' isn't a useful
>>>>>>> response to the client.
>>>>>> Well exactly that's the point I'm raising: The client *must* understand
>>>>>> that!
>>>>>>
>>>>>> See we need to be able to handle all restrictions here, coherency of the
>>>>>> data is just one of them.
>>>>>>
>>>>>> For example the much more important question is the location of the data
>>>>>> and for this allocating from the V4L2 device is in most cases just not
>>>>>> going to fly.
>>>>> It feels like this is a generic statement and there is no reason it could not be
>>>>> the other way around.
>>>> And exactly that's my point. You always need to look at both ways to
>>>> share the buffer and can't assume that one will always work.
>>>>
>>>> As far as I can see it you guys just allocate a buffer from a V4L2
>>>> device, fill it with data and send it to Wayland for displaying.
>>>>
>>>> To be honest I'm really surprised that the Wayland guys hasn't pushed
>>>> back on this practice already.
>>> What should we Wayland people be pushing back on exactly? And where is
>>> our authority and opportunity to do so?
>>>
>>> The Wayland protocol dmabuf extension allows a graceful failure if the
>>> Wayland compositor cannot use the given dmabuf at all, giving the
>>> client an opportunity to try something else.
>> That's exactly what I meant with pushing back :)
>>
>> I wasn't aware that this handling is already implemented.
> Well... it's "optional". The Wayland dmabuf protocol has two cases for
> a client/app:
>
> a) you do the right thing and wait for the compositor to ack that the
>     dmabuf works (the reply should come pretty much immediately, not
>     needing the compositor to actually composite), or
>
> b) you just send the dmabuf and continue as if it always worked. If it
>     doesn't, you might get a protocol error later and be disconnected.
>
> Guess which one Mesa uses?
>
> I've been told Mesa has no way to handle a failure there, so it
> doesn't. I would not be surprised if others just copy that behaviour.
>
> I recall personally being against adding option b) to begin with, but
> there it is, added for Mesa IIRC.

Well, I'm not sure if those projects actually used X or Wayland, but we 
did have cases where this was really used.

IIRC, in the case of PRIME offloading we now allocate a buffer from the 
displaying device in Mesa and do the copy from the rendering GPU to the 
displaying GPU on the client side.

The background is that this gave us much better control over which engine 
and parameters were used for the copy, resulting in a nice performance 
improvement.
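
As a rough illustration of that client-side flow (this is not the actual Mesa code; the device path, the fixed 1920x1080 XRGB8888 buffer and the missing error handling are all assumptions):

#include <fcntl.h>
#include <unistd.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

static EGLImageKHR import_display_buffer(EGLDisplay egl_dpy)
{
        /* 1. Allocate scanout-capable memory from the *display* device. */
        int display_fd = open("/dev/dri/card0", O_RDWR | O_CLOEXEC);
        struct drm_mode_create_dumb creq = {
                .height = 1080, .width = 1920, .bpp = 32,
        };
        drmIoctl(display_fd, DRM_IOCTL_MODE_CREATE_DUMB, &creq);

        /* 2. Export it as a dma-buf fd. */
        int dmabuf_fd = -1;
        drmPrimeHandleToFD(display_fd, creq.handle, DRM_CLOEXEC, &dmabuf_fd);

        /* 3. Import it into the render GPU via EGL_EXT_image_dma_buf_import,
         *    bind it to an FBO and blit the rendered frame into it, so the
         *    copy runs on the engine the client chooses. */
        PFNEGLCREATEIMAGEKHRPROC create_image =
                (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
        EGLint attrs[] = {
                EGL_WIDTH, 1920,
                EGL_HEIGHT, 1080,
                EGL_LINUX_DRM_FOURCC_EXT, DRM_FORMAT_XRGB8888,
                EGL_DMA_BUF_PLANE0_FD_EXT, dmabuf_fd,
                EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
                EGL_DMA_BUF_PLANE0_PITCH_EXT, (EGLint)creq.pitch,
                EGL_NONE
        };
        return create_image(egl_dpy, EGL_NO_CONTEXT,
                            EGL_LINUX_DMA_BUF_EXT, NULL, attrs);
}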

Regards,
Christian.

^ permalink raw reply	[flat|nested] 110+ messages in thread


* Re: Try to address the DMA-buf coherency problem
  2022-11-02 12:21                         ` Christian König
@ 2022-11-02 17:10                           ` Lucas Stach
  -1 siblings, 0 replies; 110+ messages in thread
From: Lucas Stach @ 2022-11-02 17:10 UTC (permalink / raw)
  To: Christian König, Nicolas Dufresne, Daniel Stone
  Cc: ppaalanen, dri-devel, linaro-mm-sig, sumit.semwal, linux-media

On Wednesday, 02.11.2022 at 13:21 +0100, Christian König wrote:
> Hi Lucas,
> 
> Am 02.11.22 um 12:39 schrieb Lucas Stach:
> > Hi Christian,
> > 
> > going to reply in more detail when I have some more time, so just some
> > quick thoughts for now.
> > 
> > Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
> > > Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > > > [SNIP]
> > > As far as I can see it you guys just allocate a buffer from a V4L2
> > > device, fill it with data and send it to Wayland for displaying.
> > > 
> > > To be honest I'm really surprised that the Wayland guys hasn't pushed
> > > back on this practice already.
> > > 
> > > This only works because the Wayland as well as X display pipeline is
> > > smart enough to insert an extra copy when it find that an imported
> > > buffer can't be used as a framebuffer directly.
> > > 
> > With bracketed access you could even make this case work, as the dGPU
> > would be able to slurp a copy of the dma-buf into LMEM for scanout.
> 
> Well, this copy is what we are trying to avoid here. The codec should 
> pump the data into LMEM in the first place.
> 
That's very use-case specific. If the codec uses those frames as
reference frames for further coding operations, allocating them in the LMEM
of a remote device doesn't sound like a clever idea. But then I think
this is a whole different discussion on its own.

> > > > The only case the commonly fails is whenever we try to display UVC
> > > > created dmabuf,
> > > Well, exactly that's not correct! The whole x86 use cases of direct
> > > display for dGPUs are broken because media players think they can do the
> > > simple thing and offload all the problematic cases to the display server.
> > > 
> > > This is absolutely *not* the common use case you describe here, but
> > > rather something completely special to ARM.
> > It the normal case for a lot of ARM SoCs.
> 
> Yeah, but it's not the normal case for everybody.
> 
> We had numerous projects where customers wanted to pump video data 
> directly from a decoder into an GPU frame or from a GPU frame into an 
> encoder.
> 
> The fact that media frameworks doesn't support that out of the box is 
> simply a bug.
> 
> > That world is certainly not
> > any less big than the x86 dGPU world. A huge number of devices are ARM
> > based set-top boxes and other video players. Just because it is a
> > special case for you doesn't mean it's a global special case.
> 
> Ok, let's stop with that. This isn't helpful in the technical discussion.
> 
Agreed. I think we should try to understand the fact that there are
very different world-views and use-cases for dma-buf right now, and that
breaking one or the other is simply not an option. What may be an obscure
special case on x86 may be very common on ARM, and vice versa.

> > 
> > > That we haven't heard anybody screaming that x86 doesn't work is just
> > > because we handle the case that a buffer isn't directly displayable in
> > > X/Wayland anyway, but this is absolutely not the optimal solution.
> > > 
> > > The argument that you want to keep the allocation on the codec side is
> > > completely false as far as I can see.
> > > 
> > > We already had numerous projects where we reported this practice as bugs
> > > to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.
> > > 
> > And on a lot of ARM SoCs it's exactly the right thing to do.
> 
> Yeah and that's fine, it just doesn't seem to work in all cases.
> 
> For both x86 as well as the case here that the CPU cache might be dirty 
> the exporter needs to be the device with the requirements.
> 
> For x86 dGPUs that's the backing store is some local memory. For the 
> non-coherent ARM devices it's that the CPU cache is not dirty.
> 
> For a device driver which solely works with cached system memory 
> inserting cache flush operations is something it would never do for 
> itself. 
> 
It's exactly what a device driver working with cacheable memory buffers
on an architecture with non-coherent DMA masters would do. In fact it's
the main reason why the bracketing with explicit ownership transfer
exists in the DMA API: to allow the DMA API implementation to do the
necessary cache maintenance operations on architectures that need them.
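
For illustration, that bracketing pattern looks roughly like the following in a driver using the streaming DMA API (buffer setup, error handling and the process_packet() helper are placeholders):

#include <linux/dma-mapping.h>

extern void process_packet(void *buf, size_t size);  /* placeholder */

static void receive_one_buffer(struct device *dev, void *buf, size_t size)
{
        /* Hand ownership to the device; the arch code may write back or
         * invalidate CPU caches here on non-coherent systems. */
        dma_addr_t addr = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);

        /* ... the device DMAs into the buffer ... */

        /* Take ownership back before the CPU looks at the data. */
        dma_sync_single_for_cpu(dev, addr, size, DMA_FROM_DEVICE);
        process_packet(buf, size);

        /* Give ownership back to the device for the next transfer. */
        dma_sync_single_for_device(dev, addr, size, DMA_FROM_DEVICE);

        /* ... */

        dma_unmap_single(dev, addr, size, DMA_FROM_DEVICE);
}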

> It would just be doing this for the importer and exactly that 
> would be bad design because we then have handling for the display driver 
> outside of the driver.
> 
The driver would have to do those cache maintenance operations if it
directly worked with a non-coherent device. Doing it for the importer
is just doing it for another device, not the one directly managed by
the exporter.

I really don't see the difference from the other dma-buf ops: in
dma_buf_map_attachment() the exporter maps the dma-buf on behalf of the
importer and into the importer's address space. Why would cache
maintenance be any different?
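
For comparison, the existing importer-side flow referenced here looks roughly like this (simplified; a real importer keeps the mapping around instead of tearing it down immediately):

#include <linux/dma-buf.h>
#include <linux/err.h>

static int import_buffer(struct device *dev, struct dma_buf *dmabuf)
{
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;

        /* The exporter gets to veto or special-case this device here. */
        attach = dma_buf_attach(dmabuf, dev);
        if (IS_ERR(attach))
                return PTR_ERR(attach);

        /* The exporter builds the DMA addresses for *this* importer. */
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt)) {
                dma_buf_detach(dmabuf, attach);
                return PTR_ERR(sgt);
        }

        /* ... program the device with sgt ... */

        dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
        dma_buf_detach(dmabuf, attach);
        return 0;
}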

> > > This is just a software solution which works because of coincident and
> > > not because of engineering.
> > By mandating a software fallback for the cases where you would need
> > bracketed access to the dma-buf, you simply shift the problem into
> > userspace. Userspace then creates the bracket by falling back to some
> > other import option that mostly do a copy and then the appropriate
> > cache maintenance.
> > 
> > While I understand your sentiment about the DMA-API design being
> > inconvenient when things are just coherent by system design, the DMA-
> > API design wasn't done this way due to bad engineering, but due to the
> > fact that performant DMA access on some systems just require this kind
> > of bracketing.
> 
> Well, this is exactly what I'm criticizing on the DMA-API. Instead of 
> giving you a proper error code when something won't work in a specific 
> way it just tries to hide the requirements inside the DMA layer.
> 
> For example when your device can only access 32bits the DMA-API 
> transparently insert bounce buffers instead of giving you a proper error 
> code that the memory in question can't be accessed.
> 
> This just tries to hide the underlying problem instead of pushing it 
> into the upper layer where it can be handled much more gracefully.

How would you expect the DMA API to behave on a system where the device
driver is operating on cacheable memory, but the device is non-
coherent? Telling the driver that this just doesn't work? It's a use-
case that works fine today with many devices (e.g. network
adapters) in the ARM world, exactly because the architecture-specific
implementation of the DMA API inserts the cache maintenance operations
on buffer ownership transfer.

Regards,
Lucas


^ permalink raw reply	[flat|nested] 110+ messages in thread


* Re: Try to address the DMA-buf coherency problem
  2022-11-02 17:10                           ` Lucas Stach
@ 2022-11-02 19:13                             ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-02 19:13 UTC (permalink / raw)
  To: Lucas Stach, Nicolas Dufresne, Daniel Stone
  Cc: ppaalanen, dri-devel, linaro-mm-sig, sumit.semwal, linux-media

On 02.11.22 at 18:10, Lucas Stach wrote:
> Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
> [SNIP]
>> It would just be doing this for the importer and exactly that
>> would be bad design because we then have handling for the display driver
>> outside of the driver.
>>
> The driver would have to do those cache maintenance operations if it
> directly worked with a non-coherent device. Doing it for the importer
> is just doing it for another device, not the one directly managed by
> the exporter.
>
> I really don't see the difference to the other dma-buf ops: in
> dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
> the address space of the importer. Why would cache maintenance be any
> different?

The issue here is the explicit ownership transfer.

We intentionally decided against that because it breaks tons of use 
cases and is, at least by me and a couple of others, seen as a general 
design failure of the Linux DMA-API.

DMA-buf lets the exporter set up the DMA addresses the importer uses, so 
the exporter can directly decide where a certain operation should go. E.g. we 
have cases where a P2P write doesn't even go to memory, but 
rather to a doorbell BAR to trigger another operation. Throwing in CPU 
round trips for explicit ownership transfer completely breaks that concept.

In addition to that, a very basic concept of DMA-buf is that the exporter 
provides the buffer as it is and just double-checks whether the importer can 
access it. For example we have XGMI links which make memory accessible 
to other devices on the same bus, but not to PCIe devices and not even to 
the CPU. Otherwise you wouldn't be able to implement things like secure 
decoding, where the data isn't even accessible outside the device-to-
device link.

So if a device driver uses cached system memory on an architecture with 
devices which can't access it, the right approach is clearly to reject 
the access.

What we can do is to reverse the roles of exporter and importer and 
let the device which needs uncached memory take control. This way that 
device can insert operations as needed, e.g. flush read caches or 
invalidate write caches.

This is what we have already done in DMA-buf and what already works 
perfectly fine with use cases which are even more complicated than a 
simple write cache invalidation.
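
A skeleton of that role reversal might look like the following; dma_buf_export() and DEFINE_DMA_BUF_EXPORT_INFO() are real API, while struct my_buffer and the op bodies are placeholders for whatever the device with the extra requirements actually needs:

#include <linux/dma-buf.h>

struct my_buffer {
        size_t size;
        /* ... backing storage owned by the device with the requirements ... */
};

static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
                                       enum dma_data_direction dir)
{
        /* Build an sg_table for attach->dev and insert whatever cache
         * maintenance (flush/invalidate) this exporter needs. */
        return NULL; /* placeholder */
}

static void my_unmap_dma_buf(struct dma_buf_attachment *attach,
                             struct sg_table *sgt,
                             enum dma_data_direction dir)
{
        /* Undo my_map_dma_buf(), again with any needed maintenance. */
}

static void my_release(struct dma_buf *dmabuf)
{
        /* Free the backing storage. */
}

static const struct dma_buf_ops my_dmabuf_ops = {
        .map_dma_buf   = my_map_dma_buf,
        .unmap_dma_buf = my_unmap_dma_buf,
        .release       = my_release,
};

static struct dma_buf *my_export(struct my_buffer *bo)
{
        DEFINE_DMA_BUF_EXPORT_INFO(exp_info);

        exp_info.ops  = &my_dmabuf_ops;
        exp_info.size = bo->size;
        exp_info.priv = bo;

        /* Because this driver is the exporter, its ops run on every
         * attach/map and it can insert the operations its device needs,
         * instead of relying on the importer to do so. */
        return dma_buf_export(&exp_info);
}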

>>>> This is just a software solution which works because of coincident and
>>>> not because of engineering.
>>> By mandating a software fallback for the cases where you would need
>>> bracketed access to the dma-buf, you simply shift the problem into
>>> userspace. Userspace then creates the bracket by falling back to some
>>> other import option that mostly do a copy and then the appropriate
>>> cache maintenance.
>>>
>>> While I understand your sentiment about the DMA-API design being
>>> inconvenient when things are just coherent by system design, the DMA-
>>> API design wasn't done this way due to bad engineering, but due to the
>>> fact that performant DMA access on some systems just require this kind
>>> of bracketing.
>> Well, this is exactly what I'm criticizing on the DMA-API. Instead of
>> giving you a proper error code when something won't work in a specific
>> way it just tries to hide the requirements inside the DMA layer.
>>
>> For example when your device can only access 32bits the DMA-API
>> transparently insert bounce buffers instead of giving you a proper error
>> code that the memory in question can't be accessed.
>>
>> This just tries to hide the underlying problem instead of pushing it
>> into the upper layer where it can be handled much more gracefully.
> How would you expect the DMA API to behave on a system where the device
> driver is operating on cacheable memory, but the device is non-
> coherent? Telling the driver that this just doesn't work?

Yes, exactly that.

It's the job of the higher level to prepare a buffer the device can work 
with, not the job of the lower level.

In other words, in a proper design the higher level would prepare the 
memory in a way the device driver can work with, not the other way 
around.

When a device driver gets memory it can't work with, the correct response 
is to throw an error and bubble that up into a layer where it can be 
handled gracefully.

For example, instead of the DMA layer using bounce buffers under the hood, 
the MM should make sure that when you call read() with O_DIRECT the 
pages in question are accessible by the device.
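
For reference, the user-facing side of that O_DIRECT example is just an aligned direct read; where the pages live and whether the device can reach them is entirely the kernel's problem (the file name, alignment and sizes below are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd = open("data.bin", O_RDONLY | O_DIRECT);

        /* O_DIRECT typically requires block-aligned buffers and lengths. */
        if (fd < 0 || posix_memalign(&buf, 4096, 1 << 20))
                return 1;

        /* The device DMAs straight into these pages, bypassing the page cache. */
        ssize_t ret = read(fd, buf, 1 << 20);

        free(buf);
        close(fd);
        return ret < 0;
}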

> It's a use-case that is working fine today with many devices (e.g. network
> adapters) in the ARM world, exactly because the architecture specific
> implementation of the DMA API inserts the cache maintenance operations
> on buffer ownership transfer.

Yeah, I'm perfectly aware of that. The problem is that exactly that 
design totally breaks GPUs on Xen DOM0, for example.

And Xen is just one example; I can certainly say from experience that 
this design was a really, really bad idea because it favors just one use 
case while making other use cases practically impossible, if not really 
hard, to implement.

Regards,
Christian.

>
> Regards,
> Lucas
>


^ permalink raw reply	[flat|nested] 110+ messages in thread


* Re: Try to address the DMA-buf coherency problem
  2022-11-02 19:13                             ` Christian König
@ 2022-11-02 19:48                               ` Lucas Stach
  -1 siblings, 0 replies; 110+ messages in thread
From: Lucas Stach @ 2022-11-02 19:48 UTC (permalink / raw)
  To: Christian König, Nicolas Dufresne, Daniel Stone
  Cc: ppaalanen, sumit.semwal, daniel, robdclark, dri-devel,
	linaro-mm-sig, linux-media

On Wednesday, 02.11.2022 at 20:13 +0100, Christian König wrote:
> Am 02.11.22 um 18:10 schrieb Lucas Stach:
> > Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
> > [SNIP]
> > > It would just be doing this for the importer and exactly that
> > > would be bad design because we then have handling for the display driver
> > > outside of the driver.
> > > 
> > The driver would have to do those cache maintenance operations if it
> > directly worked with a non-coherent device. Doing it for the importer
> > is just doing it for another device, not the one directly managed by
> > the exporter.
> > 
> > I really don't see the difference to the other dma-buf ops: in
> > dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
> > the address space of the importer. Why would cache maintenance be any
> > different?
> 
> The issue here is the explicit ownership transfer.
> 
> We intentionally decided against that because it breaks tons of use 
> cases and is at least by me and a couple of others seen as generally 
> design failure of the Linux DMA-API.
> 
> DMA-Buf let's the exporter setup the DMA addresses the importer uses to 
> be able to directly decided where a certain operation should go. E.g. we 
> have cases where for example a P2P write doesn't even go to memory, but 
> rather a doorbell BAR to trigger another operation. Throwing in CPU 
> round trips for explicit ownership transfer completely breaks that concept.

I'm not sure where you see the CPU in that scenario. An explicit
ownership transfer could also trigger things like a cache invalidate on
the receiving end of a P2P transfer. A CPU cache invalidate/flush is
just one possible use-case of the ownership transfer.

And sure, some uses will just not work when an explicit ownership
transfer is required, as the usage model and/or API doesn't allow for
it. But lots of uses will work with explicit ownership transfer; in fact
there is no other way to properly deal with systems where the
sharing masters may be non-coherent, which is the norm rather than the
exception in lots of embedded uses, regardless of the inconvenience
this might cause for some other areas.

> 
> Additional to that a very basic concept of DMA-buf is that the exporter 
> provides the buffer as it is and just double checks if the importer can 
> access it. For example we have XGMI links which makes memory accessible 
> to other devices on the same bus, but not to PCIe device and not even to 
> the CPU. Otherwise you wouldn't be able to implement things like secure 
> decoding where the data isn't even accessible outside the device to 
> device link.
> 
> So if a device driver uses cached system memory on an architecture which 
> devices which can't access it the right approach is clearly to reject 
> the access.
> 
> What we can do is to reverse the role of the exporter and importer and 
> let the device which needs uncached memory take control. This way this 
> device can insert operations as needed, e.g. flush read caches or 
> invalidate write caches.
> 
And that will only solve one aspect of the allocation restriction maze,
namely the cacheable attribute, but it can still break things like
contiguity requirements.

> This is what we have already done in DMA-buf and what already works 
> perfectly fine with use cases which are even more complicated than a 
> simple write cache invalidation.
> 
> > > > > This is just a software solution which works because of coincident and
> > > > > not because of engineering.
> > > > By mandating a software fallback for the cases where you would need
> > > > bracketed access to the dma-buf, you simply shift the problem into
> > > > userspace. Userspace then creates the bracket by falling back to some
> > > > other import option that mostly do a copy and then the appropriate
> > > > cache maintenance.
> > > > 
> > > > While I understand your sentiment about the DMA-API design being
> > > > inconvenient when things are just coherent by system design, the DMA-
> > > > API design wasn't done this way due to bad engineering, but due to the
> > > > fact that performant DMA access on some systems just require this kind
> > > > of bracketing.
> > > Well, this is exactly what I'm criticizing on the DMA-API. Instead of
> > > giving you a proper error code when something won't work in a specific
> > > way it just tries to hide the requirements inside the DMA layer.
> > > 
> > > For example when your device can only access 32bits the DMA-API
> > > transparently insert bounce buffers instead of giving you a proper error
> > > code that the memory in question can't be accessed.
> > > 
> > > This just tries to hide the underlying problem instead of pushing it
> > > into the upper layer where it can be handled much more gracefully.
> > How would you expect the DMA API to behave on a system where the device
> > driver is operating on cacheable memory, but the device is non-
> > coherent? Telling the driver that this just doesn't work?
> 
> Yes, exactly that.
> 
> It's the job of the higher level to prepare the buffer a device work 
> with, not the one of the lower level.
> 
> In other words in a proper design the higher level would prepare the 
> memory in a way the device driver can work with it, not the other way 
> around.
> 
> When a device driver gets memory it can't work with the correct response 
> is to throw an error and bubble that up into a layer where it can be 
> handled gracefully.
> 
How would this situation be handled more gracefully than with a cache
flush/invalidate? Certainly the solution wouldn't be to make the Linux
network stack operate on non-cacheable memory? To be clear: that's a
rhetorical question, as unaligned access, which is quite common in many
network uses, is also not allowed to non-cacheable memory on most
system designs, so this isn't even an option.

> For example instead of using bounce buffers under the hood the DMA layer 
> the MM should make sure that when you call read() with O_DIRECT that the 
> pages in question are accessible by the device.
> 
> > It's a use-case that is working fine today with many devices (e.g. network
> > adapters) in the ARM world, exactly because the architecture specific
> > implementation of the DMA API inserts the cache maintenance operations
> > on buffer ownership transfer.
> 
> Yeah, I'm perfectly aware of that. The problem is that exactly that 
> design totally breaks GPUs on Xen DOM0 for example.
> 
> And Xen is just one example, I can certainly say from experience that 
> this design was a really really bad idea because it favors just one use 
> case while making other use cases practically impossible if not really 
> hard to implement.

So maybe we should agree that we both have uses in mind which are
practically impossible to implement when restricting dma-buf to one
world view? You are thinking about uses that need full coherence
without any bracketing, and I'm thinking about uses that won't ever work
with usable performance without bracketing.

So maybe we should circle back to Daniel's suggestion to have a flag to
keep both worlds apart? Rejecting the import, as you do in this series,
when the "must not use bracketing" flag is set, and allowing explicit
ownership transfers otherwise, seems like a workable solution to me.
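
Purely to illustrate the shape of that suggestion (the flag and the helper below are hypothetical names, not existing API, and this is not the check from this patch series):

#include <linux/dma-buf.h>
#include <linux/err.h>

/* Hypothetical sketch of keeping the two worlds apart at attach time. */
static struct dma_buf_attachment *
attach_with_policy(struct dma_buf *dmabuf, struct device *dev)
{
        /*
         * "no_bracketing" and device_needs_bracketing() do not exist today;
         * the idea is just: if the exporter forbids bracketed access and the
         * importer can only work with it, refuse the attach so userspace can
         * fall back (copy, role reversal, ...).
         */
        if (dmabuf->no_bracketing && device_needs_bracketing(dev))
                return ERR_PTR(-EOPNOTSUPP);

        return dma_buf_attach(dmabuf, dev);
}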

Regards,
Lucas


^ permalink raw reply	[flat|nested] 110+ messages in thread


* Re: Try to address the DMA-buf coherency problem
  2022-11-02 11:18                     ` Christian König
@ 2022-11-03 22:16                       ` Nicolas Dufresne
  -1 siblings, 0 replies; 110+ messages in thread
From: Nicolas Dufresne @ 2022-11-03 22:16 UTC (permalink / raw)
  To: Christian König, Daniel Stone
  Cc: Lucas Stach, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

On Wednesday, 2 November 2022 at 12:18 +0100, Christian König wrote:
> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > [SNIP]
> > > > But the client is just a video player. It doesn't understand how to
> > > > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > > > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > > > response to the client.
> > > Well exactly that's the point I'm raising: The client *must* understand
> > > that!
> > > 
> > > See we need to be able to handle all restrictions here, coherency of the
> > > data is just one of them.
> > > 
> > > For example the much more important question is the location of the data
> > > and for this allocating from the V4L2 device is in most cases just not
> > > going to fly.
> > It feels like this is a generic statement and there is no reason it could not be
> > the other way around.
> 
> And exactly that's my point. You always need to look at both ways to 
> share the buffer and can't assume that one will always work.
> 
> As far as I can see it you guys just allocate a buffer from a V4L2 
> device, fill it with data and send it to Wayland for displaying.

That paragraph is a bit sloppy. By "you guys" you mean what exactly? Normal
users will let the V4L2 device allocate and write into its own memory (the device
fills it, not "you guys"). This is done simply because it is guaranteed to work
with the V4L2 device. Most V4L2 devices produce pixel formats and layouts known
to userspace, for which userspace knows for sure it can implement a GPU shader
or a software fallback. I have yet to see one of these formats that cannot be
efficiently imported into a modern GPU and converted using shaders. I'm not
entirely sure what/which GPU a dGPU is compared to a GPU, btw.

In many cases, camera-like V4L2 devices will have one producer for many
consumers. Consider your photo application: the stream will likely be captured
and displayed while being encoded by one or more CODECs and streamed to a
machine learning model for analysis. The software complexity of communicating
back the list of receiving devices and implementing all their non-standard ways
to allocate memory, with all the resulting combinations of trial and error, is
just ridiculously high. Remember that each GPU has its own allocation methods
and corner cases; this is simply not manageable by "you guys", which I pretty
much assume is everyone writing software for generic Linux these days
(non-Android/ChromeOS).
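
For readers outside the V4L2 world, the allocation pattern being defended here is roughly the following (single-plane capture, no error handling, device path assumed):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int main(void)
{
        int fd = open("/dev/video0", O_RDWR);
        struct v4l2_requestbuffers req;
        struct v4l2_exportbuffer exp;

        /* Ask the V4L2 device to allocate its own capture buffers. */
        memset(&req, 0, sizeof(req));
        req.count  = 4;
        req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_MMAP;
        ioctl(fd, VIDIOC_REQBUFS, &req);

        /* Export buffer 0 as a dma-buf fd that can be handed to the
         * display, the GPU, an encoder, an ML accelerator, ... */
        memset(&exp, 0, sizeof(exp));
        exp.type  = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        exp.index = 0;
        ioctl(fd, VIDIOC_EXPBUF, &exp);

        /* exp.fd now refers to the device-allocated buffer. */
        return 0;
}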

> 
> To be honest I'm really surprised that the Wayland guys hasn't pushed 
> back on this practice already.
> 
> This only works because the Wayland as well as X display pipeline is 
> smart enough to insert an extra copy when it find that an imported 
> buffer can't be used as a framebuffer directly.

This is a bit inaccurate. The compositors I've worked with (GNOME and Weston)
will only memcpy SHM. For DMABuf, they will fail the import if it is not usable
either by the display or by the GPU. Especially on the GPU side though (which is
the ultimate compositor fallback), there are efficient HW copy mechanisms that
may be used, and this is fine: unlike your scanout example, it won't be
uploading over and over, but will later re-display from a remote copy (or a
transformed copy). Or, if you prefer, it's cached at the cost of higher memory
usage.

I think it would be preferable to speak about device-to-device sharing, since
V4L2 vs. GPU is not really representative of the problem. I think "V4L2 vs. GPU"
and "you guys" simply contribute to the never-ending and needless friction
around the difficulty that exists with the current support for memory sharing in
Linux.

> 
> >   I have colleague who integrated PCIe CODEC (Blaize Xplorer
> > X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways
> > to use it. Of course, in current state of DMABuf, you have to be an exporter to
> > do anything fancy, but it did not have to be like this, its a design choice. I'm
> > not sure in the end what was the final method used, the driver isn't yet
> > upstream, so maybe that is not even final. What I know is that there is various
> > condition you may use the CODEC for which the optimal location will vary. As an
> > example, using the post processor or not, see my next comment for more details.
> 
> Yeah, and stuff like this was already discussed multiple times. Local 
> memory of devices can only be made available by the exporter, not the 
> importer.
> 
> So in the case of separated camera and encoder you run into exactly the 
> same limitation that some device needs the allocation to happen on the 
> camera while others need it on the encoder.
> 
> > > The more common case is that you need to allocate from the GPU and then
> > > import that into the V4L2 device. The background is that all dGPUs I
> > > know of need the data inside local memory (VRAM) to be able to scan out
> > > from it.
> > The reality is that what is common to you, might not be to others. In my work,
> > most ARM SoC have display that just handle direct scannout from cameras and
> > codecs.
> 
> > The only case the commonly fails is whenever we try to display UVC
> > created dmabuf,
> 
> Well, exactly that's not correct! The whole x86 use cases of direct 
> display for dGPUs are broken because media players think they can do the 
> simple thing and offload all the problematic cases to the display server.
> 
> This is absolutely *not* the common use case you describe here, but 
> rather something completely special to ARM.

sigh .. The UVC failures were first discovered on my Intel PC, and later
reproduced on ARM. Userspace expected the drivers (V4L2 exports, DRM imports)
to reject the DMABuf import (I kind of know, I wrote this part). From a
userspace point of view, unlike what you stipulate here, there was no fault.
You already said that the importer/exporter roles are to be tried, and the
order you try them in should not matter. So yes, today's userspace may lack the
ability to flip the roles, but at least it tries, and if the driver does not
fail, you can't blame userspace for at least trying to achieve decent
performance.

I'd like to remind you that these are clearly all kernel bugs, and we cannot
state that kernel drivers "are broken because of media players". Just the fact
that this thread starts from a kernel change kind of proves it. It would also
be nice for you to understand that I'm not against the method used in this
patchset, but I'm not against a bracketing mechanism either; I think bracketing
can improve the situation, whereas the first approach only gives more "correct"
results.

> 
> >   which have dirty CPU write cache and this is the type of thing
> > we'd like to see solved. I think this series was addressing it in principle, but
> > failing the import and the raised point is that this wasn't the optimal way.
> > 
> > There is a community project called LibreELEC, if you aren't aware, they run
> > Khodi with direct scanout of video stream on a wide variety of SoC and they use
> > the CODEC as exporter all the time. They simply don't have cases were the
> > opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG
> > does not really offer you any API to reverse the allocation.
> 
> Ok, let me try to explain it once more. It sounds like I wasn't able to 
> get my point through.
> 
> That we haven't heard anybody screaming that x86 doesn't work is just 
> because we handle the case that a buffer isn't directly displayable in 
> X/Wayland anyway, but this is absolutely not the optimal solution.

Basically, you are complaining that compositors will use GPU shaders to adapt
the buffers for the display. Most displays have no or only limited YUV support;
flipping the roles or bracketing won't help with that. Using a GPU shader to
adapt the buffer, as compositors and userspace do, seems all right. And yes,
sometimes the memory will get imported into the GPU very efficiently, sometimes
it lands somewhere in the mid-range, and other times some GPU stack (which is
userspace) will memcpy. But remember that the GPU stack is programmed to work
with a specific GPU, not with the higher-level userland.

> 
> The argument that you want to keep the allocation on the codec side is 
> completely false as far as I can see.

I haven't made this argument, and I don't intend to. Nothing in this thread
should be interpreted as what I want or don't want. I want the same thing as
everyone on this list, which is both performance and correct results.

> 
> We already had numerous projects where we reported this practice as bugs 
> to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

Links? Remember that I read every single bug report and email around the
GStreamer project. I maintain both the older and the newer V4L2 support in
there. I also contributed a lot to the mechanism GStreamer has in place to
reverse the allocation. In fact, it is implemented; the problem is that on
generic Linux, the receiving elements, like the GL element and the display
sink, don't have any API they can rely on to allocate memory. Thus, they don't
implement what we call the allocation offer in GStreamer terms. Very often
though, on other modern OSes, or in APIs like VA, the memory offer is replaced
by a context, so the allocation is done from a "context" which is neither an
importer nor an exporter. This is mostly found on macOS and Windows.
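
To make "allocation offer" concrete, a downstream sink would answer the
ALLOCATION query with a pool backed by memory it can actually consume, roughly
like the sketch below. This is only an illustration, not existing code; the
commented-out allocator is hypothetical and is exactly what generic sinks are
missing today.

#include <gst/gst.h>
#include <gst/base/gstbasesink.h>
#include <gst/video/video.h>
#include <gst/video/gstvideopool.h>

static gboolean
my_sink_propose_allocation (GstBaseSink * sink, GstQuery * query)
{
  GstCaps *caps;
  gboolean need_pool;
  GstVideoInfo info;
  GstBufferPool *pool;
  GstStructure *config;

  gst_query_parse_allocation (query, &caps, &need_pool);
  if (!caps || !gst_video_info_from_caps (&info, caps))
    return FALSE;

  if (need_pool) {
    pool = gst_video_buffer_pool_new ();
    config = gst_buffer_pool_get_config (pool);
    gst_buffer_pool_config_set_params (config, caps, info.size, 2, 0);
    /* The interesting part would go here, but generic sinks have no
     * device-specific allocator to offer:
     * gst_buffer_pool_config_set_allocator (config, my_device_allocator, NULL);
     */
    gst_buffer_pool_set_config (pool, config);
    gst_query_add_allocation_pool (query, pool, info.size, 2, 0);
    gst_object_unref (pool);
  }

  return TRUE;
}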

Were APIs suggested to actually make it manageable for userland to allocate
from the GPU? Yes, that's what the Linux Device Allocator idea is for. Is that
API ready? No.

Can we at least implement some DRM memory allocation? Yes, but remember that,
until very recently, the DRM device used by the display path was not exposed
through Wayland. This has only been resolved recently, and it will take some
time before it propagates through compositor code. And you need the compositor
implementation before you can do the GL and multimedia stack implementations.
Please keep in mind, before calling out bad practice by GStreamer and FFmpeg
developers, that getting all the bits and pieces in place requires back and
forth, and there have been huge gaps that these developers were not able to
overcome yet. Also, remember that these stacks don't have any contract to
support Linux. They support it to the best of their knowledge and capabilities,
along with Windows, macOS, iOS, Android and more. And in my experience, memory
sharing has different challenges on all of these OSes.

> 
> This is just a software solution which works because of coincident and 
> not because of engineering.

Another argument I can't really agree with: there is a lot of effort put into
fallbacks (mostly GPU fallbacks) in the various software stacks. These
fallbacks are engineered to guarantee that you can display your frames. The
case I've raised should have ended well with a GPU/CPU fallback, but a kernel
bug broke the ability to fall back. If the kernel had rejected the import (your
series?), or offered a bracketing mechanism (for the UVC case, both methods
would have worked), the end result would have just worked.

I would not disagree if someone stated that DMAbuf in the UVC driver is an
abuse. The driver simply memcpys chunks of variable-size data streamed by the
USB camera into a normal memory buffer. So why is that exported as a dmabuf? I
don't have a strong opinion on that, but if you think this is wrong, then it
proves my point that this is a kernel bug. The challenge here is to come up
with how we will fix this, and sharing a good understanding of what today's
userspace does, and why it does so, is key to making proper designs. As I said
at the start, writing code for the DMABuf subsystem is out of reach for me; I
can only share what existing software does, and why it does it like this.

Nicolas

> 
> Regards,
> Christian.
> 
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-03 22:16                       ` Nicolas Dufresne
@ 2022-11-04  9:03                         ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-04  9:03 UTC (permalink / raw)
  To: Nicolas Dufresne, Daniel Stone
  Cc: Lucas Stach, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

Am 03.11.22 um 23:16 schrieb Nicolas Dufresne:
> [SNIP]
>> We already had numerous projects where we reported this practice as bugs
>> to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.
> Links ? Remember that I do read every single bugs and emails around GStreamer
> project. I do maintain older and newer V4L2 support in there. I also did
> contribute a lot to the mechanism GStreamer have in-place to reverse the
> allocation. In fact, its implemented, the problem being that on generic Linux,
> the receiver element, like the GL element and the display sink don't have any
> API they can rely on to allocate memory. Thus, they don't implement what we call
> the allocation offer in GStreamer term. Very often though, on other modern OS,
> or APIs like VA, the memory offer is replaced by a context. So the allocation is
> done from a "context" which is neither an importer or an exporter. This is
> mostly found on MacOS and Windows.
>
> Was there APIs suggested to actually make it manageable by userland to allocate
> from the GPU? Yes, this what Linux Device Allocator idea is for. Is that API
> ready, no.

Well, that stuff is absolutely ready: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_heap.c#L175 
What do you think I'm talking about all the time?

DMA-buf has a lengthy section about CPU access to buffers and clearly 
documents how all of that is supposed to work: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L1160 
This includes bracketing of CPU access with dma_buf_begin_cpu_access() 
and dma_buf_end_cpu_access(), as well as transaction management between 
devices and the CPU, and even implicit synchronization.
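
For illustration only (this is not code from the series), an in-kernel
importer is expected to bracket its CPU accesses roughly as in the sketch
below; clear_imported_buffer() is a made-up helper and cpu_addr is assumed to
come from an earlier dma_buf_vmap():

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/string.h>

/* Hypothetical helper: clear an imported buffer with the CPU while
 * respecting the DMA-buf CPU-access bracketing contract. */
static int clear_imported_buffer(struct dma_buf *dmabuf, void *cpu_addr,
				 size_t len)
{
	int ret;

	/* Tell the exporter the CPU is about to write the backing storage. */
	ret = dma_buf_begin_cpu_access(dmabuf, DMA_TO_DEVICE);
	if (ret)
		return ret;

	memset(cpu_addr, 0, len);

	/* Hand the buffer back to devices; caches are flushed as needed. */
	return dma_buf_end_cpu_access(dmabuf, DMA_TO_DEVICE);
}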

This specification is then implemented by the different drivers 
including V4L2: 
https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf2/videobuf2-dma-sg.c#L473

As well as the different DRM drivers: 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L117 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L234

This design was then used by us with various media players on different 
customer projects, including QNAP https://www.qnap.com/en/product/ts-877 
as well as the newest Tesla 
https://www.amd.com/en/products/embedded-automotive-solutions

I won't go into the details here, but we are using exactly the approach 
I've outlined to let userspace control the DMA between the different 
devices in question. I'm one of the main designers of that, and our 
multimedia and Mesa teams have upstreamed quite a number of changes for 
this project.

I'm not that deep into the different ARM-based solutions because we are only 
recently getting results showing that this starts to work with AMD GPUs, but 
I'm pretty sure that the design should be able to handle those as well.

So we have clearly proven that this design works, even with special 
requirements which are way more complex than what we are discussing 
here. We had cases where we used GStreamer to feed DMA-buf handles into 
multiple devices with different format requirements, and that seems to 
work fine.

-----

But enough of this rant. As I wrote to Lucas as well, this doesn't help us 
any further in the technical discussion.

The only technical argument I have is that if some userspace 
applications fail to use the provided UAPI while others use it correctly, 
then this is clearly not a good reason to change the UAPI, but rather an 
argument for changing the applications.

If the application should be kept simple and device independent, then 
allocating the buffer from the device-independent DMA heaps would be 
enough as well, because that provider implements the necessary handling 
for dma_buf_begin_cpu_access() and dma_buf_end_cpu_access().

I'm a bit surprised that we are arguing about stuff like this, because we 
spent a lot of effort trying to document it. Daniel gave me the job to 
fix this documentation, but after reading through it multiple times now 
I can't seem to find where the design and the desired behavior are unclear.

What is clearly a bug in the kernel is that we don't reject things which 
won't work correctly, and this is what this patch addresses. What we 
could talk about is backward compatibility for this patch, because it 
might look like it breaks things which previously used to work, at least 
partially.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH 3/3] media: videobuf2: set dma_coherent flag for DMA-buf
  2022-10-20 12:13 ` [PATCH 3/3] media: videobuf2: set dma_coherent flag for DMA-buf Christian König
@ 2022-11-04 10:42   ` Hans Verkuil
  0 siblings, 0 replies; 110+ messages in thread
From: Hans Verkuil @ 2022-11-04 10:42 UTC (permalink / raw)
  To: Christian König, l.stach, nicolas, ppaalanen, sumit.semwal,
	daniel, robdclark, dri-devel, linaro-mm-sig, linux-media
  Cc: Christian König

Hi Christian,

On 20/10/2022 14:13, Christian König wrote:
> When a device is snooping the CPU cache we assume that all importers
> must snoop the CPU cache as well.
> 
> The exception is vmalloc allocations: since those implement mmap(), importers
> must always snoop the cache or we will run into coherency problems.

Looks reasonable.

For this series:

Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>

Regards,

	Hans

> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/media/common/videobuf2/videobuf2-dma-contig.c | 2 ++
>  drivers/media/common/videobuf2/videobuf2-dma-sg.c     | 2 ++
>  drivers/media/common/videobuf2/videobuf2-vmalloc.c    | 1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c
> index 555bd40fa472..57433310c55c 100644
> --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c
> +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c
> @@ -11,6 +11,7 @@
>   */
>  
>  #include <linux/dma-buf.h>
> +#include <linux/dma-map-ops.h>
>  #include <linux/module.h>
>  #include <linux/refcount.h>
>  #include <linux/scatterlist.h>
> @@ -507,6 +508,7 @@ static struct dma_buf *vb2_dc_get_dmabuf(struct vb2_buffer *vb,
>  	exp_info.size = buf->size;
>  	exp_info.flags = flags;
>  	exp_info.priv = buf;
> +	exp_info.coherent = dev_is_dma_coherent(vb->vb2_queue->dev);
>  
>  	if (!buf->sgt_base)
>  		buf->sgt_base = vb2_dc_get_base_sgt(buf);
> diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> index 36981a5b5c53..dbdd753e4a39 100644
> --- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> +++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c
> @@ -10,6 +10,7 @@
>   * the Free Software Foundation.
>   */
>  
> +#include <linux/dma-map-ops.h>
>  #include <linux/module.h>
>  #include <linux/mm.h>
>  #include <linux/refcount.h>
> @@ -522,6 +523,7 @@ static struct dma_buf *vb2_dma_sg_get_dmabuf(struct vb2_buffer *vb,
>  	exp_info.size = buf->size;
>  	exp_info.flags = flags;
>  	exp_info.priv = buf;
> +	exp_info.coherent = dev_is_dma_coherent(vb->vb2_queue->dev);
>  
>  	if (WARN_ON(!buf->dma_sgt))
>  		return NULL;
> diff --git a/drivers/media/common/videobuf2/videobuf2-vmalloc.c b/drivers/media/common/videobuf2/videobuf2-vmalloc.c
> index 41db707e43a4..0b6874733e86 100644
> --- a/drivers/media/common/videobuf2/videobuf2-vmalloc.c
> +++ b/drivers/media/common/videobuf2/videobuf2-vmalloc.c
> @@ -341,6 +341,7 @@ static struct dma_buf *vb2_vmalloc_get_dmabuf(struct vb2_buffer *vb,
>  	exp_info.size = buf->size;
>  	exp_info.flags = flags;
>  	exp_info.priv = buf;
> +	exp_info.coherent = true;
>  
>  	if (WARN_ON(!buf->vaddr))
>  		return NULL;


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-04  9:03                         ` Christian König
@ 2022-11-04 13:38                           ` Nicolas Dufresne
  -1 siblings, 0 replies; 110+ messages in thread
From: Nicolas Dufresne @ 2022-11-04 13:38 UTC (permalink / raw)
  To: Christian König, Daniel Stone
  Cc: Lucas Stach, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

Le vendredi 04 novembre 2022 à 10:03 +0100, Christian König a écrit :
> Am 03.11.22 um 23:16 schrieb Nicolas Dufresne:
> > [SNIP]
> > 
> > Was there APIs suggested to actually make it manageable by userland to allocate
> > from the GPU? Yes, this what Linux Device Allocator idea is for. Is that API
> > ready, no.
> 
> Well, that stuff is absolutely ready: 
> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_heap.c#L175 
> What do you think I'm talking about all the time?

I'm aware of DMA Heaps; they still have a few gaps, but that's unrelated to
coherency (we can discuss it offline, with Daniel S.). DMABuf Heaps are used in
many forks by vendors in production. There is an upstream proposal for
GStreamer, but the review comments were never addressed; in short, it's stalled
and waiting for a volunteer. It might also be based on a very old
implementation of DMABuf Heaps, so it needs to be verified in depth, as time
has passed.

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1391

> 
> DMA-buf has a lengthy section about CPU access to buffers and clearly 
> documents how all of that is supposed to work: 
> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L1160 
> This includes braketing of CPU access with dma_buf_begin_cpu_access() 
> and dma_buf_end_cpu_access(), as well as transaction management between 
> devices and the CPU and even implicit synchronization.
> 
> This specification is then implemented by the different drivers 
> including V4L2: 
> https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf2/videobuf2-dma-sg.c#L473
> 
> As well as the different DRM drivers: 
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L117 
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L234

I know, I implemented the userspace bracketing for this in GStreamer [1] before
DMAbuf Heaps were merged, and I was one of the reporters of the missing
bracketing in VB2. It was tested against the i915 driver. Note that this is
just a fallback: the performance is terrible, since memory exported by (at
least my old) i915 HW is not CPU-cacheable. Though, between a corrupted image
with bad performance and just bad performance, we decided the latter was
better. When the DMABuf is backed by CPU-cacheable memory, performance is great
and the CPU fallback works. Work is in progress to better handle these two
cases generically. For now, the app sometimes needs to get involved, but this
only happens in embedded/controlled kinds of use cases. What matters is that an
application that needs this can do it.

[1] https://gitlab.freedesktop.org/gstreamer/gstreamer/-/blob/main/subprojects/gst-plugins-base/gst-libs/gst/allocators/gstdmabuf.c
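
For reference, the userspace side of that bracketing boils down to something
like the sketch below (a minimal example against the DMA_BUF_IOCTL_SYNC uapi,
not the actual GStreamer code):

#include <linux/dma-buf.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Write into a dma-buf with the CPU, bracketed by DMA_BUF_IOCTL_SYNC. */
static int cpu_write_dmabuf(int dmabuf_fd, size_t size)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
	};
	void *map;

	map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, dmabuf_fd, 0);
	if (map == MAP_FAILED)
		return -1;

	/* Begin CPU access: the exporter flushes/invalidates caches as needed. */
	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync) < 0) {
		munmap(map, size);
		return -1;
	}

	memset(map, 0, size);	/* CPU touches the buffer */

	/* End CPU access: make the CPU writes visible to devices again. */
	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

	munmap(map, size);
	return 0;
}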

> 
> This design was then used by us with various media players on different 
> customer projects, including QNAP https://www.qnap.com/en/product/ts-877 
> as well as the newest Tesla 
> https://www.amd.com/en/products/embedded-automotive-solutions
> 
> I won't go into the details here, but we are using exactly the approach 
> I've outlined to let userspace control the DMA between the different 
> device in question. I'm one of the main designers of that and our 
> multimedia and mesa team has up-streamed quite a number of changes for 
> this project.
> 
> I'm not that well into different ARM based solutions because we are just 
> recently getting results that this starts to work with AMD GPUs, but I'm 
> pretty sure that the design should be able to handle that as well.
> 
> So we have clearly prove that this design works, even with special 
> requirements which are way more complex than what we are discussing 
> here. We had cases where we used GStreamer to feed DMA-buf handles into 
> multiple devices with different format requirements and that seems to 
> work fine.

It sounds like you have a love/hate relationship with GStreamer. Glad the
framework is working for you too. The framework has had bidirectional memory
allocation for over a decade, and it also has context sharing for stacks like
D3D11/D3D12/GL/Vulkan/CUDA etc. I simply didn't understand what you were
complaining about. As a vendor, you can solve all of this in your BSP. Though,
translating BSP patches into generic upstreamable features is not as simple.
The solution that works for a vendor is usually the most cost-effective one.
I'm sure Tesla or AMD Automotive are no exceptions.

> 
> -----
> What is clearly a bug in the kernel is that we don't reject things which 
> won't work correctly and this is what this patch here addresses. What we 
> could talk about is backward compatibility for this patch, cause it 
> might look like it breaks things which previously used to work at least 
> partially.

I did read your code review (I don't class this discussion as a code review).
You were asked to address several issues, so clearly a v2 is needed.

1. Rob Clark stated that coherency is not homogeneous in many device drivers,
so your patch will yield many false positives. He asked if you could think of a
"per attachment" solution, as splitting drivers didn't seem like the best
approach (and it would have to happen at the same time anyway).

2. Lucas raised a concern, unfortunately without proof yet, that this may cause
regressions in existing userspace applications. You stated that the
applications must be wrong, yet this would break Linus' rule #1. I'm not
participating in that discussion; I tried to reproduce it, but apart from
writing an app that will break, I haven't found anything yet. But it feels like
the way forward is to ensure each exporter driver can override this in case it
has a good reason to do so.

As a third option, a safer approach would be to avoid reusing a flawed
mechanism to detect device coherency (or, rephrased, to accept that device
coherency isn't homogeneous) and to communicate this back another way. This
means patching all the drivers, but at least each driver maintainer will be
able to judge, with their HW knowledge, whether this is going to regress or
not. When I first looked at it, I wanted to experiment with enabling caching in
the vb2 contiguous allocator. I was thinking that perhaps I could have
bracketing, and state changes triggered by the attach call, all based on device
coherency, but now that I've seen Rob Clark's comment, I feel like this is
flawed.

happy v2,
Nicolas

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-04 13:38                           ` Nicolas Dufresne
@ 2022-11-04 14:51                             ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-04 14:51 UTC (permalink / raw)
  To: Nicolas Dufresne, Daniel Stone
  Cc: Lucas Stach, ppaalanen, sumit.semwal, daniel, robdclark,
	dri-devel, linaro-mm-sig, linux-media

Am 04.11.22 um 14:38 schrieb Nicolas Dufresne:
> Le vendredi 04 novembre 2022 à 10:03 +0100, Christian König a écrit :
>> Am 03.11.22 um 23:16 schrieb Nicolas Dufresne:
>>> [SNIP]
>>>
>>> Was there APIs suggested to actually make it manageable by userland to allocate
>>> from the GPU? Yes, this what Linux Device Allocator idea is for. Is that API
>>> ready, no.
>> Well, that stuff is absolutely ready:
>> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_heap.c#L175
>> What do you think I'm talking about all the time?
> I'm aware of DMA Heap, still have few gaps, but this unrelated to coherency (we
> can discuss offline, with Daniel S.). For DMABuf Heap, its used in many forks by
> vendors in production. There is an upstream proposal for GStreamer, but review
> comments were never addressed, in short, its stalled, and it waiting for a
> volunteer. It might also be based on very old implementation of DMABuf Heap,
> needs to be verified in depth for sure as the time have passed.
>
> https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1391

Well, this is actually quite related to coherency.

As a general-purpose framework, it's mandatory for DMA Heaps to support the 
coherency callbacks. And as far as I can see, they are indeed correctly 
implemented.

So if you allocate from a DMA Heap, it should be guaranteed that sharing 
with all devices works correctly, no matter whether they write with the CPU or 
don't snoop the CPU cache. That's essentially what those heaps are good for.

I just don't know how you should select between the contiguous and the 
system heap, since I'm not deep enough into the userspace side of this.

We should have come up with heaps earlier and not implemented so many 
different system memory allocators in the drivers in the first place.
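
For reference, allocating from a heap in userspace is roughly the sketch below
(against the dma-heap uapi; which heap path to open, "system" vs. the
contiguous/"cma" one, is exactly the selection question above):

#include <linux/dma-heap.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Allocate a dma-buf from a DMA heap, e.g. "/dev/dma_heap/system". */
static int alloc_from_heap(const char *heap_path, size_t size)
{
	struct dma_heap_allocation_data alloc = {
		.len = size,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	int heap_fd, ret;

	heap_fd = open(heap_path, O_RDONLY | O_CLOEXEC);
	if (heap_fd < 0)
		return -1;

	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
	close(heap_fd);

	/* On success, alloc.fd is a dma-buf whose exporter implements the
	 * CPU-access callbacks, so it can be shared with both snooping and
	 * non-snooping devices. */
	return ret < 0 ? -1 : (int)alloc.fd;
}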

>> DMA-buf has a lengthy section about CPU access to buffers and clearly
>> documents how all of that is supposed to work:
>> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L1160
>> This includes braketing of CPU access with dma_buf_begin_cpu_access()
>> and dma_buf_end_cpu_access(), as well as transaction management between
>> devices and the CPU and even implicit synchronization.
>>
>> This specification is then implemented by the different drivers
>> including V4L2:
>> https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf2/videobuf2-dma-sg.c#L473
>>
>> As well as the different DRM drivers:
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L117
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L234
> I know, I've implement the userspace bracketing for this in GStreamer [1] before
> DMAbuf Heap was merged and was one of the reporter for the missing bracketing in
> VB2. Was tested against i915 driver. Note, this is just a fallback, the
> performance is terrible, memory exported by (at least my old i915 HW) is not
> cacheable on CPU. Though, between corrupted image and bad performance or just
> bad performance, we decided that it was better to have the second. When the
> DMABuf is backed by CPU cacheable memory, peformance is great and CPU fallback
> works. Work is in progress to better handle these two cases generically. For
> now, sometimes the app need to get involved, but this is only happens on
> embedded/controlled kind of use cases. What matters is that application that
> needs this can do it.
>
> [1] https://gitlab.freedesktop.org/gstreamer/gstreamer/-/blob/main/subprojects/gst-plugins-base/gst-libs/gst/allocators/gstdmabuf.c

Yeah, I've heard complaints about that before, with people asking whether we 
couldn't map buffers exported by the GPU as cacheable and then flush them 
on demand.

For i915 that's more or less possible because it just uses stolen system 
memory, but pretty much every other dGPU uses local memory, and mapping a 
PCIe BAR cacheable is really a no-go.

>> This design was then used by us with various media players on different
>> customer projects, including QNAP https://www.qnap.com/en/product/ts-877
>> as well as the newest Tesla
>> https://www.amd.com/en/products/embedded-automotive-solutions
>>
>> I won't go into the details here, but we are using exactly the approach
>> I've outlined to let userspace control the DMA between the different
>> device in question. I'm one of the main designers of that and our
>> multimedia and mesa team has up-streamed quite a number of changes for
>> this project.
>>
>> I'm not that well into different ARM based solutions because we are just
>> recently getting results that this starts to work with AMD GPUs, but I'm
>> pretty sure that the design should be able to handle that as well.
>>
>> So we have clearly prove that this design works, even with special
>> requirements which are way more complex than what we are discussing
>> here. We had cases where we used GStreamer to feed DMA-buf handles into
>> multiple devices with different format requirements and that seems to
>> work fine.
> Sounds like you have a love/hate relationship with GStreamer. Glad the framework
> is working for you too. The framework have had bidirectional memory allocation
> for over a decade, it also has context sharing for stacks like
> D3D11,12/GL/Vulkan/CUDA etc. I strictly didn't understand what you were
> complaining about. As a vendor, you can solve all this in your BSP. Though,
> translating BSP patches into a generic upstream-able features is not as simple.
> The solution that works for vendor is usually the most cost effective one. I'm
> sure, Tesla or AMD Automotive are no exceptions.

The framework is indeed great. It's just that customers often come up 
with their own sources/sinks, which then obviously don't go upstream.

The LGPL was sometimes helpful for getting people involved, but most often it 
just leaves too much opportunity to develop a throw-away solution.

>> -----
>> What is clearly a bug in the kernel is that we don't reject things which
>> won't work correctly and this is what this patch here addresses. What we
>> could talk about is backward compatibility for this patch, cause it
>> might look like it breaks things which previously used to work at least
>> partially.
> I did read your code review (I don't class this discussion as a code review).
> You where asked to address several issues, clearly a v2 is needed.
>
> 1. Rob Clark stated the coherency is not homogeneous in many device drivers. So
> your patch will yield many false positives. He asked if you could think of a
> "per attachement" solution, as splitting drivers didn't seem like the best
> approach (and it would have to happen at the same time anyway).

That should already be addressed. The flag is intentionally per buffer 
and only gets checked as a precondition by the framework itself.

So, for example, importing a buffer which requires CPU cache coherency is 
still possible, but then creating the framebuffer fails. This is pretty 
much the handling we already have, just no longer driver specific.

> 2. Lucas raised a concern, unfortunately no proof yet, that this may cause
> regressions in existing userspace application. You stated that the application
> must be wrong, yet this would break Linus rule #1. I'm not participating in that
> discussion, I tried to reproduced, but appart from writing an app that will
> break, I haven't found yet. But it feels like the way forward is to ensure each
> exporter driver can override this in case it has a good reason to do so.

Yes, that's what I meant when I said we should discuss this. Essentially we 
have the choice between an application using corrupted data and it breaking 
in the first place. I tend to say that the second is less harmful, but that's 
not a hard opinion.

> As a third option, a safer approach would be to avoid reusing a flawed mechanism
> to detect device coherency (or rephrased, accept that device coherency isn't
> homogeneous). Communicate this back another way. this means patching all the
> drivers, but at least each driver maintainer will be able to judge with their HW
> knowledge if this is going to regress or not. When I first looked at it, I
> wanted to experiment with enabling cache in vb2 contiguous allocator. I was
> thinking that perhaps I could have bracketing, and state changes triggered by
> the attach call, all based on the device coherency, but now that I saw Rob Clark
> comment, I feel like this is flawed.

Yeah, that won't work. Daniel V also noted multiple times that we could 
do this in the attach callback, but attaching should really be something 
done only once.

If you want devices which write to the buffers with the CPU and devices 
which don't snoop the cache to work together on system memory, then DMA Heaps 
are really the way to go.

I'm just going to extend the patch set to the cases we have in amdgpu and 
i915 as well.

Thanks,
Christian.

>
> happy v2,
> Nicolas


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-02 12:21                         ` Christian König
@ 2022-11-04 14:58                           ` Rob Clark
  -1 siblings, 0 replies; 110+ messages in thread
From: Rob Clark @ 2022-11-04 14:58 UTC (permalink / raw)
  To: Christian König
  Cc: Lucas Stach, Nicolas Dufresne, Daniel Stone, ppaalanen,
	sumit.semwal, daniel, dri-devel, linaro-mm-sig, linux-media

On Wed, Nov 2, 2022 at 5:21 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi Lucas,
>
> Am 02.11.22 um 12:39 schrieb Lucas Stach:
> > Hi Christian,
> >
> > going to reply in more detail when I have some more time, so just some
> > quick thoughts for now.
> >
> > Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
> >> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> >>> [SNIP]
> >> As far as I can see it you guys just allocate a buffer from a V4L2
> >> device, fill it with data and send it to Wayland for displaying.
> >>
> >> To be honest I'm really surprised that the Wayland guys hasn't pushed
> >> back on this practice already.
> >>
> >> This only works because the Wayland as well as X display pipeline is
> >> smart enough to insert an extra copy when it find that an imported
> >> buffer can't be used as a framebuffer directly.
> >>
> > With bracketed access you could even make this case work, as the dGPU
> > would be able to slurp a copy of the dma-buf into LMEM for scanout.
>
> Well, this copy is what we are trying to avoid here. The codec should
> pump the data into LMEM in the first place.
>

For the dGPU VRAM case, shouldn't this be amdgpu re-importing its own
dmabuf, which more or less bypasses dma-buf to get back the backing
GEM obj?
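
The shortcut referred to here is roughly the following (a paraphrased 
sketch of the drm_prime self-import path, not the exact upstream code):

/* Paraphrased sketch: if the dma-buf was exported by this very DRM
 * device, skip the dma-buf attachment machinery and reuse the backing
 * GEM object directly. */
static struct drm_gem_object *
try_self_import(struct drm_device *dev, struct dma_buf *dma_buf)
{
        if (dma_buf->ops == &drm_gem_prime_dmabuf_ops) {
                struct drm_gem_object *obj = dma_buf->priv;

                if (obj->dev == dev) {
                        drm_gem_object_get(obj);
                        return obj;
                }
        }
        return NULL;    /* not ours: fall back to a regular import */
}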

BR,
-R

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-04 14:58                           ` Rob Clark
@ 2022-11-04 15:00                             ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-04 15:00 UTC (permalink / raw)
  To: Rob Clark
  Cc: Nicolas Dufresne, linaro-mm-sig, ppaalanen, dri-devel,
	linux-media, sumit.semwal

Am 04.11.22 um 15:58 schrieb Rob Clark:
> On Wed, Nov 2, 2022 at 5:21 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Hi Lucas,
>>
>> Am 02.11.22 um 12:39 schrieb Lucas Stach:
>>> Hi Christian,
>>>
>>> going to reply in more detail when I have some more time, so just some
>>> quick thoughts for now.
>>>
>>> Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
>>>> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
>>>>> [SNIP]
>>>> As far as I can see it you guys just allocate a buffer from a V4L2
>>>> device, fill it with data and send it to Wayland for displaying.
>>>>
>>>> To be honest I'm really surprised that the Wayland guys hasn't pushed
>>>> back on this practice already.
>>>>
>>>> This only works because the Wayland as well as X display pipeline is
>>>> smart enough to insert an extra copy when it find that an imported
>>>> buffer can't be used as a framebuffer directly.
>>>>
>>> With bracketed access you could even make this case work, as the dGPU
>>> would be able to slurp a copy of the dma-buf into LMEM for scanout.
>> Well, this copy is what we are trying to avoid here. The codec should
>> pump the data into LMEM in the first place.
>>
> For the dGPU VRAM case, shouldn't this be amdgpu re-importing it's own
> dmabuf, which more or less bypasses dma-buf to get back the backing
> GEM obj?

When the codec and scanout are on the same device, then yes.

But we already had cases where the codec was on the dGPU and scanout 
happened on the integrated engine.

Regards,
Christian.

>
> BR,
> -R


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-02 19:13                             ` Christian König
@ 2022-11-17  9:35                               ` Tomasz Figa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tomasz Figa @ 2022-11-17  9:35 UTC (permalink / raw)
  To: Christian König
  Cc: Lucas Stach, Nicolas Dufresne, Daniel Stone, ppaalanen,
	sumit.semwal, daniel, robdclark, dri-devel, linaro-mm-sig,
	linux-media

Hi Christian and everyone,

On Thu, Nov 3, 2022 at 4:14 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 02.11.22 um 18:10 schrieb Lucas Stach:
> > Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
> > [SNIP]
> >> It would just be doing this for the importer and exactly that
> >> would be bad design because we then have handling for the display driver
> >> outside of the driver.
> >>
> > The driver would have to do those cache maintenance operations if it
> > directly worked with a non-coherent device. Doing it for the importer
> > is just doing it for another device, not the one directly managed by
> > the exporter.
> >
> > I really don't see the difference to the other dma-buf ops: in
> > dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
> > the address space of the importer. Why would cache maintenance be any
> > different?
>
> The issue here is the explicit ownership transfer.
>
> We intentionally decided against that because it breaks tons of use
> cases and is at least by me and a couple of others seen as generally
> design failure of the Linux DMA-API.

First of all, thanks for starting the discussion and sorry for being
late to the party. May I ask you to keep me on CC for any changes that
touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay
being copied on the entire series, no need to pick the specific
patches. Thanks in advance.

I agree that we have some design issues in the current DMA-buf
framework, but I'd try to approach it a bit differently. Instead of
focusing on the issues in the current design, could we write down our
requirements and try to come up with how a correct design would look
like? (A lot of that has been already mentioned in this thread, but I
find it quite difficult to follow and it might not be a complete view
either.)

That said, let me address a few aspects already mentioned, to make
sure that everyone is on the same page.

>
> DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> be able to directly decided where a certain operation should go. E.g. we
> have cases where for example a P2P write doesn't even go to memory, but
> rather a doorbell BAR to trigger another operation. Throwing in CPU
> round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which
accepts two devices (or an array of them) and tells the caller whether
the devices need explicit ownership transfer. Based on that, your
drivers would install the DMA completion (presumably IRQ) handlers or
not. It's necessary since it's not uncommon that devices A and B could
be in the same coherency domain, while C could be in a different one,
but you may still want them to exchange data through DMA-bufs. Even if
it means some extra round trips, it would likely be more efficient than
a full memory copy (though that might not be true 100% of the time).
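
A minimal sketch of such a helper, purely hypothetical since neither the 
name nor the semantics exist in the kernel today, might start from the 
per-device coherency attribute:

#include <linux/device.h>
#include <linux/dma-map-ops.h>

/* Hypothetical helper as proposed above -- not an existing kernel API.
 * First approximation: no explicit ownership transfer is needed if both
 * devices are DMA coherent with respect to the CPU cache. */
static inline bool dma_dev_is_coherent_with_dev(struct device *a,
                                                struct device *b)
{
        return dev_is_dma_coherent(a) && dev_is_dma_coherent(b);
}

Real topologies would of course need more than this, e.g. devices that
are coherent with each other over a private interconnect but not with
the CPU, which is exactly the "not homogeneous" problem raised earlier.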

>
> Additional to that a very basic concept of DMA-buf is that the exporter
> provides the buffer as it is and just double checks if the importer can
> access it. For example we have XGMI links which makes memory accessible
> to other devices on the same bus, but not to PCIe device and not even to
> the CPU. Otherwise you wouldn't be able to implement things like secure
> decoding where the data isn't even accessible outside the device to
> device link.

Fully agreed.

>
> So if a device driver uses cached system memory on an architecture which
> devices which can't access it the right approach is clearly to reject
> the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

>
> What we can do is to reverse the role of the exporter and importer and
> let the device which needs uncached memory take control. This way this
> device can insert operations as needed, e.g. flush read caches or
> invalidate write caches.
>

(Putting aside the cases when the access is really impossible at all.)
Correct me if I'm wrong, but isn't that because we don't have a proper
hook for the importer to tell the DMA-buf framework to prepare the
buffer for its access?

> This is what we have already done in DMA-buf and what already works
> perfectly fine with use cases which are even more complicated than a
> simple write cache invalidation.
>
> >>>> This is just a software solution which works because of coincident and
> >>>> not because of engineering.
> >>> By mandating a software fallback for the cases where you would need
> >>> bracketed access to the dma-buf, you simply shift the problem into
> >>> userspace. Userspace then creates the bracket by falling back to some
> >>> other import option that mostly do a copy and then the appropriate
> >>> cache maintenance.
> >>>
> >>> While I understand your sentiment about the DMA-API design being
> >>> inconvenient when things are just coherent by system design, the DMA-
> >>> API design wasn't done this way due to bad engineering, but due to the
> >>> fact that performant DMA access on some systems just require this kind
> >>> of bracketing.
> >> Well, this is exactly what I'm criticizing on the DMA-API. Instead of
> >> giving you a proper error code when something won't work in a specific
> >> way it just tries to hide the requirements inside the DMA layer.
> >>
> >> For example when your device can only access 32bits the DMA-API
> >> transparently insert bounce buffers instead of giving you a proper error
> >> code that the memory in question can't be accessed.
> >>
> >> This just tries to hide the underlying problem instead of pushing it
> >> into the upper layer where it can be handled much more gracefully.
> > How would you expect the DMA API to behave on a system where the device
> > driver is operating on cacheable memory, but the device is non-
> > coherent? Telling the driver that this just doesn't work?
>
> Yes, exactly that.
>
> It's the job of the higher level to prepare the buffer a device work
> with, not the one of the lower level.

What are higher and lower levels here?

As per the existing design of the DMA mapping framework, the framework
handles the system DMA architecture details and DMA master drivers
take care of invoking the right DMA mapping operations around the DMA
accesses. This makes sense to me, as DMA master drivers have no idea
about the specific SoCs or buses they're plugged into, while the DMA
mapping framework has no idea when the DMA accesses are taking place.
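
As a reminder of that division of labour, the usual streaming-DMA
pattern in a master driver looks roughly like this (a generic sketch;
the HW kick-off/completion helpers are hypothetical):

#include <linux/dma-mapping.h>

/* Generic sketch: the driver decides *when* DMA happens, the DMA
 * mapping framework decides whether that implies cache maintenance,
 * bounce buffers, IOMMU programming, etc. */
static int do_one_transfer(struct device *dev, void *buf, size_t len)
{
        dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

        if (dma_mapping_error(dev, dma))
                return -ENOMEM;

        start_hw_dma(dev, dma, len);       /* hypothetical HW kick-off */
        wait_for_hw_completion(dev);       /* hypothetical completion wait */

        /* Ownership back to the CPU; caches are invalidated if needed. */
        dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
        return 0;
}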

>
> In other words in a proper design the higher level would prepare the
> memory in a way the device driver can work with it, not the other way
> around.
>
> When a device driver gets memory it can't work with the correct response
> is to throw an error and bubble that up into a layer where it can be
> handled gracefully.
>
> For example instead of using bounce buffers under the hood the DMA layer
> the MM should make sure that when you call read() with O_DIRECT that the
> pages in question are accessible by the device.
>

I tend to agree with you if it's about a costly software "emulation"
like bounce buffers, but cache maintenance is a hardware feature
existing there by default and it's often much cheaper to operate on
cached memory and synchronize the caches rather than have everything
in uncached (or write-combined) memory.

> > It's a use-case that is working fine today with many devices (e.g. network
> > adapters) in the ARM world, exactly because the architecture specific
> > implementation of the DMA API inserts the cache maintenance operations
> > on buffer ownership transfer.
>
> Yeah, I'm perfectly aware of that. The problem is that exactly that
> design totally breaks GPUs on Xen DOM0 for example.
>
> And Xen is just one example, I can certainly say from experience that
> this design was a really really bad idea because it favors just one use
> case while making other use cases practically impossible if not really
> hard to implement.

Sorry, I haven't worked with Xen. Could you elaborate on what problem
this introduces for it?

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-17  9:35                               ` Tomasz Figa
@ 2022-11-17 12:10                                 ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-17 12:10 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Lucas Stach, Nicolas Dufresne, Daniel Stone, ppaalanen,
	sumit.semwal, daniel, robdclark, dri-devel, linaro-mm-sig,
	linux-media

Hi Tomasz,

Am 17.11.22 um 10:35 schrieb Tomasz Figa:
> Hi Christian and everyone,
>
> On Thu, Nov 3, 2022 at 4:14 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 02.11.22 um 18:10 schrieb Lucas Stach:
>>> Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
>>> [SNIP]
>>>> It would just be doing this for the importer and exactly that
>>>> would be bad design because we then have handling for the display driver
>>>> outside of the driver.
>>>>
>>> The driver would have to do those cache maintenance operations if it
>>> directly worked with a non-coherent device. Doing it for the importer
>>> is just doing it for another device, not the one directly managed by
>>> the exporter.
>>>
>>> I really don't see the difference to the other dma-buf ops: in
>>> dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
>>> the address space of the importer. Why would cache maintenance be any
>>> different?
>> The issue here is the explicit ownership transfer.
>>
>> We intentionally decided against that because it breaks tons of use
>> cases and is at least by me and a couple of others seen as generally
>> design failure of the Linux DMA-API.
> First of all, thanks for starting the discussion and sorry for being
> late to the party. May I ask you to keep me on CC for any changes that
> touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay
> being copied on the entire series, no need to pick the specific
> patches. Thanks in advance.

Sorry for that, I've only added people involved in the previous 
discussion. Going to keep you in the loop.

> I agree that we have some design issues in the current DMA-buf
> framework, but I'd try to approach it a bit differently. Instead of
> focusing on the issues in the current design, could we write down our
> requirements and try to come up with how a correct design would look
> like? (A lot of that has been already mentioned in this thread, but I
> find it quite difficult to follow and it might not be a complete view
> either.)

Well, exactly that's what we disagree on.

As far as I can see the current design of DMA-buf is perfectly capable 
of handling all the requirements.

A brief summary of the requirements with some implementation notes:

1. Device drivers export their memory as it is, e.g. there is no special 
handling for importers on the exporter side.
   If an importer can't deal with a specific format, layout, caching 
etc... of the data, the correct approach is to reject the attachment.
   Those parameters are controlled by userspace and negotiating them 
is explicitly not part of the framework.

2. Accesses of the CPU to a buffer are bracketed in begin_cpu_access() 
and end_cpu_access() calls.
   Here we can insert the CPU cache invalidations/flushes as necessary.

3. Accesses of the device HW to a buffer are represented with dma_fence 
objects.
   It's explicitly allowed to have multiple devices access the buffer 
at the same time.

4. Access to the DMA-buf by the HW of an importer is set up by the exporter.
   In other words the exporter provides a bunch of DMA addresses the 
importer should access.
   The importer should not try to come up with those on its own.
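
As an illustration of point 2, a kernel-side importer touching the
buffer with the CPU would bracket its access roughly like this (a
minimal sketch; the actual CPU mapping is elided):

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>

/* Sketch: CPU read access to an imported dma-buf, bracketed so the
 * exporter can flush/invalidate CPU caches as needed. The CPU mapping
 * itself (vmap etc.) is omitted for brevity. */
static int cpu_read_from_dmabuf(struct dma_buf *dmabuf)
{
        int ret = dma_buf_begin_cpu_access(dmabuf, DMA_FROM_DEVICE);

        if (ret)
                return ret;

        /* ... read the buffer contents through a CPU mapping ... */

        return dma_buf_end_cpu_access(dmabuf, DMA_FROM_DEVICE);
}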

> That said, let me address a few aspects already mentioned, to make
> sure that everyone is on the same page.
>
>> DMA-Buf let's the exporter setup the DMA addresses the importer uses to
>> be able to directly decided where a certain operation should go. E.g. we
>> have cases where for example a P2P write doesn't even go to memory, but
>> rather a doorbell BAR to trigger another operation. Throwing in CPU
>> round trips for explicit ownership transfer completely breaks that concept.
> It sounds like we should have a dma_dev_is_coherent_with_dev() which
> accepts two (or an array?) of devices and tells the caller whether the
> devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would 
want as a requirement in the framework, because otherwise we break tons 
of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly 
possible to implement this using the dma_fence objects mentioned above. 
E.g. drivers can already look at who is accessing a buffer currently and 
can even grab explicit ownership of it by adding their own dma_fence 
objects.
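
A rough outline of that pattern against the buffer's reservation object
(treat it as an outline; fence creation, waiting policy and the exact
helpers vary between kernel versions):

#include <linux/dma-resv.h>

/* Outline only: publish our own fence on a buffer's reservation object
 * so other users can serialize against our access. */
static int claim_buffer(struct dma_resv *resv, struct dma_fence *fence)
{
        int ret = dma_resv_lock(resv, NULL);

        if (ret)
                return ret;

        ret = dma_resv_reserve_fences(resv, 1);
        if (!ret)
                dma_resv_add_fence(resv, fence, DMA_RESV_USAGE_WRITE);

        dma_resv_unlock(resv);
        return ret;
}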

The only exception is CPU-based access, e.g. when something is written 
with the CPU a cache flush might be necessary, and when something is read 
with the CPU a cache invalidation might be necessary.

>> So if a device driver uses cached system memory on an architecture which
>> devices which can't access it the right approach is clearly to reject
>> the access.
> I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it 
was allocated.

If that means that the exporter provides a piece of memory which requires 
CPU cache snooping to access correctly, then the best thing we can do is 
to prevent an importer which can't do this from attaching.

We do have the system and CMA dma-buf heaps for the cases where a device 
driver wants to access the buffer with caches enabled. So this is 
not a limitation in functionality, it's just a matter of correctly using it.

>> What we can do is to reverse the role of the exporter and importer and
>> let the device which needs uncached memory take control. This way this
>> device can insert operations as needed, e.g. flush read caches or
>> invalidate write caches.
>>
> (Putting aside the cases when the access is really impossible at all.)
> Correct me if I'm wrong, but isn't that because we don't have a proper
> hook for the importer to tell the DMA-buf framework to prepare the
> buffer for its access?

No, we do have the explicit begin_cpu_access() and end_cpu_access() 
calls, which are even exported to userspace through IOCTLs for this.

The problem is that in this particular case the exporter provides the 
buffer as is, e.g. with dirty CPU caches. And the importer can't deal 
with that.

Either reversing the roles of the importer and exporter or switching over 
to the common DMA-heaps implementation for the buffer would solve that.
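
For reference, the userspace side of that bracketing is the
DMA_BUF_IOCTL_SYNC ioctl; a minimal sketch of a CPU write to an
mmap()ed DMA-buf (error handling trimmed):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/dma-buf.h>

/* Sketch: bracket a CPU write to a mmap()ed DMA-buf so the kernel can
 * perform whatever cache maintenance the exporter requires. */
static int fill_dmabuf(int dmabuf_fd, size_t len, unsigned char value)
{
        struct dma_buf_sync sync = { 0 };
        void *map = mmap(NULL, len, PROT_WRITE, MAP_SHARED, dmabuf_fd, 0);

        if (map == MAP_FAILED)
                return -1;

        sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

        memset(map, value, len);

        sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

        munmap(map, len);
        return 0;
}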

>> It's the job of the higher level to prepare the buffer a device work
>> with, not the one of the lower level.
> What are higher and lower levels here?

The MM as higher level and the DMA mapping framework as lower level.

>
> As per the existing design of the DMA mapping framework, the framework
> handles the system DMA architecture details and DMA master drivers
> take care of invoking the right DMA mapping operations around the DMA
> accesses. This makes sense to me, as DMA master drivers have no idea
> about the specific SoCs or buses they're plugged into, while the DMA
> mapping framework has no idea when the DMA accesses are taking place.

Yeah and exactly that's a bit problematic. In an ideal world device 
drivers wouldn't see memory they can't access in the first place.

For example what we currently do is the following:
1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
2. DMA mapping framework is called by the driver to create an sg-table.
3. The DMA mapping framework sees that this won't work and inserts 
bounce buffers.
4. The driver does the DMA to the bounce buffers instead.
5. The DMA mapping framework copies the data to the real location.

This is highly problematic since it removes control over what happens 
here. E.g. drivers can't prevent the use of bounce buffers when they need 
coherent access for DMA-buf.

What we should have instead is that bounce buffers are applied at a 
higher level, for example by get_user_pages() or more general in the MM.

>> In other words in a proper design the higher level would prepare the
>> memory in a way the device driver can work with it, not the other way
>> around.
>>
>> When a device driver gets memory it can't work with the correct response
>> is to throw an error and bubble that up into a layer where it can be
>> handled gracefully.
>>
>> For example instead of using bounce buffers under the hood the DMA layer
>> the MM should make sure that when you call read() with O_DIRECT that the
>> pages in question are accessible by the device.
>>
> I tend to agree with you if it's about a costly software "emulation"
> like bounce buffers, but cache maintenance is a hardware feature
> existing there by default and it's often much cheaper to operate on
> cached memory and synchronize the caches rather than have everything
> in uncached (or write-combined) memory.

Well that we should have proper cache maintenance is really not the 
question. The question is where to do that?

E.g. in the DMA-mapping framework which as far as I can see should only 
take care of translating addresses.

Or in the higher level (get_user_pages() is just one example of that) which 
says "ok, device X wants to access memory Y, I need to make sure that 
caches are flushed/invalidated".

>>> It's a use-case that is working fine today with many devices (e.g. network
>>> adapters) in the ARM world, exactly because the architecture specific
>>> implementation of the DMA API inserts the cache maintenance operations
>>> on buffer ownership transfer.
>> Yeah, I'm perfectly aware of that. The problem is that exactly that
>> design totally breaks GPUs on Xen DOM0 for example.
>>
>> And Xen is just one example, I can certainly say from experience that
>> this design was a really really bad idea because it favors just one use
>> case while making other use cases practically impossible if not really
>> hard to implement.
> Sorry, I haven't worked with Xen. Could you elaborate what's the
> problem that this introduces for it?

That's a somewhat longer topic. The AMD Xen devs are already working on this 
as far as I know. I can ping internally to see how far they have got with 
sending out the patches to avoid this problem.

Thanks for the input,
Christian.

>
> Best regards,
> Tomasz


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-17 12:10                                 ` Christian König
@ 2022-11-17 15:38                                   ` Nicolas Dufresne
  -1 siblings, 0 replies; 110+ messages in thread
From: Nicolas Dufresne @ 2022-11-17 15:38 UTC (permalink / raw)
  To: Christian König, Tomasz Figa
  Cc: Lucas Stach, Daniel Stone, ppaalanen, sumit.semwal, daniel,
	robdclark, dri-devel, linaro-mm-sig, linux-media

Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> > > DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> > > be able to directly decided where a certain operation should go. E.g. we
> > > have cases where for example a P2P write doesn't even go to memory, but
> > > rather a doorbell BAR to trigger another operation. Throwing in CPU
> > > round trips for explicit ownership transfer completely breaks that
> > > concept.
> > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > accepts two (or an array?) of devices and tells the caller whether the
> > devices need explicit ownership transfer.
> 
> No, exactly that's the concept I'm pushing back on very hard here.
> 
> In other words explicit ownership transfer is not something we would 
> want as requirement in the framework, cause otherwise we break tons of 
> use cases which require concurrent access to the underlying buffer.

I'm not pushing for this solution, but really felt the need to correct you here.
I have quite some experience with ownership transfer mechanisms, as this is how
the GStreamer framework has worked since 2000. Concurrent access is a really
common use case and it is quite well defined in that context. The bracketing
system (in this case called map()/unmap(), with flags stating the usage
intention, like read and write) is combined with the refcount. The basic rules
are simple:

- An object with a refcount higher than 2 is shared, hence read-only
- An object with a refcount of one, mapped for writes, becomes exclusive
- Non-exclusive writes can be done, but that has to be explicit (intentional);
we didn't go as far as Rust in that domain
- Wrappers around these objects can use mechanisms like "copy-on-write" and can
also maintain the state of shadow buffers (e.g. the GL upload slow path) even
with concurrent access.

Just hope it clarify, Rust language works, yet its all based on explicit
ownership transfers. Its not limiting, but it requires a different way of
thinking how data is to be accessed.
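
A minimal sketch of that bracketing with the public GstBuffer API (read
access on a possibly shared buffer, write access only after gaining
exclusive ownership; error handling trimmed):

#include <gst/gst.h>

static void inspect_then_modify(GstBuffer *buf)
{
        GstMapInfo info;

        /* A shared buffer (refcount > 1) may only be mapped for reading. */
        if (gst_buffer_map(buf, &info, GST_MAP_READ)) {
                /* ... read info.data / info.size ... */
                gst_buffer_unmap(buf, &info);
        }

        /* Writing requires exclusive ownership; gst_buffer_make_writable()
         * returns the buffer itself if it is already exclusive, or a copy
         * if it is currently shared. */
        buf = gst_buffer_make_writable(buf);
        if (gst_buffer_map(buf, &info, GST_MAP_WRITE)) {
                /* ... modify info.data ... */
                gst_buffer_unmap(buf, &info);
        }
}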

Nicolas


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-17 15:38                                   ` Nicolas Dufresne
@ 2022-11-18 19:32                                     ` Rob Clark
  -1 siblings, 0 replies; 110+ messages in thread
From: Rob Clark @ 2022-11-18 19:32 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Christian König, Tomasz Figa, Lucas Stach, Daniel Stone,
	ppaalanen, sumit.semwal, daniel, dri-devel, linaro-mm-sig,
	linux-media

On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>
> Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> > > > DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> > > > be able to directly decided where a certain operation should go. E.g. we
> > > > have cases where for example a P2P write doesn't even go to memory, but
> > > > rather a doorbell BAR to trigger another operation. Throwing in CPU
> > > > round trips for explicit ownership transfer completely breaks that
> > > > concept.
> > > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > > accepts two (or an array?) of devices and tells the caller whether the
> > > devices need explicit ownership transfer.
> >
> > No, exactly that's the concept I'm pushing back on very hard here.
> >
> > In other words explicit ownership transfer is not something we would
> > want as requirement in the framework, cause otherwise we break tons of
> > use cases which require concurrent access to the underlying buffer.
>
> I'm not pushing for this solution, but really felt the need to correct you here.
> I have quite some experience with ownership transfer mechanism, as this is how
> GStreamer framework works since 2000. Concurrent access is a really common use
> cases and it is quite well defined in that context. The bracketing system (in
> this case called map() unmap(), with flag stating the usage intention like reads
> and write) is combined the the refcount. The basic rules are simple:

This is all CPU oriented; I think Christian is talking about the case
where the ownership transfer happens without CPU involvement, such as via
the GPU waiting on a fence.
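
A rough sketch of what that device-side handover boils down to with the
in-kernel dma_resv API (a real driver would pipeline this as a scheduler
dependency rather than blocking):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/sched.h>

/* Sketch: before the next device touches the buffer, wait for all fences
 * of the previous device access (reads and writes) on the shared
 * reservation object; no CPU cache maintenance is involved. */
static long wait_for_previous_device_access(struct dma_buf *dmabuf)
{
        return dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_READ,
                                     true, MAX_SCHEDULE_TIMEOUT);
}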

BR,
-R

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-18 19:32                                     ` Rob Clark
@ 2022-11-19 20:35                                       ` Nicolas Dufresne
  -1 siblings, 0 replies; 110+ messages in thread
From: Nicolas Dufresne @ 2022-11-19 20:35 UTC (permalink / raw)
  To: Rob Clark
  Cc: Christian König, Tomasz Figa, Lucas Stach, Daniel Stone,
	ppaalanen, sumit.semwal, daniel, dri-devel, linaro-mm-sig,
	linux-media

Le vendredi 18 novembre 2022 à 11:32 -0800, Rob Clark a écrit :
> On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > 
> > Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> > > > > DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> > > > > be able to directly decided where a certain operation should go. E.g. we
> > > > > have cases where for example a P2P write doesn't even go to memory, but
> > > > > rather a doorbell BAR to trigger another operation. Throwing in CPU
> > > > > round trips for explicit ownership transfer completely breaks that
> > > > > concept.
> > > > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > > > accepts two (or an array?) of devices and tells the caller whether the
> > > > devices need explicit ownership transfer.
> > > 
> > > No, exactly that's the concept I'm pushing back on very hard here.
> > > 
> > > In other words explicit ownership transfer is not something we would
> > > want as requirement in the framework, cause otherwise we break tons of
> > > use cases which require concurrent access to the underlying buffer.
> > 
> > I'm not pushing for this solution, but really felt the need to correct you here.
> > I have quite some experience with ownership transfer mechanism, as this is how
> > GStreamer framework works since 2000. Concurrent access is a really common use
> > cases and it is quite well defined in that context. The bracketing system (in
> > this case called map() unmap(), with flag stating the usage intention like reads
> > and write) is combined the the refcount. The basic rules are simple:
> 
> This is all CPU oriented, I think Christian is talking about the case
> where ownership transfer happens without CPU involvement, such as via
> GPU waiting on a fence

HW fences and proper ownership aren't incompatible at all. Even if no
software is involved during the usage, software still needs to share the
dmabuf (at least once), and that sharing modifies the ownership and can be
made explicit.

p.s. I will agree if someone points out that this is totally off topic

Nicolas
> BR,
> -R


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-18 19:32                                     ` Rob Clark
@ 2022-11-22 14:36                                       ` Daniel Vetter
  -1 siblings, 0 replies; 110+ messages in thread
From: Daniel Vetter @ 2022-11-22 14:36 UTC (permalink / raw)
  To: Rob Clark
  Cc: Nicolas Dufresne, Christian König, Tomasz Figa, Lucas Stach,
	Daniel Stone, ppaalanen, sumit.semwal, daniel, dri-devel,
	linaro-mm-sig, linux-media

On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:
> On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> >
> > Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> > > > > DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> > > > > be able to directly decided where a certain operation should go. E.g. we
> > > > > have cases where for example a P2P write doesn't even go to memory, but
> > > > > rather a doorbell BAR to trigger another operation. Throwing in CPU
> > > > > round trips for explicit ownership transfer completely breaks that
> > > > > concept.
> > > > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > > > accepts two (or an array?) of devices and tells the caller whether the
> > > > devices need explicit ownership transfer.
> > >
> > > No, exactly that's the concept I'm pushing back on very hard here.
> > >
> > > In other words explicit ownership transfer is not something we would
> > > want as requirement in the framework, cause otherwise we break tons of
> > > use cases which require concurrent access to the underlying buffer.
> >
> > I'm not pushing for this solution, but really felt the need to correct you here.
> > I have quite some experience with ownership transfer mechanism, as this is how
> > GStreamer framework works since 2000. Concurrent access is a really common use
> > cases and it is quite well defined in that context. The bracketing system (in
> > this case called map() unmap(), with flag stating the usage intention like reads
> > and write) is combined the the refcount. The basic rules are simple:
> 
> This is all CPU oriented, I think Christian is talking about the case
> where ownership transfer happens without CPU involvement, such as via
> GPU waiting on a fence

Yeah, for pure device2device handover the rule pretty much has to be that
any coherency management that needs to be done must be done from the
device side (flushing device side caches and stuff like that) only, but
under the assumption that _all_ cpu side management has already been done
before the first device access started.

And then map/unmap respectively begin/end_cpu_access can be used for what
they were meant for, with cpu side invalidation/flushing and stuff like
that, while having pretty clear handover/ownership rules and hopefully not
doing any unnecessary flushes. And all that while allowing device access to
be pipelined. Worst case the exporter has to insert some pipelined cache
flushes as pipelined dma_fence work of its own between the device accesses
when moving from one device to the other. That last part sucks a bit right
now, because we don't have any dma_buf_attachment method which does this
syncing without recreating the mapping, but in reality this is solved by
caching mappings in the exporter (well, dma-buf layer) nowadays.
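
A minimal sketch of that CPU-access bracketing from an importer's point of
view, using the existing in-kernel calls (assumes a CPU mapping of the
buffer already exists, e.g. from dma_buf_vmap(); error handling trimmed):

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/string.h>

static int cpu_touch_buffer(struct dma_buf *dmabuf, void *vaddr, size_t size)
{
        int ret;

        /* Hand ownership to the CPU; the exporter invalidates CPU caches
         * here if the device doesn't snoop them. */
        ret = dma_buf_begin_cpu_access(dmabuf, DMA_BIDIRECTIONAL);
        if (ret)
                return ret;

        memset(vaddr, 0, size);        /* ... CPU reads/writes ... */

        /* Hand ownership back to the device; the exporter flushes CPU
         * caches here if needed. */
        return dma_buf_end_cpu_access(dmabuf, DMA_BIDIRECTIONAL);
}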

True concurrent access like vk/compute expects is still a model that
dma-buf needs to support on top, but that's a special case and pretty much
needs hw that supports such concurrent access without explicit handover
and fencing.

Aside from some historical accidents and still a few warts, I do think
dma-buf does support both of these models. Of course in the case of
gpu/drm drivers, userspace must know what's possible and act accordingly,
otherwise you just get to keep all the pieces.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-22 14:36                                       ` Daniel Vetter
@ 2022-11-22 17:33                                         ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-11-22 17:33 UTC (permalink / raw)
  To: Daniel Vetter, Rob Clark
  Cc: Nicolas Dufresne, Tomasz Figa, Lucas Stach, Daniel Stone,
	ppaalanen, sumit.semwal, dri-devel, linaro-mm-sig, linux-media

Am 22.11.22 um 15:36 schrieb Daniel Vetter:
> On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:
>> On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>>> Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
>>>>>> DMA-Buf let's the exporter setup the DMA addresses the importer uses to
>>>>>> be able to directly decided where a certain operation should go. E.g. we
>>>>>> have cases where for example a P2P write doesn't even go to memory, but
>>>>>> rather a doorbell BAR to trigger another operation. Throwing in CPU
>>>>>> round trips for explicit ownership transfer completely breaks that
>>>>>> concept.
>>>>> It sounds like we should have a dma_dev_is_coherent_with_dev() which
>>>>> accepts two (or an array?) of devices and tells the caller whether the
>>>>> devices need explicit ownership transfer.
>>>> No, exactly that's the concept I'm pushing back on very hard here.
>>>>
>>>> In other words explicit ownership transfer is not something we would
>>>> want as requirement in the framework, cause otherwise we break tons of
>>>> use cases which require concurrent access to the underlying buffer.
>>> I'm not pushing for this solution, but really felt the need to correct you here.
>>> I have quite some experience with ownership transfer mechanism, as this is how
>>> GStreamer framework works since 2000. Concurrent access is a really common use
>>> cases and it is quite well defined in that context. The bracketing system (in
>>> this case called map() unmap(), with flag stating the usage intention like reads
>>> and write) is combined the the refcount. The basic rules are simple:
>> This is all CPU oriented, I think Christian is talking about the case
>> where ownership transfer happens without CPU involvement, such as via
>> GPU waiting on a fence
> Yeah for pure device2device handover the rule pretty much has to be that
> any coherency management that needs to be done must be done from the
> device side (flushing device side caches and stuff like that) only. But
> under the assumption that _all_ cpu side management has been done already
> before the first device access started.
>
> And then the map/unmap respectively begin/end_cpu_access can be used what
> it was meant for with cpu side invalidation/flushing and stuff like that,
> while having pretty clear handover/ownership rules and hopefully not doing
> no unecessary flushes. And all that while allowing device acces to be
> pipelined. Worst case the exporter has to insert some pipelined cache
> flushes as a dma_fence pipelined work of its own between the device access
> when moving from one device to the other. That last part sucks a bit right
> now, because we don't have any dma_buf_attachment method which does this
> syncing without recreating the mapping, but in reality this is solved by
> caching mappings in the exporter (well dma-buf layer) nowadays.
>
> True concurrent access like vk/compute expects is still a model that
> dma-buf needs to support on top, but that's a special case and pretty much
> needs hw that supports such concurrent access without explicit handover
> and fencing.
>
> Aside from some historical accidents and still a few warts, I do think
> dma-buf does support both of these models.

We should have come up with dma-heaps earlier and made it clear that
exporting a DMA-buf from a device gives you something device specific
which might or might not work with others.

Apart from that I agree, DMA-buf should be capable of handling this.
The question left is what documentation is missing to make it clear how
things are supposed to work?

Regards,
Christian.

>   Of course in the case of
> gpu/drm drivers, userspace must know whats possible and act accordingly,
> otherwise you just get to keep all the pieces.
> -Daniel


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-22 17:33                                         ` Christian König
@ 2022-11-22 18:26                                           ` Daniel Vetter
  -1 siblings, 0 replies; 110+ messages in thread
From: Daniel Vetter @ 2022-11-22 18:26 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Nicolas Dufresne, Tomasz Figa, Lucas Stach,
	Daniel Stone, ppaalanen, sumit.semwal, dri-devel, linaro-mm-sig,
	linux-media

On Tue, 22 Nov 2022 at 18:34, Christian König <christian.koenig@amd.com> wrote:
> Am 22.11.22 um 15:36 schrieb Daniel Vetter:
> > On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:
> >> On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> >>> Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> >>>>>> DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> >>>>>> be able to directly decided where a certain operation should go. E.g. we
> >>>>>> have cases where for example a P2P write doesn't even go to memory, but
> >>>>>> rather a doorbell BAR to trigger another operation. Throwing in CPU
> >>>>>> round trips for explicit ownership transfer completely breaks that
> >>>>>> concept.
> >>>>> It sounds like we should have a dma_dev_is_coherent_with_dev() which
> >>>>> accepts two (or an array?) of devices and tells the caller whether the
> >>>>> devices need explicit ownership transfer.
> >>>> No, exactly that's the concept I'm pushing back on very hard here.
> >>>>
> >>>> In other words explicit ownership transfer is not something we would
> >>>> want as requirement in the framework, cause otherwise we break tons of
> >>>> use cases which require concurrent access to the underlying buffer.
> >>> I'm not pushing for this solution, but really felt the need to correct you here.
> >>> I have quite some experience with ownership transfer mechanism, as this is how
> >>> GStreamer framework works since 2000. Concurrent access is a really common use
> >>> cases and it is quite well defined in that context. The bracketing system (in
> >>> this case called map() unmap(), with flag stating the usage intention like reads
> >>> and write) is combined the the refcount. The basic rules are simple:
> >> This is all CPU oriented, I think Christian is talking about the case
> >> where ownership transfer happens without CPU involvement, such as via
> >> GPU waiting on a fence
> > Yeah for pure device2device handover the rule pretty much has to be that
> > any coherency management that needs to be done must be done from the
> > device side (flushing device side caches and stuff like that) only. But
> > under the assumption that _all_ cpu side management has been done already
> > before the first device access started.
> >
> > And then the map/unmap respectively begin/end_cpu_access can be used what
> > it was meant for with cpu side invalidation/flushing and stuff like that,
> > while having pretty clear handover/ownership rules and hopefully not doing
> > no unecessary flushes. And all that while allowing device acces to be
> > pipelined. Worst case the exporter has to insert some pipelined cache
> > flushes as a dma_fence pipelined work of its own between the device access
> > when moving from one device to the other. That last part sucks a bit right
> > now, because we don't have any dma_buf_attachment method which does this
> > syncing without recreating the mapping, but in reality this is solved by
> > caching mappings in the exporter (well dma-buf layer) nowadays.
> >
> > True concurrent access like vk/compute expects is still a model that
> > dma-buf needs to support on top, but that's a special case and pretty much
> > needs hw that supports such concurrent access without explicit handover
> > and fencing.
> >
> > Aside from some historical accidents and still a few warts, I do think
> > dma-buf does support both of these models.
>
> We should have come up with dma-heaps earlier and make it clear that
> exporting a DMA-buf from a device gives you something device specific
> which might or might not work with others.

Yeah, but the engineering practicalities made it pretty clear that no one
would rewrite the entire Xorg stack and all the drivers just to make
that happen for prime.

> Apart from that I agree, DMA-buf should be capable of handling this.
> Question left is what documentation is missing to make it clear how
> things are supposed to work?

Given the historical baggage of existing use-cases, I think the only
way out is to look at concrete examples from real-world users that
break, and figure out how to fix them without breaking any of the
existing mess.

One idea might be that we have a per-platform
dma_buf_legacy_coherency_mode(), which tells you what mode (cpu cache
snooping or uncached memory) you need to use to make sure that all
devices agree. On x86 the rule might be that it's cpu cache snooping
by default, but if you have an integrated gpu then everyone needs to
use uncached. That sucks, but at least we could keep the existing mess
going and clean it up. Everyone else would be uncached, except maybe
arm64 servers with pcie connectors. Essentially least common
denominator to make this work. Note that uncached actually means that
device access doesn't snoop; the cpu side you can handle with either
uc/wc mappings or explicit flushing.

Then once we have that we could implement the coherency negotiation
protocol on top as an explicit opt-in, so that you can still use
coherent buffers across two pcie gpus even if you also have an
integrated gpu.
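
Purely to make the idea concrete, a hypothetical sketch (neither the enum
nor dma_buf_legacy_coherency_mode() exists today; dev_is_dma_coherent() is
the only real helper used here):

#include <linux/device.h>
#include <linux/dma-map-ops.h>

/* Hypothetical: platform-wide least common denominator for legacy users. */
enum dma_buf_coherency_mode {
        DMA_BUF_COHERENCY_SNOOPED,      /* device access snoops CPU caches */
        DMA_BUF_COHERENCY_UNCACHED,     /* device access does not snoop */
};

enum dma_buf_coherency_mode dma_buf_legacy_coherency_mode(void);

/* Hypothetical helper: without an explicit opt-in negotiation, fall back
 * to the platform-wide legacy mode so existing userspace keeps working. */
static bool buffer_must_be_uncached(struct device *importer)
{
        return dma_buf_legacy_coherency_mode() == DMA_BUF_COHERENCY_UNCACHED ||
               !dev_is_dma_coherent(importer);
}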

Doing only the new protocol without some means to keep the existing
pile of carefully hacked up assumptions would break things, and we
can't do that. Also I have no idea whether that global legacy device
coherency mode would work. Also we might need more than just
snooped/unsnooped, since depending upon the architecture you might want to
snoop only one transaction type (reads vs writes) instead of both of them:
If writes snoop then cpu reads would never need to invalidate caches
beforehand, but writes would still need to flush (and would give you
faster reads on the device side since those can still bypass
snooping). Some igpu platforms work like that, but I'm not sure
whether there's any other device that would care enough about these
for this to matter. Yes it's a hw mis-design (well I don't like it
personally), they fixed it :-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-22 17:33                                         ` Christian König
@ 2022-11-23  8:33                                           ` Pekka Paalanen
  -1 siblings, 0 replies; 110+ messages in thread
From: Pekka Paalanen @ 2022-11-23  8:33 UTC (permalink / raw)
  To: Christian König
  Cc: linux-media, Tomasz Figa, linaro-mm-sig, dri-devel,
	Nicolas Dufresne, sumit.semwal

On Tue, 22 Nov 2022 18:33:59 +0100
Christian König <christian.koenig@amd.com> wrote:

> We should have come up with dma-heaps earlier and make it clear that 
> exporting a DMA-buf from a device gives you something device specific 
> which might or might not work with others.
> 
> Apart from that I agree, DMA-buf should be capable of handling this. 
> Question left is what documentation is missing to make it clear how 
> things are supposed to work?

Perhaps somewhat related from Daniel Stone that seems to have been
forgotten:
https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.com/

It aimed mostly at userspace, but sounds to me like the coherency stuff
could use a section of its own there?


Thanks,
pq

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-23  8:33                                           ` Pekka Paalanen
@ 2022-11-23 16:33                                             ` Daniel Vetter
  -1 siblings, 0 replies; 110+ messages in thread
From: Daniel Vetter @ 2022-11-23 16:33 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Christian König, Daniel Vetter, Rob Clark, Nicolas Dufresne,
	Tomasz Figa, Lucas Stach, Daniel Stone, sumit.semwal, dri-devel,
	linaro-mm-sig, linux-media

On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:
> On Tue, 22 Nov 2022 18:33:59 +0100
> Christian König <christian.koenig@amd.com> wrote:
> 
> > We should have come up with dma-heaps earlier and make it clear that 
> > exporting a DMA-buf from a device gives you something device specific 
> > which might or might not work with others.
> > 
> > Apart from that I agree, DMA-buf should be capable of handling this. 
> > Question left is what documentation is missing to make it clear how 
> > things are supposed to work?
> 
> Perhaps somewhat related from Daniel Stone that seems to have been
> forgotten:
> https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.com/
> 
> It aimed mostly at userspace, but sounds to me like the coherency stuff
> could use a section of its own there?

Hm yeah it would be great to land that and then eventually extend. Daniel?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-23 16:33                                             ` Daniel Vetter
@ 2022-11-25 16:40                                               ` Nicolas Dufresne
  -1 siblings, 0 replies; 110+ messages in thread
From: Nicolas Dufresne @ 2022-11-25 16:40 UTC (permalink / raw)
  To: Daniel Vetter, Pekka Paalanen
  Cc: Christian König, Rob Clark, Tomasz Figa, Lucas Stach,
	Daniel Stone, sumit.semwal, dri-devel, linaro-mm-sig,
	linux-media

Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :
> On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:
> > On Tue, 22 Nov 2022 18:33:59 +0100
> > Christian König <christian.koenig@amd.com> wrote:
> > 
> > > We should have come up with dma-heaps earlier and make it clear that 
> > > exporting a DMA-buf from a device gives you something device specific 
> > > which might or might not work with others.
> > > 
> > > Apart from that I agree, DMA-buf should be capable of handling this. 
> > > Question left is what documentation is missing to make it clear how 
> > > things are supposed to work?
> > 
> > Perhaps somewhat related from Daniel Stone that seems to have been
> > forgotten:
> > https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.com/
> > 
> > It aimed mostly at userspace, but sounds to me like the coherency stuff
> > could use a section of its own there?
> 
> Hm yeah it would be great to land that and then eventually extend. Daniel?

There are a lot of things documented in this document that have been called
completely wrong user-space behaviour in this thread. But it seems to pre-date
the DMA Heaps. The document also assumes that DMA Heaps completely solve the CMA
vs system memory issue. But it also underlines a very important aspect: that
userland is not aware which one to use. What this document suggests, though,
seems more realistic than what has been said here.

It's overall a great document; it's unfortunate that it only made it onto the
DRM mailing list.

Nicolas

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-25 16:40                                               ` Nicolas Dufresne
@ 2022-11-30 10:30                                                 ` Daniel Vetter
  -1 siblings, 0 replies; 110+ messages in thread
From: Daniel Vetter @ 2022-11-30 10:30 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Daniel Vetter, Pekka Paalanen, Christian König, Rob Clark,
	Tomasz Figa, Lucas Stach, Daniel Stone, sumit.semwal, dri-devel,
	linaro-mm-sig, linux-media

On Fri, Nov 25, 2022 at 11:40:04AM -0500, Nicolas Dufresne wrote:
> Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :
> > On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:
> > > On Tue, 22 Nov 2022 18:33:59 +0100
> > > Christian König <christian.koenig@amd.com> wrote:
> > > 
> > > > We should have come up with dma-heaps earlier and make it clear that 
> > > > exporting a DMA-buf from a device gives you something device specific 
> > > > which might or might not work with others.
> > > > 
> > > > Apart from that I agree, DMA-buf should be capable of handling this. 
> > > > Question left is what documentation is missing to make it clear how 
> > > > things are supposed to work?
> > > 
> > > Perhaps somewhat related from Daniel Stone that seems to have been
> > > forgotten:
> > > https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.com/
> > > 
> > > It aimed mostly at userspace, but sounds to me like the coherency stuff
> > > could use a section of its own there?
> > 
> > Hm yeah it would be great to land that and then eventually extend. Daniel?
> 
> There is a lot of things documented in this document that have been said to be
> completely wrong user-space behaviour in this thread. But it seems to pre-date
> the DMA Heaps. The document also assume that DMA Heaps completely solves the CMA
> vs system memory issue. But it also underline a very important aspect, that
> userland is not aware which one to use. What this document suggest though seems
> more realist then what has been said here.
> 
> Its overall a great document, it unfortunate that it only makes it into the DRM
> mailing list.

The doc is more about documenting the current status quo/best practices,
which is very much not using dma-heaps.

The issue there is that currently userspace has no idea which dma-heap to
use for shared buffers, plus not all allocators are exposed through heaps
to begin with. We had this noted as a todo item (adding some device->heap
sysfs links was the idea); until that's done, all you can do is hardcode
the right heaps for the right usage in userspace, which is what Android
does. Plus Android doesn't have dGPU, so it doesn't need the missing TTM
heap.
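
For reference, the hardcoding in userspace really is just opening a
well-known heap by name; a minimal sketch with the existing uapi from
<linux/dma-heap.h> (the "system" heap path is a convention, not something
discoverable):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/dma-heap.h>

/* Returns a dma-buf fd allocated from the hardcoded "system" heap,
 * or -1 on error. */
static int alloc_from_system_heap(size_t len)
{
        struct dma_heap_allocation_data data;
        int heap_fd, ret;

        heap_fd = open("/dev/dma_heap/system", O_RDONLY | O_CLOEXEC);
        if (heap_fd < 0)
                return -1;

        memset(&data, 0, sizeof(data));
        data.len = len;
        data.fd_flags = O_RDWR | O_CLOEXEC;

        ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
        close(heap_fd);
        return ret < 0 ? -1 : (int)data.fd;
}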

But yeah the long-term aspiration also hasn't changed, because the
dma-heap todo list is also very, very old by now :-/
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-30 10:30                                                 ` Daniel Vetter
@ 2022-12-02 15:27                                                   ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-12-02 15:27 UTC (permalink / raw)
  To: Daniel Vetter, Nicolas Dufresne
  Cc: Pekka Paalanen, Rob Clark, Tomasz Figa, Lucas Stach,
	Daniel Stone, sumit.semwal, dri-devel, linaro-mm-sig,
	linux-media

On 30.11.22 at 11:30, Daniel Vetter wrote:
> On Fri, Nov 25, 2022 at 11:40:04AM -0500, Nicolas Dufresne wrote:
>> Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :
>>> On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:
>>>> On Tue, 22 Nov 2022 18:33:59 +0100
>>>> Christian König <christian.koenig@amd.com> wrote:
>>>>
>>>>> We should have come up with dma-heaps earlier and make it clear that
>>>>> exporting a DMA-buf from a device gives you something device specific
>>>>> which might or might not work with others.
>>>>>
>>>>> Apart from that I agree, DMA-buf should be capable of handling this.
>>>>> Question left is what documentation is missing to make it clear how
>>>>> things are supposed to work?
>>>> Perhaps somewhat related from Daniel Stone that seems to have been
>>>> forgotten:
>>>> https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.com/
>>>>
>>>> It aimed mostly at userspace, but sounds to me like the coherency stuff
>>>> could use a section of its own there?
>>> Hm yeah it would be great to land that and then eventually extend. Daniel?
>> There is a lot of things documented in this document that have been said to be
>> completely wrong user-space behaviour in this thread. But it seems to pre-date
>> the DMA Heaps. The document also assume that DMA Heaps completely solves the CMA
>> vs system memory issue. But it also underline a very important aspect, that
>> userland is not aware which one to use. What this document suggest though seems
>> more realist then what has been said here.
>>
>> Its overall a great document, it unfortunate that it only makes it into the DRM
>> mailing list.
> The doc is more about document the current status quo/best practices,
> which is very much not using dma-heaps.
>
> The issue there is that currently userspace has no idea which dma-heap to
> use for shared buffers, plus not all allocators are exposed through heaps
> to begin with. We had this noted as a todo item (add some device->heap
> sysfs links was the idea), until that's done all you can do is hardcode
> the right heaps for the right usage in userspace, which is what android
> does. Plus android doesn't have dgpu, so doesn't need the missing ttm
> heap.

Huh? Why do you think we should have a TTM heap in the first place?

As far as I can see we need three different ways of allocation:
1. Normal system memory.
2. CMA based.
3. Device specific.

When any of the participating devices needs CMA, then user space must use 
the CMA heap; when any of the participating devices needs device-specific 
memory, then user space must use device-specific memory (from that 
device).
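
To illustrate that rule, a tiny hypothetical userspace helper (the enum
and the per-device "needs" values are assumptions, there is no such API
today):

#include <stddef.h>

enum buf_need { NEED_SYSTEM, NEED_CMA, NEED_DEVICE_SPECIFIC };

/* Pick the allocator satisfying the most restrictive participant. */
static const char *pick_heap(const enum buf_need *needs, int count)
{
	enum buf_need worst = NEED_SYSTEM;
	int i;

	for (i = 0; i < count; i++)
		if (needs[i] > worst)
			worst = needs[i];

	switch (worst) {
	case NEED_CMA:
		return "/dev/dma_heap/reserved"; /* CMA heap, name is platform-specific */
	case NEED_DEVICE_SPECIFIC:
		return NULL; /* allocate through that device's own interface instead */
	default:
		return "/dev/dma_heap/system";
	}
}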

It can still be that two devices can't talk with each other because both 
need the buffer to be allocated from device-specific memory for a 
particular use case, but at least the cases of both non-CPU-cache-coherent 
ARM devices and device-specific memory for scanout should be handled with 
this.

Regards,
Christian.

>
> But yeah the long-term aspiration also hasn't changed, because the
> dma-heap todo list is also very, very old by now :-/
> -Daniel


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-11-17 12:10                                 ` Christian König
@ 2022-12-05  6:41                                   ` Tomasz Figa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tomasz Figa @ 2022-12-05  6:41 UTC (permalink / raw)
  To: Christian König
  Cc: Lucas Stach, Nicolas Dufresne, Daniel Stone, ppaalanen,
	sumit.semwal, daniel, robdclark, dri-devel, linaro-mm-sig,
	linux-media

Hi Christian,

On Thu, Nov 17, 2022 at 9:11 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Hi Tomasz,
>
> Am 17.11.22 um 10:35 schrieb Tomasz Figa:
> > Hi Christian and everyone,
> >
> > On Thu, Nov 3, 2022 at 4:14 AM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 02.11.22 um 18:10 schrieb Lucas Stach:
> >>> Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
> >>> [SNIP]
> >>>> It would just be doing this for the importer and exactly that
> >>>> would be bad design because we then have handling for the display driver
> >>>> outside of the driver.
> >>>>
> >>> The driver would have to do those cache maintenance operations if it
> >>> directly worked with a non-coherent device. Doing it for the importer
> >>> is just doing it for another device, not the one directly managed by
> >>> the exporter.
> >>>
> >>> I really don't see the difference to the other dma-buf ops: in
> >>> dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
> >>> the address space of the importer. Why would cache maintenance be any
> >>> different?
> >> The issue here is the explicit ownership transfer.
> >>
> >> We intentionally decided against that because it breaks tons of use
> >> cases and is at least by me and a couple of others seen as generally
> >> design failure of the Linux DMA-API.
> > First of all, thanks for starting the discussion and sorry for being
> > late to the party. May I ask you to keep me on CC for any changes that
> > touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay
> > being copied on the entire series, no need to pick the specific
> > patches. Thanks in advance.
>
> Sorry for that, I've only added people involved in the previous
> discussion. Going to keep you in the loop.
>

No worries. Thanks.

Sorry for being late with the reply, I had a bit of vacation and then
some catching up last week.

> > I agree that we have some design issues in the current DMA-buf
> > framework, but I'd try to approach it a bit differently. Instead of
> > focusing on the issues in the current design, could we write down our
> > requirements and try to come up with how a correct design would look
> > like? (A lot of that has been already mentioned in this thread, but I
> > find it quite difficult to follow and it might not be a complete view
> > either.)
>
> Well, exactly that's what we disagree on.
>
> As far as I can see the current design of DMA-buf is perfectly capable
> of handling all the requirements.
>
> A brief summary of the requirements with some implementation notes:
>
> 1. Device drivers export their memory as it is. E.g. no special handling
> for importers on the exporter side.
>      If an importer can't deal with a specific format, layout, caching
> etc... of the data the correct approach is to reject the attachment.
>      Those parameters are controlled by userspace and negotiating them
> is explicitly not part of the framework.

Ack. I believe it matches the current implementation of the DMA-buf
framework, although as you mentioned, the swiotlb feature of the DMA
mapping framework kind of violates this.

>
> 2. Accesses of the CPU to a buffer are bracket int begin_cpu_access()
> and end_cpu_access() calls.
>      Here we can insert the CPU cache invalidation/flushes as necessary.

Ack. I think a part of the problem today is that there exist userspace
and kernel code instances that don't insert them and assume that some
magic keeps the cache clean...
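
(For reference, on the userspace side those brackets are the
DMA_BUF_IOCTL_SYNC ioctl from linux/dma-buf.h; a minimal sketch of
inserting them around a CPU write to an mmap()ed DMA-buf:)

#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Bracket a CPU write between begin and end of CPU access. */
static int cpu_fill(int dmabuf_fd, void *map, size_t len)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync))
		return -1;

	memset(map, 0, len);	/* the actual CPU access */

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}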

>
> 3. Accesses of the device HW to a buffer are represented with dma_fence
> objects.
>      It's explicitly allowed to have multiple devices access the buffer
> at the same time.
>

Ack. Again there exists kernel code that doesn't honor or provide DMA
fences (e.g. V4L2).

> 4. Access to the DMA-buf by the HW of an importer is setup by the exporter.
>      In other words the exporter provides a bunch of DMA addresses the
> importer should access.
>      The importer should not try to come up with those on its own.
>
> > That said, let me address a few aspects already mentioned, to make
> > sure that everyone is on the same page.
> >
> >> DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> >> be able to directly decided where a certain operation should go. E.g. we
> >> have cases where for example a P2P write doesn't even go to memory, but
> >> rather a doorbell BAR to trigger another operation. Throwing in CPU
> >> round trips for explicit ownership transfer completely breaks that concept.
> > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > accepts two (or an array?) of devices and tells the caller whether the
> > devices need explicit ownership transfer.
>
> No, exactly that's the concept I'm pushing back on very hard here.
>
> In other words explicit ownership transfer is not something we would
> want as requirement in the framework, cause otherwise we break tons of
> use cases which require concurrent access to the underlying buffer.
>
> When a device driver needs explicit ownership transfer it's perfectly
> possible to implement this using the dma_fence objects mentioned above.
> E.g. drivers can already look at who is accessing a buffer currently and
> can even grab explicit ownership of it by adding their own dma_fence
> objects.
>
> The only exception is CPU based access, e.g. when something is written
> with the CPU a cache flush might be necessary and when something is read
> with the CPU a cache invalidation might be necessary.
>

Okay, that's much clearer now, thanks for clarifying this. So we
should be covered for the cache maintenance needs originating from CPU
accesses already, +/- the broken cases which don't call the begin/end
CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines,
we should be covered either by the userspace explicitly flushing the
hardware pipeline or attaching a DMA-buf fence to the buffer.
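
(Kernel-side, attaching such a fence to the buffer's reservation object
looks roughly like this; a sketch, assuming "fence" comes from the
driver's own job submission:)

#include <linux/dma-buf.h>
#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

static int publish_device_access(struct dma_buf *dmabuf,
				 struct dma_fence *fence)
{
	struct dma_resv *resv = dmabuf->resv;
	int ret;

	ret = dma_resv_lock(resv, NULL);
	if (ret)
		return ret;

	ret = dma_resv_reserve_fences(resv, 1);
	if (!ret)
		dma_resv_add_fence(resv, fence, DMA_RESV_USAGE_WRITE);

	dma_resv_unlock(resv);
	return ret;
}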

But then, what's left to be solved? :) (Besides the cases of missing
begin/end CPU access calls.)

> >> So if a device driver uses cached system memory on an architecture which
> >> devices which can't access it the right approach is clearly to reject
> >> the access.
> > I'd like to accent the fact that "requires cache maintenance" != "can't access".
>
> Well that depends. As said above the exporter exports the buffer as it
> was allocated.
>
> If that means the the exporter provides a piece of memory which requires
> CPU cache snooping to access correctly then the best thing we can do is
> to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice?
Do I assume correctly that it's about sharing a buffer between one DMA
engine that is cache-coherent and another that is non-coherent, where
the first one ends up having its accesses always go through some kind
of a cache (CPU cache, L2/L3/... cache, etc.)?

>
> We do have the system and CMA dma-buf heap for cases where a device
> driver which wants to access the buffer with caches enabled. So this is
> not a limitation in functionality, it's just a matter of correctly using it.
>

V4L2 also has the ability to allocate buffers and map them with caches enabled.

> >> What we can do is to reverse the role of the exporter and importer and
> >> let the device which needs uncached memory take control. This way this
> >> device can insert operations as needed, e.g. flush read caches or
> >> invalidate write caches.
> >>
> > (Putting aside the cases when the access is really impossible at all.)
> > Correct me if I'm wrong, but isn't that because we don't have a proper
> > hook for the importer to tell the DMA-buf framework to prepare the
> > buffer for its access?
>
> No, we do have the explicit begin_cpu_access() and end_cpu_access()
> calls which are even exporter to userspace through IOCTLs for this.

Ack.

>
> The problem is that in this particular case the exporter provides the
> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
> with that.

Why does the exporter leave the buffer with dirty CPU caches?

>
> Either reverting the roles of the importer or exporter or switching over
> to the common DMA-heaps implementation for the buffer would solve that.
>
> >> It's the job of the higher level to prepare the buffer a device work
> >> with, not the one of the lower level.
> > What are higher and lower levels here?
>
> The MM as higher level and the DMA mapping framework as lower level.
>

Hmm, I generally consider the DMA mapping framework a part of MM,
especially its allocation capabilities. It heavily relies on low level
MM primitives internally and exposes another set of low level
primitives that is more useful for device drivers. At least it should
be seen this way, but I agree that it currently includes things that
shouldn't necessarily be there, like the transparent bouncing.

> >
> > As per the existing design of the DMA mapping framework, the framework
> > handles the system DMA architecture details and DMA master drivers
> > take care of invoking the right DMA mapping operations around the DMA
> > accesses. This makes sense to me, as DMA master drivers have no idea
> > about the specific SoCs or buses they're plugged into, while the DMA
> > mapping framework has no idea when the DMA accesses are taking place.
>
> Yeah and exactly that's a bit problematic. In an ideal world device
> drivers wouldn't see memory they can't access in the first place.
>
> For example what we currently do is the following:
> 1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
> 2. DMA mapping framework is called by the driver to create an sg-table.
> 3. The DMA mapping framework sees that this won't work and inserts
> bounce buffers.
> 4. The driver does the DMA to the bounce buffers instead.
> 5. The DMA mapping framework copies the data to the real location.
>
> This is highly problematic since it removes the control of what happens
> here. E.g. drivers can't prevent using bounce buffers when they need
> coherent access for DMA-buf for example.
>
> What we should have instead is that bounce buffers are applied at a
> higher level, for example by get_user_pages() or more general in the MM.
>

I tend to agree with you on this, but I see a lot of challenges that
would need to be solved if we want to make it otherwise. The whole
reason for transparent bouncing came from the fact that many existing
subsystems didn't really care about the needs of the underlying
hardware, e.g. block or network subsystems would just pass random
pages to the drivers. I think the idea of DMA mapping handling this
internally was to avoid implementing the bouncing here and there in
each block or network driver that needs it. (Arguably, an optional
helper could be provided instead for use in those subsystems...)

> >> In other words in a proper design the higher level would prepare the
> >> memory in a way the device driver can work with it, not the other way
> >> around.
> >>
> >> When a device driver gets memory it can't work with the correct response
> >> is to throw an error and bubble that up into a layer where it can be
> >> handled gracefully.
> >>
> >> For example instead of using bounce buffers under the hood the DMA layer
> >> the MM should make sure that when you call read() with O_DIRECT that the
> >> pages in question are accessible by the device.
> >>
> > I tend to agree with you if it's about a costly software "emulation"
> > like bounce buffers, but cache maintenance is a hardware feature
> > existing there by default and it's often much cheaper to operate on
> > cached memory and synchronize the caches rather than have everything
> > in uncached (or write-combined) memory.
>
> Well that we should have proper cache maintenance is really not the
> question. The question is where to do that?
>
> E.g. in the DMA-mapping framework which as far as I can see should only
> take care of translating addresses.

The DMA mapping framework is actually a DMA allocation and mapping
framework. Generic memory allocation primitives (like alloc_pages(),
kmalloc()) shouldn't be used for buffers that are expected to be
passed to DMA engines - there exist DMA-aware replacements, like
dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
etc.
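
(A minimal driver-side sketch just to make the distinction concrete;
"dev" is assumed to be the struct device that will do the DMA:)

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Allocate pages the device can DMA to without bouncing, plus a
 * cacheable CPU mapping of the same pages where the platform allows it.
 */
static void *alloc_dma_buffer(struct device *dev, size_t size,
			      struct sg_table **sgt)
{
	*sgt = dma_alloc_noncontiguous(dev, size, DMA_BIDIRECTIONAL,
				       GFP_KERNEL, 0);
	if (!*sgt)
		return NULL;

	return dma_vmap_noncontiguous(dev, size, *sgt);
}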

>
> Or the higher level (get_user_pages() is just one example of that) which
> says ok device X wants to access memory Y I need to make sure that
> caches are flushed/invalidated.

Okay, so here comes the userptr case, which I really feel like is just
doomed at the very start, because it does nothing to accommodate
hardware needs at allocation time and then just expects some magic to
happen to make it possible for the hardware to make use of the buffer.

That said, it should still be possible to handle that pretty well for
hardware that allows it, i.e. without many memory access constraints.
What would still be missing if we just used the existing gup() +
dma_map() + dma_sync() sequence?
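
(For concreteness, the sequence I mean is roughly the following; a
sketch, not taken from any particular driver, with error unwinding
trimmed:)

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

static int map_userptr(struct device *dev, unsigned long uaddr, size_t size,
		       struct page **pages, struct sg_table *sgt)
{
	unsigned int n = DIV_ROUND_UP(size + offset_in_page(uaddr), PAGE_SIZE);
	int ret;

	/* gup(): pin the user pages backing the buffer */
	ret = pin_user_pages_fast(uaddr, n, FOLL_WRITE, pages);
	if (ret != n)
		return -EFAULT;

	ret = sg_alloc_table_from_pages(sgt, pages, n, offset_in_page(uaddr),
					size, GFP_KERNEL);
	if (ret)
		return ret;

	/* dma_map(): creates the device mapping and syncs for the device */
	ret = dma_map_sgtable(dev, sgt, DMA_BIDIRECTIONAL, 0);
	if (ret)
		return ret;

	/* ... run the DMA; dma_sync() again before the CPU reads the data: */
	dma_sync_sgtable_for_cpu(dev, sgt, DMA_BIDIRECTIONAL);
	return 0;
}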

>
> >>> It's a use-case that is working fine today with many devices (e.g. network
> >>> adapters) in the ARM world, exactly because the architecture specific
> >>> implementation of the DMA API inserts the cache maintenance operations
> >>> on buffer ownership transfer.
> >> Yeah, I'm perfectly aware of that. The problem is that exactly that
> >> design totally breaks GPUs on Xen DOM0 for example.
> >>
> >> And Xen is just one example, I can certainly say from experience that
> >> this design was a really really bad idea because it favors just one use
> >> case while making other use cases practically impossible if not really
> >> hard to implement.
> > Sorry, I haven't worked with Xen. Could you elaborate what's the
> > problem that this introduces for it?
>
> That's a bit longer topic. The AMD XEN devs are already working on this
> as far as I know. I can ping internally how far they got with sending
> the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which
direction we should be going, but I guess we can wait until they send
some patches.

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-05  6:41                                   ` Tomasz Figa
@ 2022-12-05  8:28                                     ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-12-05  8:28 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Lucas Stach, Nicolas Dufresne, Daniel Stone, ppaalanen,
	sumit.semwal, daniel, robdclark, dri-devel, linaro-mm-sig,
	linux-media

Hi Tomasz,

On 05.12.22 at 07:41, Tomasz Figa wrote:
> [SNIP]
>> In other words explicit ownership transfer is not something we would
>> want as requirement in the framework, cause otherwise we break tons of
>> use cases which require concurrent access to the underlying buffer.
>>
>> When a device driver needs explicit ownership transfer it's perfectly
>> possible to implement this using the dma_fence objects mentioned above.
>> E.g. drivers can already look at who is accessing a buffer currently and
>> can even grab explicit ownership of it by adding their own dma_fence
>> objects.
>>
>> The only exception is CPU based access, e.g. when something is written
>> with the CPU a cache flush might be necessary and when something is read
>> with the CPU a cache invalidation might be necessary.
>>
> Okay, that's much clearer now, thanks for clarifying this. So we
> should be covered for the cache maintenance needs originating from CPU
> accesses already, +/- the broken cases which don't call the begin/end
> CPU access routines that I mentioned above.
>
> Similarly, for any ownership transfer between different DMA engines,
> we should be covered either by the userspace explicitly flushing the
> hardware pipeline or attaching a DMA-buf fence to the buffer.
>
> But then, what's left to be solved? :) (Besides the cases of missing
> begin/end CPU access calls.)

Well there are multiple problems here:

1. A lot of userspace applications/frameworks assume that they can 
allocate the buffer anywhere and it just works.

This isn't true at all, we have tons of cases where devices can only 
access their special memory for certain use cases.
Just look at scanout for display on a dGPU: neither AMD nor NVidia 
supports system memory here. Similar cases exist for audio/video codecs 
where intermediate memory is only accessible by certain devices because 
of content protection.

2. We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-heaps, userspace can currently only 
guess whether normal, CMA or even device-specific memory is needed.

3. We seem to lack some essential parts of those restrictions in the 
documentation.

>>>> So if a device driver uses cached system memory on an architecture which
>>>> devices which can't access it the right approach is clearly to reject
>>>> the access.
>>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
>> Well that depends. As said above the exporter exports the buffer as it
>> was allocated.
>>
>> If that means the the exporter provides a piece of memory which requires
>> CPU cache snooping to access correctly then the best thing we can do is
>> to prevent an importer which can't do this from attaching.
> Could you elaborate more about this case? Does it exist in practice?
> Do I assume correctly that it's about sharing a buffer between one DMA
> engine that is cache-coherent and another that is non-coherent, where
> the first one ends up having its accesses always go through some kind
> of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one 
device driver wrote to its internal buffer with the CPU (so some cache 
lines were dirty) and then a device which couldn't deal with that tried 
to access it.

We could say that all device drivers must always look at the coherency 
of the devices which want to access their buffers. But that would 
horribly complicate maintaining the drivers, because then drivers would 
need to take into account requirements from other drivers while 
allocating their internal buffers.

That's essentially the reason why we have DMA-buf heaps. Those heaps 
expose system memory buffers with certain properties (size, CMA, DMA bit 
restrictions etc...) and also implement the callback functions for CPU 
cache maintenance.
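
(For illustration, the CPU cache maintenance side of such a heap looks
roughly like the following; a sketch modeled loosely on the system heap,
with the private structures being assumptions:)

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/list.h>
#include <linux/mutex.h>

struct my_heap_attachment {
	struct device *dev;
	struct sg_table *sgt;
	struct list_head list;
};

struct my_heap_buffer {
	struct mutex lock;
	struct list_head attachments;
};

/* dma_buf_ops.begin_cpu_access: make device writes visible to the CPU. */
static int my_heap_begin_cpu_access(struct dma_buf *dmabuf,
				    enum dma_data_direction dir)
{
	struct my_heap_buffer *buf = dmabuf->priv;
	struct my_heap_attachment *a;

	mutex_lock(&buf->lock);
	list_for_each_entry(a, &buf->attachments, list)
		dma_sync_sgtable_for_cpu(a->dev, a->sgt, dir);
	mutex_unlock(&buf->lock);

	return 0;
}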

>> We do have the system and CMA dma-buf heap for cases where a device
>> driver which wants to access the buffer with caches enabled. So this is
>> not a limitation in functionality, it's just a matter of correctly using it.
>>
> V4L2 also has the ability to allocate buffers and map them with caches enabled.

Yeah, when that's a requirement for the V4L2 device it also makes 
perfect sense.

It's just that we shouldn't force any specific allocation behavior on a 
device driver because of the requirements of a different device.

To give an example, a V4L2 device shouldn't be forced to use 
videobuf2-dma-contig just because some other device needs CMA. Instead we 
should use the common DMA-buf heaps, which implement this as a neutral 
supplier of system memory buffers.

>> The problem is that in this particular case the exporter provides the
>> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
>> with that.
> Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use them. And in 
this case that means writing to them with the CPU.

>> Either reverting the roles of the importer or exporter or switching over
>> to the common DMA-heaps implementation for the buffer would solve that.
>>
>>>> It's the job of the higher level to prepare the buffer a device work
>>>> with, not the one of the lower level.
>>> What are higher and lower levels here?
>> The MM as higher level and the DMA mapping framework as lower level.
>>
> Hmm, I generally consider the DMA mapping framework a part of MM,
> especially its allocation capabilities. It heavily relies on low level
> MM primitives internally and exposes another set of low level
> primitives that is more useful for device drivers. At least it should
> be seen this way, but I agree that it currently includes things that
> shouldn't necessarily be there, like the transparent bouncing.

Exactly that, yes! Good to hear that more people think that way.

Christoph made some comments which sounded like he agreed to some of the 
points as well, but nobody ever said it so clearly.

>>> As per the existing design of the DMA mapping framework, the framework
>>> handles the system DMA architecture details and DMA master drivers
>>> take care of invoking the right DMA mapping operations around the DMA
>>> accesses. This makes sense to me, as DMA master drivers have no idea
>>> about the specific SoCs or buses they're plugged into, while the DMA
>>> mapping framework has no idea when the DMA accesses are taking place.
>> Yeah and exactly that's a bit problematic. In an ideal world device
>> drivers wouldn't see memory they can't access in the first place.
>>
>> For example what we currently do is the following:
>> 1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
>> 2. DMA mapping framework is called by the driver to create an sg-table.
>> 3. The DMA mapping framework sees that this won't work and inserts
>> bounce buffers.
>> 4. The driver does the DMA to the bounce buffers instead.
>> 5. The DMA mapping framework copies the data to the real location.
>>
>> This is highly problematic since it removes the control of what happens
>> here. E.g. drivers can't prevent using bounce buffers when they need
>> coherent access for DMA-buf for example.
>>
>> What we should have instead is that bounce buffers are applied at a
>> higher level, for example by get_user_pages() or more general in the MM.
>>
> I tend to agree with you on this, but I see a lot of challenges that
> would need to be solved if we want to make it otherwise. The whole
> reason for transparent bouncing came from the fact that many existing
> subsystems didn't really care about the needs of the underlying
> hardware, e.g. block or network subsystems would just pass random
> pages to the drivers. I think the idea of DMA mapping handling this
> internally was to avoid implementing the bouncing here and there in
> each block or network driver that needs it. (Arguably, an optional
> helper could be provided instead for use in those subsystems...)

Yes, totally agree. The problem is really that we moved a bunch of MM and 
DMA functions into one API.

The bounce buffers are something we should really have in a separate 
component.

Then the functionality of allocating system memory for a specific device 
or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a 
device (IOMMU programming etc.) should then be part of a DMA API.

>>>> In other words in a proper design the higher level would prepare the
>>>> memory in a way the device driver can work with it, not the other way
>>>> around.
>>>>
>>>> When a device driver gets memory it can't work with the correct response
>>>> is to throw an error and bubble that up into a layer where it can be
>>>> handled gracefully.
>>>>
>>>> For example instead of using bounce buffers under the hood the DMA layer
>>>> the MM should make sure that when you call read() with O_DIRECT that the
>>>> pages in question are accessible by the device.
>>>>
>>> I tend to agree with you if it's about a costly software "emulation"
>>> like bounce buffers, but cache maintenance is a hardware feature
>>> existing there by default and it's often much cheaper to operate on
>>> cached memory and synchronize the caches rather than have everything
>>> in uncached (or write-combined) memory.
>> Well that we should have proper cache maintenance is really not the
>> question. The question is where to do that?
>>
>> E.g. in the DMA-mapping framework which as far as I can see should only
>> take care of translating addresses.
> The DMA mapping framework is actually a DMA allocation and mapping
> framework. Generic memory allocation primitives (like alloc_pages(),
> kmalloc()) shouldn't be used for buffers that are expected to be
> passed to DMA engines - there exist DMA-aware replacements, like
> dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
> etc.
>
>> Or the higher level (get_user_pages() is just one example of that) which
>> says ok device X wants to access memory Y I need to make sure that
>> caches are flushed/invalidated.
> Okay, so here comes the userptr case, which I really feel like is just
> doomed at the very start, because it does nothing to accommodate
> hardware needs at allocation time and then just expects some magic to
> happen to make it possible for the hardware to make use of the buffer.
>
> That said, it should be still possible to handle that pretty well for
> hardware that allows it, i.e. without much memory access constraints.
> What would be still missing if we just use the existing gup() +
> dma_map() + dma_sync() sequence?

Error or rather fallback handling, and *not* transparently inserting 
bounce buffers.

The whole implicit bounce buffer concept falls apart as soon as you 
don't have streaming access any more.

>>>>> It's a use-case that is working fine today with many devices (e.g. network
>>>>> adapters) in the ARM world, exactly because the architecture specific
>>>>> implementation of the DMA API inserts the cache maintenance operations
>>>>> on buffer ownership transfer.
>>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
>>>> design totally breaks GPUs on Xen DOM0 for example.
>>>>
>>>> And Xen is just one example, I can certainly say from experience that
>>>> this design was a really really bad idea because it favors just one use
>>>> case while making other use cases practically impossible if not really
>>>> hard to implement.
>>> Sorry, I haven't worked with Xen. Could you elaborate what's the
>>> problem that this introduces for it?
>> That's a bit longer topic. The AMD XEN devs are already working on this
>> as far as I know. I can ping internally how far they got with sending
>> the patches out to avoid this problem.
> Hmm, I see. It might be a good data point to understand in which
> direction we should be going, but I guess we can wait until they send
> some patches.

There was just recently a longer thread on the amd-gfx mailing list 
about that. I think looking in there as well might be beneficial.

Regards,
Christian.

>
> Best regards,
> Tomasz


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
@ 2022-12-05  8:28                                     ` Christian König
  0 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-12-05  8:28 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Nicolas Dufresne, linaro-mm-sig, ppaalanen, dri-devel,
	linux-media, sumit.semwal

Hi Tomasz,

On 05.12.22 at 07:41, Tomasz Figa wrote:
> [SNIP]
>> In other words explicit ownership transfer is not something we would
>> want as requirement in the framework, cause otherwise we break tons of
>> use cases which require concurrent access to the underlying buffer.
>>
>> When a device driver needs explicit ownership transfer it's perfectly
>> possible to implement this using the dma_fence objects mentioned above.
>> E.g. drivers can already look at who is accessing a buffer currently and
>> can even grab explicit ownership of it by adding their own dma_fence
>> objects.
>>
>> The only exception is CPU based access, e.g. when something is written
>> with the CPU a cache flush might be necessary and when something is read
>> with the CPU a cache invalidation might be necessary.
>>
> Okay, that's much clearer now, thanks for clarifying this. So we
> should be covered for the cache maintenance needs originating from CPU
> accesses already, +/- the broken cases which don't call the begin/end
> CPU access routines that I mentioned above.
>
> Similarly, for any ownership transfer between different DMA engines,
> we should be covered either by the userspace explicitly flushing the
> hardware pipeline or attaching a DMA-buf fence to the buffer.
>
> But then, what's left to be solved? :) (Besides the cases of missing
> begin/end CPU access calls.)

Well there are multiple problems here:

1. A lot of userspace applications/frameworks assume that they can 
allocate the buffer anywhere and it just works.

This isn't true at all, we have tons of cases where devices can only 
access their special memory for certain use cases.
Just look at scanout for display on a dGPU: neither AMD nor NVidia 
supports system memory here. Similar cases exist for audio/video codecs 
where intermediate memory is only accessible by certain devices because 
of content protection.

2. We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-heaps, userspace can currently only 
guess whether normal, CMA or even device-specific memory is needed.

3. We seem to lack some essential parts of those restrictions in the 
documentation.

>>>> So if a device driver uses cached system memory on an architecture which
>>>> devices which can't access it the right approach is clearly to reject
>>>> the access.
>>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
>> Well that depends. As said above the exporter exports the buffer as it
>> was allocated.
>>
>> If that means the the exporter provides a piece of memory which requires
>> CPU cache snooping to access correctly then the best thing we can do is
>> to prevent an importer which can't do this from attaching.
> Could you elaborate more about this case? Does it exist in practice?
> Do I assume correctly that it's about sharing a buffer between one DMA
> engine that is cache-coherent and another that is non-coherent, where
> the first one ends up having its accesses always go through some kind
> of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one 
device driver wrote to its internal buffer with the CPU (so some cache 
lines were dirty) and then a device which couldn't deal with that tried 
to access it.

We could say that all device drivers must always look at the coherency 
of the devices which want to access their buffers. But that would 
horribly complicate maintaining the drivers, because then drivers would 
need to take into account requirements from other drivers while 
allocating their internal buffers.

That's essentially the reason why we have DMA-buf heaps. Those heaps 
expose system memory buffers with certain properties (size, CMA, DMA bit 
restrictions etc...) and also implement the callback functions for CPU 
cache maintenance.

>> We do have the system and CMA dma-buf heap for cases where a device
>> driver which wants to access the buffer with caches enabled. So this is
>> not a limitation in functionality, it's just a matter of correctly using it.
>>
> V4L2 also has the ability to allocate buffers and map them with caches enabled.

Yeah, when that's a requirement for the V4L2 device it also makes 
perfect sense.

It's just that we shouldn't force any specific allocation behavior on a 
device driver because of the requirements of a different device.

To give an example, a V4L2 device shouldn't be forced to use 
videobuf2-dma-contig just because some other device needs CMA. Instead we 
should use the common DMA-buf heaps, which implement this as a neutral 
supplier of system memory buffers.

>> The problem is that in this particular case the exporter provides the
>> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
>> with that.
> Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use them. And in 
this case that means writing to them with the CPU.

>> Either reverting the roles of the importer or exporter or switching over
>> to the common DMA-heaps implementation for the buffer would solve that.
>>
>>>> It's the job of the higher level to prepare the buffer a device work
>>>> with, not the one of the lower level.
>>> What are higher and lower levels here?
>> The MM as higher level and the DMA mapping framework as lower level.
>>
> Hmm, I generally consider the DMA mapping framework a part of MM,
> especially its allocation capabilities. It heavily relies on low level
> MM primitives internally and exposes another set of low level
> primitives that is more useful for device drivers. At least it should
> be seen this way, but I agree that it currently includes things that
> shouldn't necessarily be there, like the transparent bouncing.

Exactly that, yes! Good to hear that more people think that way.

Christoph made some comments which sounded like he agreed to some of the 
points as well, but nobody ever said it so clearly.

>>> As per the existing design of the DMA mapping framework, the framework
>>> handles the system DMA architecture details and DMA master drivers
>>> take care of invoking the right DMA mapping operations around the DMA
>>> accesses. This makes sense to me, as DMA master drivers have no idea
>>> about the specific SoCs or buses they're plugged into, while the DMA
>>> mapping framework has no idea when the DMA accesses are taking place.
>> Yeah and exactly that's a bit problematic. In an ideal world device
>> drivers wouldn't see memory they can't access in the first place.
>>
>> For example what we currently do is the following:
>> 1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
>> 2. DMA mapping framework is called by the driver to create an sg-table.
>> 3. The DMA mapping framework sees that this won't work and inserts
>> bounce buffers.
>> 4. The driver does the DMA to the bounce buffers instead.
>> 5. The DMA mapping framework copies the data to the real location.
>>
>> This is highly problematic since it removes the control of what happens
>> here. E.g. drivers can't prevent using bounce buffers when they need
>> coherent access for DMA-buf for example.
>>
>> What we should have instead is that bounce buffers are applied at a
>> higher level, for example by get_user_pages() or more general in the MM.
>>
> I tend to agree with you on this, but I see a lot of challenges that
> would need to be solved if we want to make it otherwise. The whole
> reason for transparent bouncing came from the fact that many existing
> subsystems didn't really care about the needs of the underlying
> hardware, e.g. block or network subsystems would just pass random
> pages to the drivers. I think the idea of DMA mapping handling this
> internally was to avoid implementing the bouncing here and there in
> each block or network driver that needs it. (Arguably, an optional
> helper could be provided instead for use in those subsystems...)

Yes, totally agree. The problem is really that we moved a bunch of MM and 
DMA functions into one API.

The bounce buffers are something we should really have in a separate 
component.

Then the functionality of allocating system memory for a specific device 
or devices should be something provided by the MM.

And finally, making this memory or any other CPU address accessible to a 
device (IOMMU programming etc.) should then be part of a DMA API.

>>>> In other words in a proper design the higher level would prepare the
>>>> memory in a way the device driver can work with it, not the other way
>>>> around.
>>>>
>>>> When a device driver gets memory it can't work with the correct response
>>>> is to throw an error and bubble that up into a layer where it can be
>>>> handled gracefully.
>>>>
>>>> For example instead of using bounce buffers under the hood the DMA layer
>>>> the MM should make sure that when you call read() with O_DIRECT that the
>>>> pages in question are accessible by the device.
>>>>
>>> I tend to agree with you if it's about a costly software "emulation"
>>> like bounce buffers, but cache maintenance is a hardware feature
>>> existing there by default and it's often much cheaper to operate on
>>> cached memory and synchronize the caches rather than have everything
>>> in uncached (or write-combined) memory.
>> Well that we should have proper cache maintenance is really not the
>> question. The question is where to do that?
>>
>> E.g. in the DMA-mapping framework which as far as I can see should only
>> take care of translating addresses.
> The DMA mapping framework is actually a DMA allocation and mapping
> framework. Generic memory allocation primitives (like alloc_pages(),
> kmalloc()) shouldn't be used for buffers that are expected to be
> passed to DMA engines - there exist DMA-aware replacements, like
> dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
> etc.
>
>> Or the higher level (get_user_pages() is just one example of that) which
>> says ok device X wants to access memory Y I need to make sure that
>> caches are flushed/invalidated.
> Okay, so here comes the userptr case, which I really feel like is just
> doomed at the very start, because it does nothing to accommodate
> hardware needs at allocation time and then just expects some magic to
> happen to make it possible for the hardware to make use of the buffer.
>
> That said, it should be still possible to handle that pretty well for
> hardware that allows it, i.e. without much memory access constraints.
> What would be still missing if we just use the existing gup() +
> dma_map() + dma_sync() sequence?

Error or rather fallback handling, and *not* transparently inserting 
bounce buffers.

The whole implicit bounce buffer concept falls apart as soon as you 
don't have streaming access any more.
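
To make the sequence concrete, here is a stripped-down sketch of that 
gup() + dma_map() path as a driver would write it today (hypothetical 
helper, unpinning and most error handling omitted). The point is that if 
the device can't reach the pinned pages, the DMA core may quietly switch 
to swiotlb bounce buffers inside dma_map_sgtable() and the caller has no 
way to notice or veto that:

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

static int map_user_buffer(struct device *dev, unsigned long uaddr,
                           int nr_pages, struct page **pages,
                           struct sg_table *sgt)
{
        int ret;

        /* 1. Pin the user pages behind uaddr. */
        ret = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
        if (ret != nr_pages)
                return -EFAULT;

        /* 2. Build an sg-table and map it for the device.  Any transparent
         *    bouncing happens inside dma_map_sgtable(). */
        ret = sg_alloc_table_from_pages(sgt, pages, nr_pages, 0,
                                        (unsigned long)nr_pages * PAGE_SIZE,
                                        GFP_KERNEL);
        if (ret)
                return ret;

        return dma_map_sgtable(dev, sgt, DMA_BIDIRECTIONAL, 0);
}

The dma_sync_sgtable_for_device()/dma_sync_sgtable_for_cpu() calls around 
each transfer then do the cache maintenance (or the bounce copies), which 
is fine for streaming DMA but is exactly what a long-lived DMA-buf mapping 
with concurrent access cannot rely on.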

>>>>> It's a use-case that is working fine today with many devices (e.g. network
>>>>> adapters) in the ARM world, exactly because the architecture specific
>>>>> implementation of the DMA API inserts the cache maintenance operations
>>>>> on buffer ownership transfer.
>>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
>>>> design totally breaks GPUs on Xen DOM0 for example.
>>>>
>>>> And Xen is just one example, I can certainly say from experience that
>>>> this design was a really really bad idea because it favors just one use
>>>> case while making other use cases practically impossible if not really
>>>> hard to implement.
>>> Sorry, I haven't worked with Xen. Could you elaborate what's the
>>> problem that this introduces for it?
>> That's a bit longer topic. The AMD XEN devs are already working on this
>> as far as I know. I can ping internally how far they got with sending
>> the patches out to avoid this problem.
> Hmm, I see. It might be a good data point to understand in which
> direction we should be going, but I guess we can wait until they send
> some patches.

There was just recently a longer thread on the amd-gfx mailing list 
about that. I think looking in there as well might be beneficial.

Regards,
Christian.

>
> Best regards,
> Tomasz


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-05  8:28                                     ` Christian König
@ 2022-12-06 18:26                                       ` Nicolas Dufresne
  -1 siblings, 0 replies; 110+ messages in thread
From: Nicolas Dufresne @ 2022-12-06 18:26 UTC (permalink / raw)
  To: Christian König, Tomasz Figa
  Cc: linaro-mm-sig, ppaalanen, dri-devel, linux-media, sumit.semwal

Le lundi 05 décembre 2022 à 09:28 +0100, Christian König a écrit :
> Hi Tomasz,
> 
> Am 05.12.22 um 07:41 schrieb Tomasz Figa:
> > [SNIP]
> > > In other words explicit ownership transfer is not something we would
> > > want as requirement in the framework, cause otherwise we break tons of
> > > use cases which require concurrent access to the underlying buffer.
> > > 
> > > When a device driver needs explicit ownership transfer it's perfectly
> > > possible to implement this using the dma_fence objects mentioned above.
> > > E.g. drivers can already look at who is accessing a buffer currently and
> > > can even grab explicit ownership of it by adding their own dma_fence
> > > objects.
> > > 
> > > The only exception is CPU based access, e.g. when something is written
> > > with the CPU a cache flush might be necessary and when something is read
> > > with the CPU a cache invalidation might be necessary.
> > > 
> > Okay, that's much clearer now, thanks for clarifying this. So we
> > should be covered for the cache maintenance needs originating from CPU
> > accesses already, +/- the broken cases which don't call the begin/end
> > CPU access routines that I mentioned above.
> > 
> > Similarly, for any ownership transfer between different DMA engines,
> > we should be covered either by the userspace explicitly flushing the
> > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > 
> > But then, what's left to be solved? :) (Besides the cases of missing
> > begin/end CPU access calls.)
> 
> Well there are multiple problems here:
> 
> 1. A lot of userspace applications/frameworks assume that it can 
> allocate the buffer anywhere and it just works.

I know you have said that about 10 times, perhaps I'm about to believe it, but
why do you think userspace assumes this? Did you actually read code that does
this (that isn't meant to run on a controlled environment)? And can you provide
some example of broken generic userspace? The DMABuf flow is meant to be trial
and error. At least in GStreamer, yes, mostly only device allocation (when
generically usable) is implemented, but the code that has been contributed will
try and fall back as documented. It still fails sometimes, but that's exactly the
kind of kernel bug your patchset is trying to address. I don't blame anyone
here, since why would folks on GStreamer/FFMPEG or any other "generic media
framework" spend so much time implementing "per Linux device code", when non-
embedded (constrained) Linux has just a handful of users (compared to Windows,
Android and iOS users).
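
Just to spell out what "trial and error" means in practice on the V4L2 side, 
the flow is roughly the following (simplified sketch, not the actual 
GStreamer code; the helper name is made up):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Ask the driver for DMABUF import first, fall back to its own MMAP
 * buffers (and a copy in the pipeline) if that is not supported. */
static int request_buffers(int video_fd, unsigned int count)
{
        struct v4l2_requestbuffers req = {
                .count = count,
                .type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
                .memory = V4L2_MEMORY_DMABUF,
        };

        if (ioctl(video_fd, VIDIOC_REQBUFS, &req) == 0)
                return V4L2_MEMORY_DMABUF;

        memset(&req, 0, sizeof(req));
        req.count = count;
        req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_MMAP;

        if (ioctl(video_fd, VIDIOC_REQBUFS, &req) == 0)
                return V4L2_MEMORY_MMAP;

        return -1;
}

Even when REQBUFS succeeds, the import of a particular fd can still be 
refused later at QBUF time, and that is where a clear error from the kernel 
(instead of silently corrupted data) is exactly what we want.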

To me, this shouldn't be the #1 issue. Perhaps it should simply be replaced by
userspace not supporting DMABuf heaps. Perhaps add that Linux distributions don't
always enable (or allow normal users to access) heaps (though your point 2 gets
in the way)? Unlike virtual memory, I don't think there is a very good accounting
and reclaiming mechanism for that memory, hence opening these up means any
userspace could possibly impair the system's functioning. If you can't e.g. limit
their usage within containers, this is pretty difficult for generic Linux to
carry. This is a wider problem of course, which likely affects a lot of GPU usage
too, but perhaps it should be in the lower-priority part of the todo.

> 
> This isn't true at all, we have tons of cases where device can only 
> access their special memory for certain use cases.
> Just look at scanout for displaying on dGPU, neither AMD nor NVidia 
> supports system memory here. Similar cases exists for audio/video codecs 
> where intermediate memory is only accessible by certain devices because 
> of content protection.

nit: content protection is not CODEC specific, it's a platform feature, and it's also
not really a thing upstream yet from what I'm aware of. This needs a unified
design and documentation imho, but also enough standardisation so that a generic
application can use it. Right now, content protection people have been
complaining that V4L2 (and most generic userspace) don't work with their design,
rather than trying to figure out a design that works with the existing APIs.

> 
> 2. We don't properly communicate allocation requirements to userspace.
> 
> E.g. even if you allocate from DMA-Heaps userspace can currently only 
> guess if normal, CMA or even device specific memory is needed.
> 
> 3. We seem to lack some essential parts of those restrictions in the 
> documentation.

Agreed (can't always disagree).

regards,
Nicolas

> 
> > > > > So if a device driver uses cached system memory on an architecture which
> > > > > devices which can't access it the right approach is clearly to reject
> > > > > the access.
> > > > I'd like to accent the fact that "requires cache maintenance" != "can't access".
> > > Well that depends. As said above the exporter exports the buffer as it
> > > was allocated.
> > > 
> > > If that means the the exporter provides a piece of memory which requires
> > > CPU cache snooping to access correctly then the best thing we can do is
> > > to prevent an importer which can't do this from attaching.
> > Could you elaborate more about this case? Does it exist in practice?
> > Do I assume correctly that it's about sharing a buffer between one DMA
> > engine that is cache-coherent and another that is non-coherent, where
> > the first one ends up having its accesses always go through some kind
> > of a cache (CPU cache, L2/L3/... cache, etc.)?
> 
> Yes, exactly that. What happens in this particular use case is that one 
> device driver wrote to it's internal buffer with the CPU (so some cache 
> lines where dirty) and then a device which couldn't deal with that tried 
> to access it.
> 
> We could say that all device drivers must always look at the coherency 
> of the devices which want to access their buffers. But that would 
> horrible complicate things for maintaining the drivers because then 
> drivers would need to take into account requirements from other drivers 
> while allocating their internal buffers.
> 
> That's essentially the reason why we have DMA-buf heaps. Those heaps 
> expose system memory buffers with certain properties (size, CMA, DMA bit 
> restrictions etc...) and also implement the callback functions for CPU 
> cache maintenance.
> 
> > > We do have the system and CMA dma-buf heap for cases where a device
> > > driver which wants to access the buffer with caches enabled. So this is
> > > not a limitation in functionality, it's just a matter of correctly using it.
> > > 
> > V4L2 also has the ability to allocate buffers and map them with caches enabled.
> 
> Yeah, when that's a requirement for the V4L2 device it also makes 
> perfect sense.
> 
> It's just that we shouldn't force any specific allocation behavior on a 
> device driver because of the requirements of a different device.
> 
> Giving an example a V4L2 device shouldn't be forced to use 
> videobuf2-dma-contig because some other device needs CMA. Instead we 
> should use the common DMA-buf heaps which implement this as neutral 
> supplier of system memory buffers.
> 
> > > The problem is that in this particular case the exporter provides the
> > > buffer as is, e.g. with dirty CPU caches. And the importer can't deal
> > > with that.
> > Why does the exporter leave the buffer with dirty CPU caches?
> 
> Because exporters always export the buffers as they would use it. And in 
> this case that means writing with the CPU to it.
> 
> > > Either reverting the roles of the importer or exporter or switching over
> > > to the common DMA-heaps implementation for the buffer would solve that.
> > > 
> > > > > It's the job of the higher level to prepare the buffer a device work
> > > > > with, not the one of the lower level.
> > > > What are higher and lower levels here?
> > > The MM as higher level and the DMA mapping framework as lower level.
> > > 
> > Hmm, I generally consider the DMA mapping framework a part of MM,
> > especially its allocation capabilities. It heavily relies on low level
> > MM primitives internally and exposes another set of low level
> > primitives that is more useful for device drivers. At least it should
> > be seen this way, but I agree that it currently includes things that
> > shouldn't necessarily be there, like the transparent bouncing.
> 
> Exactly that, yes! Good to hear that more people think that way.
> 
> Christoph made some comments which sounded like he agreed to some of the 
> points as well, but nobody ever said it so clearly.
> 
> > > > As per the existing design of the DMA mapping framework, the framework
> > > > handles the system DMA architecture details and DMA master drivers
> > > > take care of invoking the right DMA mapping operations around the DMA
> > > > accesses. This makes sense to me, as DMA master drivers have no idea
> > > > about the specific SoCs or buses they're plugged into, while the DMA
> > > > mapping framework has no idea when the DMA accesses are taking place.
> > > Yeah and exactly that's a bit problematic. In an ideal world device
> > > drivers wouldn't see memory they can't access in the first place.
> > > 
> > > For example what we currently do is the following:
> > > 1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
> > > 2. DMA mapping framework is called by the driver to create an sg-table.
> > > 3. The DMA mapping framework sees that this won't work and inserts
> > > bounce buffers.
> > > 4. The driver does the DMA to the bounce buffers instead.
> > > 5. The DMA mapping framework copies the data to the real location.
> > > 
> > > This is highly problematic since it removes the control of what happens
> > > here. E.g. drivers can't prevent using bounce buffers when they need
> > > coherent access for DMA-buf for example.
> > > 
> > > What we should have instead is that bounce buffers are applied at a
> > > higher level, for example by get_user_pages() or more general in the MM.
> > > 
> > I tend to agree with you on this, but I see a lot of challenges that
> > would need to be solved if we want to make it otherwise. The whole
> > reason for transparent bouncing came from the fact that many existing
> > subsystems didn't really care about the needs of the underlying
> > hardware, e.g. block or network subsystems would just pass random
> > pages to the drivers. I think the idea of DMA mapping handling this
> > internally was to avoid implementing the bouncing here and there in
> > each block or network driver that needs it. (Arguably, an optional
> > helper could be provided instead for use in those subsystems...)
> 
> Yes, totally agree. The problem is really that we moved bunch of MM and 
> DMA functions in one API.
> 
> The bounce buffers are something we should really have in a separate 
> component.
> 
> Then the functionality of allocating system memory for a specific device 
> or devices should be something provided by the MM.
> 
> And finally making this memory or any other CPU address accessible to a 
> device (IOMMU programming etc..) should then be part of an DMA API.
> 
> > > > > In other words in a proper design the higher level would prepare the
> > > > > memory in a way the device driver can work with it, not the other way
> > > > > around.
> > > > > 
> > > > > When a device driver gets memory it can't work with the correct response
> > > > > is to throw an error and bubble that up into a layer where it can be
> > > > > handled gracefully.
> > > > > 
> > > > > For example instead of using bounce buffers under the hood the DMA layer
> > > > > the MM should make sure that when you call read() with O_DIRECT that the
> > > > > pages in question are accessible by the device.
> > > > > 
> > > > I tend to agree with you if it's about a costly software "emulation"
> > > > like bounce buffers, but cache maintenance is a hardware feature
> > > > existing there by default and it's often much cheaper to operate on
> > > > cached memory and synchronize the caches rather than have everything
> > > > in uncached (or write-combined) memory.
> > > Well that we should have proper cache maintenance is really not the
> > > question. The question is where to do that?
> > > 
> > > E.g. in the DMA-mapping framework which as far as I can see should only
> > > take care of translating addresses.
> > The DMA mapping framework is actually a DMA allocation and mapping
> > framework. Generic memory allocation primitives (like alloc_pages(),
> > kmalloc()) shouldn't be used for buffers that are expected to be
> > passed to DMA engines - there exist DMA-aware replacements, like
> > dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
> > etc.
> > 
> > > Or the higher level (get_user_pages() is just one example of that) which
> > > says ok device X wants to access memory Y I need to make sure that
> > > caches are flushed/invalidated.
> > Okay, so here comes the userptr case, which I really feel like is just
> > doomed at the very start, because it does nothing to accommodate
> > hardware needs at allocation time and then just expects some magic to
> > happen to make it possible for the hardware to make use of the buffer.
> > 
> > That said, it should be still possible to handle that pretty well for
> > hardware that allows it, i.e. without much memory access constraints.
> > What would be still missing if we just use the existing gup() +
> > dma_map() + dma_sync() sequence?
> 
> Error or rather fallback handling and *not* transparently inserting 
> bounce buffers.
> 
> The whole implicit bounce buffers concept falls apart as soon as you 
> don't have a stream access any more.
> 
> > > > > > It's a use-case that is working fine today with many devices (e.g. network
> > > > > > adapters) in the ARM world, exactly because the architecture specific
> > > > > > implementation of the DMA API inserts the cache maintenance operations
> > > > > > on buffer ownership transfer.
> > > > > Yeah, I'm perfectly aware of that. The problem is that exactly that
> > > > > design totally breaks GPUs on Xen DOM0 for example.
> > > > > 
> > > > > And Xen is just one example, I can certainly say from experience that
> > > > > this design was a really really bad idea because it favors just one use
> > > > > case while making other use cases practically impossible if not really
> > > > > hard to implement.
> > > > Sorry, I haven't worked with Xen. Could you elaborate what's the
> > > > problem that this introduces for it?
> > > That's a bit longer topic. The AMD XEN devs are already working on this
> > > as far as I know. I can ping internally how far they got with sending
> > > the patches out to avoid this problem.
> > Hmm, I see. It might be a good data point to understand in which
> > direction we should be going, but I guess we can wait until they send
> > some patches.
> 
> There was just recently a longer thread on the amd-gfx mailing list 
> about that. I think looking in there as well might be beneficial.
> 
> Regards,
> Christian.
> 
> > 
> > Best regards,
> > Tomasz
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-06 18:26                                       ` Nicolas Dufresne
@ 2022-12-06 18:37                                         ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-12-06 18:37 UTC (permalink / raw)
  To: Nicolas Dufresne, Tomasz Figa
  Cc: Lucas Stach, Daniel Stone, ppaalanen, sumit.semwal, daniel,
	robdclark, dri-devel, linaro-mm-sig, linux-media

Am 06.12.22 um 19:26 schrieb Nicolas Dufresne:
> Le lundi 05 décembre 2022 à 09:28 +0100, Christian König a écrit :
>> Hi Tomasz,
>>
>> Am 05.12.22 um 07:41 schrieb Tomasz Figa:
>>> [SNIP]
>>>> In other words explicit ownership transfer is not something we would
>>>> want as requirement in the framework, cause otherwise we break tons of
>>>> use cases which require concurrent access to the underlying buffer.
>>>>
>>>> When a device driver needs explicit ownership transfer it's perfectly
>>>> possible to implement this using the dma_fence objects mentioned above.
>>>> E.g. drivers can already look at who is accessing a buffer currently and
>>>> can even grab explicit ownership of it by adding their own dma_fence
>>>> objects.
>>>>
>>>> The only exception is CPU based access, e.g. when something is written
>>>> with the CPU a cache flush might be necessary and when something is read
>>>> with the CPU a cache invalidation might be necessary.
>>>>
>>> Okay, that's much clearer now, thanks for clarifying this. So we
>>> should be covered for the cache maintenance needs originating from CPU
>>> accesses already, +/- the broken cases which don't call the begin/end
>>> CPU access routines that I mentioned above.
>>>
>>> Similarly, for any ownership transfer between different DMA engines,
>>> we should be covered either by the userspace explicitly flushing the
>>> hardware pipeline or attaching a DMA-buf fence to the buffer.
>>>
>>> But then, what's left to be solved? :) (Besides the cases of missing
>>> begin/end CPU access calls.)
>> Well there are multiple problems here:
>>
>> 1. A lot of userspace applications/frameworks assume that it can
>> allocate the buffer anywhere and it just works.
> I know you have said that about 10 times, perhaps I'm about to believe it, but
> why do you think userspace assumes this ? Did you actually read code that does
> this (that isn't meant to run on controlled environment).

Yes, absolutely.

Kodi for example used to do this, and it was actually me who pointed out 
that this whole approach is flawed in the first place.

And we had tons of customer projects which ran into trouble with that. 
It became better in the last few years, but only after pushing back on 
this many, many times.

Regards,
Christian.

>   And can you provide
> some example of broken generic userspace ? The DMABuf flow is meant to be trial
> and error. At least in GStreamer, yes, mostly only device allocation (when
> genericly usable) is implemented, but the code that has been contribute will try
> and fallback back like documented. Still fails sometimes, but that's exactly the
> kind of kernel bugs your patchset is trying to address. I don't blame anyone
> here, since why would folks on GStreamer/FFMPEG or any other "generic media
> framework" spend so much time implement "per linux device code", when non-
> embedded (constraint) linux is just handful of users (compare to Windows,
> Android, iOS users).
>
> To me, this shouldn't be #1 issue. Perhaps it should simply be replaced by
> userspace not supporting DMABuf Heaps. Perhaps add that Linux distribution don't
> always enable (or allow normal users to access) heaps (though you point 2. gets
> in the way) ? Unlike virtual memory, I don't think there is very good accounting
> and reclaiming mechanism for that memory, hence opening these means any
> userspace could possibly impair the system functioning. If you can't e.g. limit
> their usage within containers, this is pretty difficult for generic linux to
> carry. This is a wider problem of course, which likely affect a lot of GPU usage
> too, but perhaps it should be in the lower priority part of the todo.
>
>> This isn't true at all, we have tons of cases where device can only
>> access their special memory for certain use cases.
>> Just look at scanout for displaying on dGPU, neither AMD nor NVidia
>> supports system memory here. Similar cases exists for audio/video codecs
>> where intermediate memory is only accessible by certain devices because
>> of content protection.
> nit: content protection is not CODEC specific, its a platform feature, its also
> not really a thing upstream yet from what I'm aware of. This needs unified
> design and documentation imho, but also enough standardisation so that a generic
> application can use it. Right now, content protection people have been
> complaining that V4L2 (and most generic userspace) don't work with their design,
> rather then trying to figure-out a design that works with existing API.
>
>> 2. We don't properly communicate allocation requirements to userspace.
>>
>> E.g. even if you allocate from DMA-Heaps userspace can currently only
>> guess if normal, CMA or even device specific memory is needed.
>>
>> 3. We seem to lack some essential parts of those restrictions in the
>> documentation.
> Agreed (can't always disagree).
>
> regards,
> Nicolas
>
>>>>>> So if a device driver uses cached system memory on an architecture which
>>>>>> devices which can't access it the right approach is clearly to reject
>>>>>> the access.
>>>>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
>>>> Well that depends. As said above the exporter exports the buffer as it
>>>> was allocated.
>>>>
>>>> If that means the the exporter provides a piece of memory which requires
>>>> CPU cache snooping to access correctly then the best thing we can do is
>>>> to prevent an importer which can't do this from attaching.
>>> Could you elaborate more about this case? Does it exist in practice?
>>> Do I assume correctly that it's about sharing a buffer between one DMA
>>> engine that is cache-coherent and another that is non-coherent, where
>>> the first one ends up having its accesses always go through some kind
>>> of a cache (CPU cache, L2/L3/... cache, etc.)?
>> Yes, exactly that. What happens in this particular use case is that one
>> device driver wrote to it's internal buffer with the CPU (so some cache
>> lines where dirty) and then a device which couldn't deal with that tried
>> to access it.
>>
>> We could say that all device drivers must always look at the coherency
>> of the devices which want to access their buffers. But that would
>> horrible complicate things for maintaining the drivers because then
>> drivers would need to take into account requirements from other drivers
>> while allocating their internal buffers.
>>
>> That's essentially the reason why we have DMA-buf heaps. Those heaps
>> expose system memory buffers with certain properties (size, CMA, DMA bit
>> restrictions etc...) and also implement the callback functions for CPU
>> cache maintenance.
>>
>>>> We do have the system and CMA dma-buf heap for cases where a device
>>>> driver which wants to access the buffer with caches enabled. So this is
>>>> not a limitation in functionality, it's just a matter of correctly using it.
>>>>
>>> V4L2 also has the ability to allocate buffers and map them with caches enabled.
>> Yeah, when that's a requirement for the V4L2 device it also makes
>> perfect sense.
>>
>> It's just that we shouldn't force any specific allocation behavior on a
>> device driver because of the requirements of a different device.
>>
>> Giving an example a V4L2 device shouldn't be forced to use
>> videobuf2-dma-contig because some other device needs CMA. Instead we
>> should use the common DMA-buf heaps which implement this as neutral
>> supplier of system memory buffers.
>>
>>>> The problem is that in this particular case the exporter provides the
>>>> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
>>>> with that.
>>> Why does the exporter leave the buffer with dirty CPU caches?
>> Because exporters always export the buffers as they would use it. And in
>> this case that means writing with the CPU to it.
>>
>>>> Either reverting the roles of the importer or exporter or switching over
>>>> to the common DMA-heaps implementation for the buffer would solve that.
>>>>
>>>>>> It's the job of the higher level to prepare the buffer a device work
>>>>>> with, not the one of the lower level.
>>>>> What are higher and lower levels here?
>>>> The MM as higher level and the DMA mapping framework as lower level.
>>>>
>>> Hmm, I generally consider the DMA mapping framework a part of MM,
>>> especially its allocation capabilities. It heavily relies on low level
>>> MM primitives internally and exposes another set of low level
>>> primitives that is more useful for device drivers. At least it should
>>> be seen this way, but I agree that it currently includes things that
>>> shouldn't necessarily be there, like the transparent bouncing.
>> Exactly that, yes! Good to hear that more people think that way.
>>
>> Christoph made some comments which sounded like he agreed to some of the
>> points as well, but nobody ever said it so clearly.
>>
>>>>> As per the existing design of the DMA mapping framework, the framework
>>>>> handles the system DMA architecture details and DMA master drivers
>>>>> take care of invoking the right DMA mapping operations around the DMA
>>>>> accesses. This makes sense to me, as DMA master drivers have no idea
>>>>> about the specific SoCs or buses they're plugged into, while the DMA
>>>>> mapping framework has no idea when the DMA accesses are taking place.
>>>> Yeah and exactly that's a bit problematic. In an ideal world device
>>>> drivers wouldn't see memory they can't access in the first place.
>>>>
>>>> For example what we currently do is the following:
>>>> 1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
>>>> 2. DMA mapping framework is called by the driver to create an sg-table.
>>>> 3. The DMA mapping framework sees that this won't work and inserts
>>>> bounce buffers.
>>>> 4. The driver does the DMA to the bounce buffers instead.
>>>> 5. The DMA mapping framework copies the data to the real location.
>>>>
>>>> This is highly problematic since it removes the control of what happens
>>>> here. E.g. drivers can't prevent using bounce buffers when they need
>>>> coherent access for DMA-buf for example.
>>>>
>>>> What we should have instead is that bounce buffers are applied at a
>>>> higher level, for example by get_user_pages() or more general in the MM.
>>>>
>>> I tend to agree with you on this, but I see a lot of challenges that
>>> would need to be solved if we want to make it otherwise. The whole
>>> reason for transparent bouncing came from the fact that many existing
>>> subsystems didn't really care about the needs of the underlying
>>> hardware, e.g. block or network subsystems would just pass random
>>> pages to the drivers. I think the idea of DMA mapping handling this
>>> internally was to avoid implementing the bouncing here and there in
>>> each block or network driver that needs it. (Arguably, an optional
>>> helper could be provided instead for use in those subsystems...)
>> Yes, totally agree. The problem is really that we moved bunch of MM and
>> DMA functions in one API.
>>
>> The bounce buffers are something we should really have in a separate
>> component.
>>
>> Then the functionality of allocating system memory for a specific device
>> or devices should be something provided by the MM.
>>
>> And finally making this memory or any other CPU address accessible to a
>> device (IOMMU programming etc..) should then be part of an DMA API.
>>
>>>>>> In other words in a proper design the higher level would prepare the
>>>>>> memory in a way the device driver can work with it, not the other way
>>>>>> around.
>>>>>>
>>>>>> When a device driver gets memory it can't work with the correct response
>>>>>> is to throw an error and bubble that up into a layer where it can be
>>>>>> handled gracefully.
>>>>>>
>>>>>> For example instead of using bounce buffers under the hood the DMA layer
>>>>>> the MM should make sure that when you call read() with O_DIRECT that the
>>>>>> pages in question are accessible by the device.
>>>>>>
>>>>> I tend to agree with you if it's about a costly software "emulation"
>>>>> like bounce buffers, but cache maintenance is a hardware feature
>>>>> existing there by default and it's often much cheaper to operate on
>>>>> cached memory and synchronize the caches rather than have everything
>>>>> in uncached (or write-combined) memory.
>>>> Well that we should have proper cache maintenance is really not the
>>>> question. The question is where to do that?
>>>>
>>>> E.g. in the DMA-mapping framework which as far as I can see should only
>>>> take care of translating addresses.
>>> The DMA mapping framework is actually a DMA allocation and mapping
>>> framework. Generic memory allocation primitives (like alloc_pages(),
>>> kmalloc()) shouldn't be used for buffers that are expected to be
>>> passed to DMA engines - there exist DMA-aware replacements, like
>>> dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
>>> etc.
>>>
>>>> Or the higher level (get_user_pages() is just one example of that) which
>>>> says ok device X wants to access memory Y I need to make sure that
>>>> caches are flushed/invalidated.
>>> Okay, so here comes the userptr case, which I really feel like is just
>>> doomed at the very start, because it does nothing to accommodate
>>> hardware needs at allocation time and then just expects some magic to
>>> happen to make it possible for the hardware to make use of the buffer.
>>>
>>> That said, it should be still possible to handle that pretty well for
>>> hardware that allows it, i.e. without much memory access constraints.
>>> What would be still missing if we just use the existing gup() +
>>> dma_map() + dma_sync() sequence?
>> Error or rather fallback handling and *not* transparently inserting
>> bounce buffers.
>>
>> The whole implicit bounce buffers concept falls apart as soon as you
>> don't have a stream access any more.
>>
>>>>>>> It's a use-case that is working fine today with many devices (e.g. network
>>>>>>> adapters) in the ARM world, exactly because the architecture specific
>>>>>>> implementation of the DMA API inserts the cache maintenance operations
>>>>>>> on buffer ownership transfer.
>>>>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
>>>>>> design totally breaks GPUs on Xen DOM0 for example.
>>>>>>
>>>>>> And Xen is just one example, I can certainly say from experience that
>>>>>> this design was a really really bad idea because it favors just one use
>>>>>> case while making other use cases practically impossible if not really
>>>>>> hard to implement.
>>>>> Sorry, I haven't worked with Xen. Could you elaborate what's the
>>>>> problem that this introduces for it?
>>>> That's a bit longer topic. The AMD XEN devs are already working on this
>>>> as far as I know. I can ping internally how far they got with sending
>>>> the patches out to avoid this problem.
>>> Hmm, I see. It might be a good data point to understand in which
>>> direction we should be going, but I guess we can wait until they send
>>> some patches.
>> There was just recently a longer thread on the amd-gfx mailing list
>> about that. I think looking in there as well might be beneficial.
>>
>> Regards,
>> Christian.
>>
>>> Best regards,
>>> Tomasz


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
@ 2022-12-06 18:37                                         ` Christian König
  0 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-12-06 18:37 UTC (permalink / raw)
  To: Nicolas Dufresne, Tomasz Figa
  Cc: linaro-mm-sig, ppaalanen, dri-devel, linux-media, sumit.semwal

Am 06.12.22 um 19:26 schrieb Nicolas Dufresne:
> Le lundi 05 décembre 2022 à 09:28 +0100, Christian König a écrit :
>> Hi Tomasz,
>>
>> Am 05.12.22 um 07:41 schrieb Tomasz Figa:
>>> [SNIP]
>>>> In other words explicit ownership transfer is not something we would
>>>> want as requirement in the framework, cause otherwise we break tons of
>>>> use cases which require concurrent access to the underlying buffer.
>>>>
>>>> When a device driver needs explicit ownership transfer it's perfectly
>>>> possible to implement this using the dma_fence objects mentioned above.
>>>> E.g. drivers can already look at who is accessing a buffer currently and
>>>> can even grab explicit ownership of it by adding their own dma_fence
>>>> objects.
>>>>
>>>> The only exception is CPU based access, e.g. when something is written
>>>> with the CPU a cache flush might be necessary and when something is read
>>>> with the CPU a cache invalidation might be necessary.
>>>>
>>> Okay, that's much clearer now, thanks for clarifying this. So we
>>> should be covered for the cache maintenance needs originating from CPU
>>> accesses already, +/- the broken cases which don't call the begin/end
>>> CPU access routines that I mentioned above.
>>>
>>> Similarly, for any ownership transfer between different DMA engines,
>>> we should be covered either by the userspace explicitly flushing the
>>> hardware pipeline or attaching a DMA-buf fence to the buffer.
>>>
>>> But then, what's left to be solved? :) (Besides the cases of missing
>>> begin/end CPU access calls.)
>> Well there are multiple problems here:
>>
>> 1. A lot of userspace applications/frameworks assume that it can
>> allocate the buffer anywhere and it just works.
> I know you have said that about 10 times, perhaps I'm about to believe it, but
> why do you think userspace assumes this ? Did you actually read code that does
> this (that isn't meant to run on controlled environment).

Yes, absolutely.

Kodi for example used to do this, and it was actually me who pointed out 
that this whole approach is flawed in the first place.

And we had tons of customer projects which ran into trouble with that. 
It became better in the last few years, but only after pushing back on 
this many, many times.

Regards,
Christian.

>   And can you provide
> some example of broken generic userspace ? The DMABuf flow is meant to be trial
> and error. At least in GStreamer, yes, mostly only device allocation (when
> genericly usable) is implemented, but the code that has been contribute will try
> and fallback back like documented. Still fails sometimes, but that's exactly the
> kind of kernel bugs your patchset is trying to address. I don't blame anyone
> here, since why would folks on GStreamer/FFMPEG or any other "generic media
> framework" spend so much time implement "per linux device code", when non-
> embedded (constraint) linux is just handful of users (compare to Windows,
> Android, iOS users).
>
> To me, this shouldn't be #1 issue. Perhaps it should simply be replaced by
> userspace not supporting DMABuf Heaps. Perhaps add that Linux distribution don't
> always enable (or allow normal users to access) heaps (though you point 2. gets
> in the way) ? Unlike virtual memory, I don't think there is very good accounting
> and reclaiming mechanism for that memory, hence opening these means any
> userspace could possibly impair the system functioning. If you can't e.g. limit
> their usage within containers, this is pretty difficult for generic linux to
> carry. This is a wider problem of course, which likely affect a lot of GPU usage
> too, but perhaps it should be in the lower priority part of the todo.
>
>> This isn't true at all, we have tons of cases where devices can only
>> access their special memory for certain use cases.
>> Just look at scanout for displaying on a dGPU, neither AMD nor NVidia
>> supports system memory here. Similar cases exist for audio/video codecs
>> where intermediate memory is only accessible by certain devices because
>> of content protection.
> nit: content protection is not CODEC specific, it's a platform feature, and it's
> also not really a thing upstream yet as far as I'm aware. This needs a unified
> design and documentation imho, but also enough standardisation so that a generic
> application can use it. Right now, content protection people have been
> complaining that V4L2 (and most generic userspace) doesn't work with their design,
> rather than trying to figure out a design that works with the existing APIs.
>
>> 2. We don't properly communicate allocation requirements to userspace.
>>
>> E.g. even if you allocate from DMA-Heaps userspace can currently only
>> guess if normal, CMA or even device specific memory is needed.
>>
>> 3. We seem to lack some essential parts of those restrictions in the
>> documentation.
> Agreed (can't always disagree).
>
> regards,
> Nicolas
>
>>>>>> So if a device driver uses cached system memory on an architecture with
>>>>>> devices which can't access it, the right approach is clearly to reject
>>>>>> the access.
>>>>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
>>>> Well that depends. As said above the exporter exports the buffer as it
>>>> was allocated.
>>>>
>>>> If that means that the exporter provides a piece of memory which requires
>>>> CPU cache snooping to access correctly then the best thing we can do is
>>>> to prevent an importer which can't do this from attaching.
>>> Could you elaborate more about this case? Does it exist in practice?
>>> Do I assume correctly that it's about sharing a buffer between one DMA
>>> engine that is cache-coherent and another that is non-coherent, where
>>> the first one ends up having its accesses always go through some kind
>>> of a cache (CPU cache, L2/L3/... cache, etc.)?
>> Yes, exactly that. What happens in this particular use case is that one
>> device driver wrote to its internal buffer with the CPU (so some cache
>> lines were dirty) and then a device which couldn't deal with that tried
>> to access it.
>>
>> We could say that all device drivers must always look at the coherency
>> of the devices which want to access their buffers. But that would
>> horribly complicate things for maintaining the drivers because then
>> drivers would need to take into account requirements from other drivers
>> while allocating their internal buffers.
>>
>> That's essentially the reason why we have DMA-buf heaps. Those heaps
>> expose system memory buffers with certain properties (size, CMA, DMA bit
>> restrictions etc...) and also implement the callback functions for CPU
>> cache maintenance.
>>
>>>> We do have the system and CMA dma-buf heaps for cases where a device
>>>> driver wants to access the buffer with caches enabled. So this is
>>>> not a limitation in functionality, it's just a matter of correctly using it.
>>>>
>>> V4L2 also has the ability to allocate buffers and map them with caches enabled.
>> Yeah, when that's a requirement for the V4L2 device it also makes
>> perfect sense.
>>
>> It's just that we shouldn't force any specific allocation behavior on a
>> device driver because of the requirements of a different device.
>>
>> To give an example, a V4L2 device shouldn't be forced to use
>> videobuf2-dma-contig because some other device needs CMA. Instead we
>> should use the common DMA-buf heaps which implement this as neutral
>> supplier of system memory buffers.
>>
>>>> The problem is that in this particular case the exporter provides the
>>>> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
>>>> with that.
>>> Why does the exporter leave the buffer with dirty CPU caches?
>> Because exporters always export the buffers as they would use them. And in
>> this case that means writing to them with the CPU.
>>
>>>> Either reversing the roles of importer and exporter or switching over
>>>> to the common DMA-heaps implementation for the buffer would solve that.
>>>>
>>>>>> It's the job of the higher level to prepare the buffer a device can work
>>>>>> with, not that of the lower level.
>>>>> What are higher and lower levels here?
>>>> The MM as higher level and the DMA mapping framework as lower level.
>>>>
>>> Hmm, I generally consider the DMA mapping framework a part of MM,
>>> especially its allocation capabilities. It heavily relies on low level
>>> MM primitives internally and exposes another set of low level
>>> primitives that is more useful for device drivers. At least it should
>>> be seen this way, but I agree that it currently includes things that
>>> shouldn't necessarily be there, like the transparent bouncing.
>> Exactly that, yes! Good to hear that more people think that way.
>>
>> Christoph made some comments which sounded like he agreed to some of the
>> points as well, but nobody ever said it so clearly.
>>
>>>>> As per the existing design of the DMA mapping framework, the framework
>>>>> handles the system DMA architecture details and DMA master drivers
>>>>> take care of invoking the right DMA mapping operations around the DMA
>>>>> accesses. This makes sense to me, as DMA master drivers have no idea
>>>>> about the specific SoCs or buses they're plugged into, while the DMA
>>>>> mapping framework has no idea when the DMA accesses are taking place.
>>>> Yeah and exactly that's a bit problematic. In an ideal world device
>>>> drivers wouldn't see memory they can't access in the first place.
>>>>
>>>> For example what we currently do is the following:
>>>> 1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
>>>> 2. DMA mapping framework is called by the driver to create an sg-table.
>>>> 3. The DMA mapping framework sees that this won't work and inserts
>>>> bounce buffers.
>>>> 4. The driver does the DMA to the bounce buffers instead.
>>>> 5. The DMA mapping framework copies the data to the real location.
>>>>
>>>> This is highly problematic since it removes the control of what happens
>>>> here. E.g. drivers can't prevent using bounce buffers when they need
>>>> coherent access for DMA-buf for example.
>>>>
>>>> What we should have instead is that bounce buffers are applied at a
>>>> higher level, for example by get_user_pages() or more general in the MM.
>>>>
>>> I tend to agree with you on this, but I see a lot of challenges that
>>> would need to be solved if we want to make it otherwise. The whole
>>> reason for transparent bouncing came from the fact that many existing
>>> subsystems didn't really care about the needs of the underlying
>>> hardware, e.g. block or network subsystems would just pass random
>>> pages to the drivers. I think the idea of DMA mapping handling this
>>> internally was to avoid implementing the bouncing here and there in
>>> each block or network driver that needs it. (Arguably, an optional
>>> helper could be provided instead for use in those subsystems...)
>> Yes, totally agree. The problem is really that we moved a bunch of MM and
>> DMA functions into one API.
>>
>> The bounce buffers are something we should really have in a separate
>> component.
>>
>> Then the functionality of allocating system memory for a specific device
>> or devices should be something provided by the MM.
>>
>> And finally making this memory or any other CPU address accessible to a
>> device (IOMMU programming etc.) should then be part of a DMA API.
>>
>>>>>> In other words in a proper design the higher level would prepare the
>>>>>> memory in a way the device driver can work with it, not the other way
>>>>>> around.
>>>>>>
>>>>>> When a device driver gets memory it can't work with the correct response
>>>>>> is to throw an error and bubble that up into a layer where it can be
>>>>>> handled gracefully.
>>>>>>
>>>>>> For example, instead of the DMA layer using bounce buffers under the hood,
>>>>>> the MM should make sure that when you call read() with O_DIRECT the
>>>>>> pages in question are accessible by the device.
>>>>>>
>>>>> I tend to agree with you if it's about a costly software "emulation"
>>>>> like bounce buffers, but cache maintenance is a hardware feature
>>>>> existing there by default and it's often much cheaper to operate on
>>>>> cached memory and synchronize the caches rather than have everything
>>>>> in uncached (or write-combined) memory.
>>>> Well that we should have proper cache maintenance is really not the
>>>> question. The question is where to do that?
>>>>
>>>> E.g. in the DMA-mapping framework which as far as I can see should only
>>>> take care of translating addresses.
>>> The DMA mapping framework is actually a DMA allocation and mapping
>>> framework. Generic memory allocation primitives (like alloc_pages(),
>>> kmalloc()) shouldn't be used for buffers that are expected to be
>>> passed to DMA engines - there exist DMA-aware replacements, like
>>> dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
>>> etc.
>>>
>>>> Or the higher level (get_user_pages() is just one example of that) which
>>>> says ok device X wants to access memory Y I need to make sure that
>>>> caches are flushed/invalidated.
>>> Okay, so here comes the userptr case, which I really feel like is just
>>> doomed at the very start, because it does nothing to accommodate
>>> hardware needs at allocation time and then just expects some magic to
>>> happen to make it possible for the hardware to make use of the buffer.
>>>
>>> That said, it should be still possible to handle that pretty well for
>>> hardware that allows it, i.e. without much memory access constraints.
>>> What would be still missing if we just use the existing gup() +
>>> dma_map() + dma_sync() sequence?
>> Error or rather fallback handling and *not* transparently inserting
>> bounce buffers.
>>
>> The whole implicit bounce buffers concept falls apart as soon as you
>> don't have a stream access any more.
>>
>>>>>>> It's a use-case that is working fine today with many devices (e.g. network
>>>>>>> adapters) in the ARM world, exactly because the architecture specific
>>>>>>> implementation of the DMA API inserts the cache maintenance operations
>>>>>>> on buffer ownership transfer.
>>>>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
>>>>>> design totally breaks GPUs on Xen DOM0 for example.
>>>>>>
>>>>>> And Xen is just one example, I can certainly say from experience that
>>>>>> this design was a really really bad idea because it favors just one use
>>>>>> case while making other use cases practically impossible if not really
>>>>>> hard to implement.
>>>>> Sorry, I haven't worked with Xen. Could you elaborate what's the
>>>>> problem that this introduces for it?
>>>> That's a bit longer topic. The AMD XEN devs are already working on this
>>>> as far as I know. I can ping internally how far they got with sending
>>>> the patches out to avoid this problem.
>>> Hmm, I see. It might be a good data point to understand in which
>>> direction we should be going, but I guess we can wait until they send
>>> some patches.
>> There was just recently a longer thread on the amd-gfx mailing list
>> about that. I think looking in there as well might be beneficial.
>>
>> Regards,
>> Christian.
>>
>>> Best regards,
>>> Tomasz


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-05  8:28                                     ` Christian König
@ 2022-12-09  8:26                                       ` Tomasz Figa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tomasz Figa @ 2022-12-09  8:26 UTC (permalink / raw)
  To: Christian König
  Cc: Lucas Stach, Nicolas Dufresne, Daniel Stone, ppaalanen,
	sumit.semwal, daniel, robdclark, dri-devel, linaro-mm-sig,
	linux-media

On Mon, Dec 5, 2022 at 5:29 PM Christian König <christian.koenig@amd.com> wrote:
>
> Hi Tomasz,
>
> Am 05.12.22 um 07:41 schrieb Tomasz Figa:
> > [SNIP]
> >> In other words explicit ownership transfer is not something we would
> >> want as requirement in the framework, cause otherwise we break tons of
> >> use cases which require concurrent access to the underlying buffer.
> >>
> >> When a device driver needs explicit ownership transfer it's perfectly
> >> possible to implement this using the dma_fence objects mentioned above.
> >> E.g. drivers can already look at who is accessing a buffer currently and
> >> can even grab explicit ownership of it by adding their own dma_fence
> >> objects.
> >>
> >> The only exception is CPU based access, e.g. when something is written
> >> with the CPU a cache flush might be necessary and when something is read
> >> with the CPU a cache invalidation might be necessary.
> >>
> > Okay, that's much clearer now, thanks for clarifying this. So we
> > should be covered for the cache maintenance needs originating from CPU
> > accesses already, +/- the broken cases which don't call the begin/end
> > CPU access routines that I mentioned above.
> >
> > Similarly, for any ownership transfer between different DMA engines,
> > we should be covered either by the userspace explicitly flushing the
> > hardware pipeline or attaching a DMA-buf fence to the buffer.
> >
> > But then, what's left to be solved? :) (Besides the cases of missing
> > begin/end CPU access calls.)
>
> Well there are multiple problems here:
>
> 1. A lot of userspace applications/frameworks assume that they can
> allocate the buffer anywhere and it just works.
>
> This isn't true at all, we have tons of cases where devices can only
> access their special memory for certain use cases.
> Just look at scanout for displaying on a dGPU, neither AMD nor NVidia
> supports system memory here. Similar cases exist for audio/video codecs
> where intermediate memory is only accessible by certain devices because
> of content protection.

Ack.

Although I think the most common case on mainstream Linux today is
properly allocating for device X (e.g. V4L2 video decoder or DRM-based
GPU) and hoping that other devices would accept the buffers just fine,
which isn't a given on most platforms (although often it's just about
pixel format, width/height/stride alignment, tiling, etc. rather than
the memory itself). That's why ChromiumOS has minigbm and Android has
gralloc that act as the central point of knowledge on buffer
allocation.
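
To illustrate what I mean by a central point of knowledge: with
GBM/minigbm the caller states every intended usage up front and the
platform-specific backend picks memory that satisfies all of them. A
rough userspace sketch (the render node path, resolution and usage
flags are just examples, error handling trimmed):

#include <fcntl.h>
#include <gbm.h>

static int alloc_shared_bo(int *out_fd)
{
	/* The usage flags describe every consumer up front; the backend
	 * is the one place that knows how to satisfy both of them. */
	struct gbm_device *gbm;
	struct gbm_bo *bo;
	int fd = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);

	if (fd < 0)
		return -1;

	gbm = gbm_create_device(fd);
	if (!gbm)
		return -1;

	bo = gbm_bo_create(gbm, 1920, 1080, GBM_FORMAT_XRGB8888,
			   GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING);
	if (!bo)
		return -1;

	*out_fd = gbm_bo_get_fd(bo);	/* dma-buf fd to share with other devices */
	return 0;
}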

>
> 2. We don't properly communicate allocation requirements to userspace.
>
> E.g. even if you allocate from DMA-Heaps userspace can currently only
> guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace,
because now it needs to pick the right heap. With allocation built
into the specific UAPI (like V4L2), it's at least possible to allocate
for one specific device without having any knowledge about allocation
constraints in the userspace.
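
To make that concrete, this is roughly what a heap allocation looks
like from userspace today (a minimal sketch; the heap name is exactly
the piece of knowledge generic userspace doesn't have):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/dma-heap.h>

static int alloc_from_heap(const char *heap, size_t size)
{
	/* The caller has to know which heap ("system", a CMA heap, some
	 * vendor-specific one, ...) satisfies the constraints of every
	 * device that will touch the buffer. */
	struct dma_heap_allocation_data data = {
		.len = size,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	char path[64];
	int heap_fd, ret;

	snprintf(path, sizeof(path), "/dev/dma_heap/%s", heap);
	heap_fd = open(path, O_RDONLY | O_CLOEXEC);
	if (heap_fd < 0)
		return -1;

	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
	close(heap_fd);

	return ret < 0 ? -1 : (int)data.fd;	/* dma-buf fd on success */
}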

>
> 3. We seem to lack some essential parts of those restrictions in the
> documentation.
>

Ack.

> >>>> So if a device driver uses cached system memory on an architecture with
> >>>> devices which can't access it, the right approach is clearly to reject
> >>>> the access.
> >>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
> >> Well that depends. As said above the exporter exports the buffer as it
> >> was allocated.
> >>
> >> If that means that the exporter provides a piece of memory which requires
> >> CPU cache snooping to access correctly then the best thing we can do is
> >> to prevent an importer which can't do this from attaching.
> > Could you elaborate more about this case? Does it exist in practice?
> > Do I assume correctly that it's about sharing a buffer between one DMA
> > engine that is cache-coherent and another that is non-coherent, where
> > the first one ends up having its accesses always go through some kind
> > of a cache (CPU cache, L2/L3/... cache, etc.)?
>
> Yes, exactly that. What happens in this particular use case is that one
> device driver wrote to its internal buffer with the CPU (so some cache
> lines were dirty) and then a device which couldn't deal with that tried
> to access it.

If so, shouldn't that driver surround its CPU accesses with
begin/end_cpu_access() in the first place?

The case that I was suggesting was of a hardware block that actually
sits behind the CPU cache and thus dirties it on writes, not the
driver doing that. (I haven't personally encountered such a system,
though.)
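
Just to spell out what I mean, a minimal sketch of a driver bracketing
its CPU writes to an imported buffer (assuming a vmap'able, non-iomem
buffer; locking details and error unwinding trimmed):

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/iosys-map.h>
#include <linux/string.h>

static int fill_buffer_with_cpu(struct dma_buf *dmabuf, const void *src,
				size_t len)
{
	struct iosys_map map;
	int ret;

	/* Give the exporter the chance to do whatever cache maintenance
	 * the buffer needs before the CPU touches it... */
	ret = dma_buf_begin_cpu_access(dmabuf, DMA_TO_DEVICE);
	if (ret)
		return ret;

	ret = dma_buf_vmap(dmabuf, &map);
	if (!ret) {
		memcpy(map.vaddr, src, len);
		dma_buf_vunmap(dmabuf, &map);
	}

	/* ...and again after the CPU is done with it. */
	return dma_buf_end_cpu_access(dmabuf, DMA_TO_DEVICE);
}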

>
> We could say that all device drivers must always look at the coherency
> of the devices which want to access their buffers. But that would
> horribly complicate things for maintaining the drivers because then
> drivers would need to take into account requirements from other drivers
> while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA
mapping API, but currently it's only handling requirements of one
device. And we don't have any information from the userspace what
other devices the buffer would be used with...

Actually, do we even have such information in the userspace today?
Let's say I do a video call in a web browser on a typical Linux
system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
V4L2 camera fills in buffers with video frames and both encoder and
display consume them. Do we have a central place which would know that
a buffer needs to be allocated that works with the producer and all
consumers?

>
> That's essentially the reason why we have DMA-buf heaps. Those heaps
> expose system memory buffers with certain properties (size, CMA, DMA bit
> restrictions etc...) and also implement the callback functions for CPU
> cache maintenance.
>

How do DMA-buf heaps change anything here? We already have CPU cache
maintenance callbacks in the DMA-buf API - the begin/end_cpu_access()
for CPU accesses and dmabuf_map/unmap_attachment() for device accesses
(which arguably shouldn't actually do CPU cache maintenance, unless
that's implied by how either of the involved DMA engines work).
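
For reference, the "device access" side I mean is the usual attach/map
sequence, roughly the following (a sketch only, locking omitted):

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>

static struct sg_table *import_for_dma(struct dma_buf *dmabuf,
				       struct device *dev,
				       struct dma_buf_attachment **attach)
{
	struct sg_table *sgt;

	/* Attaching lets the exporter check and prepare for this device... */
	*attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(*attach))
		return ERR_CAST(*attach);

	/* ...and mapping hands back an sg_table that is already usable
	 * for DMA by that device. */
	sgt = dma_buf_map_attachment(*attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		dma_buf_detach(dmabuf, *attach);

	return sgt;
}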


> >> We do have the system and CMA dma-buf heaps for cases where a device
> >> driver wants to access the buffer with caches enabled. So this is
> >> not a limitation in functionality, it's just a matter of correctly using it.
> >>
> > V4L2 also has the ability to allocate buffers and map them with caches enabled.
>
> Yeah, when that's a requirement for the V4L2 device it also makes
> perfect sense.
>
> It's just that we shouldn't force any specific allocation behavior on a
> device driver because of the requirements of a different device.
>
> To give an example, a V4L2 device shouldn't be forced to use
> videobuf2-dma-contig because some other device needs CMA. Instead we
> should use the common DMA-buf heaps which implement this as neutral
> supplier of system memory buffers.

Agreed, but that's only possible if we have a central entity that
understands what devices the requested buffer would be used with. My
understanding is that we're nowhere close to that with mainstream
Linux today.

// As a side note, videobuf2-dma-contig can actually allocate
discontiguous memory, if the device is behind an IOMMU. Actually with
the recent DMA API improvements, we could probably coalesce
vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.

>
> >> The problem is that in this particular case the exporter provides the
> >> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
> >> with that.
> > Why does the exporter leave the buffer with dirty CPU caches?
>
> Because exporters always export the buffers as they would use them. And in
> this case that means writing to them with the CPU.
>

Sorry for the question not being very clear. I meant: How do the CPU
caches get dirty in that case?

> >> Either reversing the roles of importer and exporter or switching over
> >> to the common DMA-heaps implementation for the buffer would solve that.
> >>
> >>>> It's the job of the higher level to prepare the buffer a device can work
> >>>> with, not that of the lower level.
> >>> What are higher and lower levels here?
> >> The MM as higher level and the DMA mapping framework as lower level.
> >>
> > Hmm, I generally consider the DMA mapping framework a part of MM,
> > especially its allocation capabilities. It heavily relies on low level
> > MM primitives internally and exposes another set of low level
> > primitives that is more useful for device drivers. At least it should
> > be seen this way, but I agree that it currently includes things that
> > shouldn't necessarily be there, like the transparent bouncing.
>
> Exactly that, yes! Good to hear that more people think that way.
>
> Christoph made some comments which sounded like he agreed to some of the
> points as well, but nobody ever said it so clearly.
>

An interesting anecdote would be a case where we had the dma_mask set
incorrectly for one of the devices, which ended up triggering the
implicit bouncing and hurting the performance a lot, despite all of
the functional testing passing just fine...

> >>> As per the existing design of the DMA mapping framework, the framework
> >>> handles the system DMA architecture details and DMA master drivers
> >>> take care of invoking the right DMA mapping operations around the DMA
> >>> accesses. This makes sense to me, as DMA master drivers have no idea
> >>> about the specific SoCs or buses they're plugged into, while the DMA
> >>> mapping framework has no idea when the DMA accesses are taking place.
> >> Yeah and exactly that's a bit problematic. In an ideal world device
> >> drivers wouldn't see memory they can't access in the first place.
> >>
> >> For example what we currently do is the following:
> >> 1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
> >> 2. DMA mapping framework is called by the driver to create an sg-table.
> >> 3. The DMA mapping framework sees that this won't work and inserts
> >> bounce buffers.
> >> 4. The driver does the DMA to the bounce buffers instead.
> >> 5. The DMA mapping framework copies the data to the real location.
> >>
> >> This is highly problematic since it removes the control of what happens
> >> here. E.g. drivers can't prevent using bounce buffers when they need
> >> coherent access for DMA-buf for example.
> >>
> >> What we should have instead is that bounce buffers are applied at a
> >> higher level, for example by get_user_pages() or more general in the MM.
> >>
> > I tend to agree with you on this, but I see a lot of challenges that
> > would need to be solved if we want to make it otherwise. The whole
> > reason for transparent bouncing came from the fact that many existing
> > subsystems didn't really care about the needs of the underlying
> > hardware, e.g. block or network subsystems would just pass random
> > pages to the drivers. I think the idea of DMA mapping handling this
> > internally was to avoid implementing the bouncing here and there in
> > each block or network driver that needs it. (Arguably, an optional
> > helper could be provided instead for use in those subsystems...)
>
> Yes, totally agree. The problem is really that we moved a bunch of MM and
> DMA functions into one API.
>
> The bounce buffers are something we should really have in a separate
> component.
>
> Then the functionality of allocating system memory for a specific device
> or devices should be something provided by the MM.
>
> And finally making this memory or any other CPU address accessible to a
> device (IOMMU programming etc.) should then be part of a DMA API.
>

Remember that actually making the memory accessible to a device often
needs to be handled already as a part of the allocation (e.g. dma_mask
in the non-IOMMU case). So I tend to think that the current division
of responsibilities is mostly fine - the dma_alloc family should be
seen as a part of MM already, especially with all the recent
improvements from Christoph, like dma_alloc_pages().

That said, it may indeed make sense to move things like IOMMU mapping
management out of the dma_alloc() and just reduce those functions to
simply returning a set of pages that satisfy the allocation
constraints. One would need to call dma_map() after the allocation,
but that's actually a fair expectation. Sometimes it might be
preferred to pre-allocate the memory, but only map it into the device
address space when it's really necessary.
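
As a concrete anchor for the above: dma_alloc_pages() already hands
back struct pages that satisfy the device's addressing constraints, it
just also establishes the DMA mapping in the same call (sketch):

#include <linux/dma-mapping.h>

static struct page *alloc_for_device(struct device *dev, size_t size,
				     dma_addr_t *dma_addr)
{
	/* Pages reachable by this device, plus a DMA address, in one
	 * step. Splitting the two would mean returning only the pages
	 * here and doing the mapping later, when the device actually
	 * needs the buffer. Freed again with dma_free_pages(). */
	return dma_alloc_pages(dev, size, dma_addr, DMA_BIDIRECTIONAL,
			       GFP_KERNEL);
}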

> >>>> In other words in a proper design the higher level would prepare the
> >>>> memory in a way the device driver can work with it, not the other way
> >>>> around.
> >>>>
> >>>> When a device driver gets memory it can't work with the correct response
> >>>> is to throw an error and bubble that up into a layer where it can be
> >>>> handled gracefully.
> >>>>
> >>>> For example, instead of the DMA layer using bounce buffers under the hood,
> >>>> the MM should make sure that when you call read() with O_DIRECT the
> >>>> pages in question are accessible by the device.
> >>>>
> >>> I tend to agree with you if it's about a costly software "emulation"
> >>> like bounce buffers, but cache maintenance is a hardware feature
> >>> existing there by default and it's often much cheaper to operate on
> >>> cached memory and synchronize the caches rather than have everything
> >>> in uncached (or write-combined) memory.
> >> Well that we should have proper cache maintenance is really not the
> >> question. The question is where to do that?
> >>
> >> E.g. in the DMA-mapping framework which as far as I can see should only
> >> take care of translating addresses.
> > The DMA mapping framework is actually a DMA allocation and mapping
> > framework. Generic memory allocation primitives (like alloc_pages(),
> > kmalloc()) shouldn't be used for buffers that are expected to be
> > passed to DMA engines - there exist DMA-aware replacements, like
> > dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
> > etc.
> >
> >> Or the higher level (get_user_pages() is just one example of that) which
> >> says ok device X wants to access memory Y I need to make sure that
> >> caches are flushed/invalidated.
> > Okay, so here comes the userptr case, which I really feel like is just
> > doomed at the very start, because it does nothing to accommodate
> > hardware needs at allocation time and then just expects some magic to
> > happen to make it possible for the hardware to make use of the buffer.
> >
> > That said, it should be still possible to handle that pretty well for
> > hardware that allows it, i.e. without much memory access constraints.
> > What would be still missing if we just use the existing gup() +
> > dma_map() + dma_sync() sequence?
>
> Error or rather fallback handling and *not* transparently inserting
> bounce buffers.
>
> The whole implicit bounce buffers concept falls apart as soon as you
> don't have a stream access any more.
>

I see. So I think we agree on that. :)
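
For the record, the gup() + dma_map() + dma_sync() sequence I had in
mind is roughly the following (a sketch for a device writing into a
userptr range; error unwinding and the final unpin/unmap left out):

#include <linux/mm.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int dma_to_userptr(struct device *dev, unsigned long uaddr,
			  int npages, struct page **pages,
			  struct sg_table *sgt)
{
	int ret;

	/* Pin the user pages backing the userptr range. */
	ret = pin_user_pages_fast(uaddr, npages, FOLL_WRITE, pages);
	if (ret != npages)
		return -EFAULT;

	ret = sg_alloc_table_from_pages(sgt, pages, npages, 0,
					(size_t)npages << PAGE_SHIFT,
					GFP_KERNEL);
	if (ret)
		return ret;

	/* Map for the device and hand it off for DMA... */
	ret = dma_map_sgtable(dev, sgt, DMA_FROM_DEVICE, 0);
	if (ret)
		return ret;

	/* ... device DMA happens here ... */

	/* ...then make the result visible to the CPU again. */
	dma_sync_sgtable_for_cpu(dev, sgt, DMA_FROM_DEVICE);
	return 0;
}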

> >>>>> It's a use-case that is working fine today with many devices (e.g. network
> >>>>> adapters) in the ARM world, exactly because the architecture specific
> >>>>> implementation of the DMA API inserts the cache maintenance operations
> >>>>> on buffer ownership transfer.
> >>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
> >>>> design totally breaks GPUs on Xen DOM0 for example.
> >>>>
> >>>> And Xen is just one example, I can certainly say from experience that
> >>>> this design was a really really bad idea because it favors just one use
> >>>> case while making other use cases practically impossible if not really
> >>>> hard to implement.
> >>> Sorry, I haven't worked with Xen. Could you elaborate what's the
> >>> problem that this introduces for it?
> >> That's a bit longer topic. The AMD XEN devs are already working on this
> >> as far as I know. I can ping internally how far they got with sending
> >> the patches out to avoid this problem.
> > Hmm, I see. It might be a good data point to understand in which
> > direction we should be going, but I guess we can wait until they send
> > some patches.
>
> There was just recently a longer thread on the amd-gfx mailing list
> about that. I think looking in there as well might be beneficial.

Okay, let me check. Thanks,

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-09  8:26                                       ` Tomasz Figa
@ 2022-12-09  9:32                                         ` Pekka Paalanen
  -1 siblings, 0 replies; 110+ messages in thread
From: Pekka Paalanen @ 2022-12-09  9:32 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Christian König, Lucas Stach, Nicolas Dufresne,
	Daniel Stone, sumit.semwal, daniel, robdclark, dri-devel,
	linaro-mm-sig, linux-media

[-- Attachment #1: Type: text/plain, Size: 7983 bytes --]

On Fri, 9 Dec 2022 17:26:06 +0900
Tomasz Figa <tfiga@chromium.org> wrote:

> On Mon, Dec 5, 2022 at 5:29 PM Christian König <christian.koenig@amd.com> wrote:
> >
> > Hi Tomasz,
> >
> > Am 05.12.22 um 07:41 schrieb Tomasz Figa:  
> > > [SNIP]  
> > >> In other words explicit ownership transfer is not something we would
> > >> want as requirement in the framework, cause otherwise we break tons of
> > >> use cases which require concurrent access to the underlying buffer.
> > >>
> > >> When a device driver needs explicit ownership transfer it's perfectly
> > >> possible to implement this using the dma_fence objects mentioned above.
> > >> E.g. drivers can already look at who is accessing a buffer currently and
> > >> can even grab explicit ownership of it by adding their own dma_fence
> > >> objects.
> > >>
> > >> The only exception is CPU based access, e.g. when something is written
> > >> with the CPU a cache flush might be necessary and when something is read
> > >> with the CPU a cache invalidation might be necessary.
> > >>  
> > > Okay, that's much clearer now, thanks for clarifying this. So we
> > > should be covered for the cache maintenance needs originating from CPU
> > > accesses already, +/- the broken cases which don't call the begin/end
> > > CPU access routines that I mentioned above.
> > >
> > > Similarly, for any ownership transfer between different DMA engines,
> > > we should be covered either by the userspace explicitly flushing the
> > > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > >
> > > But then, what's left to be solved? :) (Besides the cases of missing
> > > begin/end CPU access calls.)  
> >
> > Well there are multiple problems here:
> >
> > 1. A lot of userspace applications/frameworks assume that they can
> > allocate the buffer anywhere and it just works.
> >
> > This isn't true at all, we have tons of cases where devices can only
> > access their special memory for certain use cases.
> > Just look at scanout for displaying on a dGPU, neither AMD nor NVidia
> > supports system memory here. Similar cases exist for audio/video codecs
> > where intermediate memory is only accessible by certain devices because
> > of content protection.  
> 
> Ack.
> 
> Although I think the most common case on mainstream Linux today is
> properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> GPU) and hoping that other devices would accept the buffers just fine,
> which isn't a given on most platforms (although often it's just about
> pixel format, width/height/stride alignment, tiling, etc. rather than
> the memory itself). That's why ChromiumOS has minigbm and Android has
> gralloc that act as the central point of knowledge on buffer
> allocation.

Hi,

as an anecdote, when I was improving Mutter's cross-DRM-device handling
(for DisplayLink uses) a few years ago, I implemented several different
approaches of where to allocate, to try until going for the slowest but
guaranteed to work case of copying every update into and out of sysram.

It seems there are two different approaches in general for allocation
and sharing:

1. Try different things until it works or you run out of options

pro:
- no need for a single software component to know everything about
  every device in the system

con:
- who bothers with fallbacks, if the first try works on my system for
  my use case I test with? I.e. cost of code in users.
- trial-and-error can be very laborious (allocate, share with all
  devices, populate, test)
- the search space might be huge


2. Have a central component that knows what to do

pro:
- It might work on the first attempt, so no fallbacks in users.
- It might be optimal.

con:
- You need a software component that knows everything about every
  single combination of hardware in existence, multiplied by use cases.


Neither seems good, which brings us back to https://github.com/cubanismo/allocator .


> > 2. We don't properly communicate allocation requirements to userspace.
> >
> > E.g. even if you allocate from DMA-Heaps userspace can currently only
> > guess if normal, CMA or even device specific memory is needed.  
> 
> DMA-buf heaps actually make it even more difficult for the userspace,
> because now it needs to pick the right heap. With allocation built
> into the specific UAPI (like V4L2), it's at least possible to allocate
> for one specific device without having any knowledge about allocation
> constraints in the userspace.
> 
> >
> > 3. We seem to lack some essential parts of those restrictions in the
> > documentation.
> >  
> 
> Ack.
> 
> > >>>> So if a device driver uses cached system memory on an architecture with
> > >>>> devices which can't access it, the right approach is clearly to reject
> > >>>> the access.  
> > >>> I'd like to accent the fact that "requires cache maintenance" != "can't access".  
> > >> Well that depends. As said above the exporter exports the buffer as it
> > >> was allocated.
> > >>
> > >> If that means that the exporter provides a piece of memory which requires
> > >> CPU cache snooping to access correctly then the best thing we can do is
> > >> to prevent an importer which can't do this from attaching.  
> > > Could you elaborate more about this case? Does it exist in practice?
> > > Do I assume correctly that it's about sharing a buffer between one DMA
> > > engine that is cache-coherent and another that is non-coherent, where
> > > the first one ends up having its accesses always go through some kind
> > > of a cache (CPU cache, L2/L3/... cache, etc.)?  
> >
> > Yes, exactly that. What happens in this particular use case is that one
> > device driver wrote to its internal buffer with the CPU (so some cache
> > lines were dirty) and then a device which couldn't deal with that tried
> > to access it.  
> 
> If so, shouldn't that driver surround its CPU accesses with
> begin/end_cpu_access() in the first place?
> 
> The case that I was suggesting was of a hardware block that actually
> sits behind the CPU cache and thus dirties it on writes, not the
> driver doing that. (I haven't personally encountered such a system,
> though.)
> 
> >
> > We could say that all device drivers must always look at the coherency
> > of the devices which want to access their buffers. But that would
> > horribly complicate things for maintaining the drivers because then
> > drivers would need to take into account requirements from other drivers
> > while allocating their internal buffers.  
> 
> I think it's partially why we have the allocation part of the DMA
> mapping API, but currently it's only handling requirements of one
> device. And we don't have any information from the userspace what
> other devices the buffer would be used with...
> 
> Actually, do we even have such information in the userspace today?
> Let's say I do a video call in a web browser on a typical Linux
> system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
> V4L2 camera fills in buffers with video frames and both encoder and
> display consume them. Do we have a central place which would know that
> a buffer needs to be allocated that works with the producer and all
> consumers?

I have a vague belief that many, many years ago, in the early days of
dmabuf development, there was the idea of the sequence:
- create a dmabuf handle
- share the handle with all devices that would need access
- *then* do the allocation with kernel-internal negotiation to fill all
  devices' needs, if at all possible

Obviously that didn't happen. I think today's dmabuf Wayland protocol
would support this though.

Anyway, Wayland can tell the app which DRM devices a buffer
needs to work with as a GPU texture and potentially on same/another
DRM device as a KMS framebuffer, so theoretically the app could know.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
@ 2022-12-09  9:32                                         ` Pekka Paalanen
  0 siblings, 0 replies; 110+ messages in thread
From: Pekka Paalanen @ 2022-12-09  9:32 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Christian König, Nicolas Dufresne, linaro-mm-sig, dri-devel,
	linux-media, sumit.semwal

[-- Attachment #1: Type: text/plain, Size: 7983 bytes --]

On Fri, 9 Dec 2022 17:26:06 +0900
Tomasz Figa <tfiga@chromium.org> wrote:

> On Mon, Dec 5, 2022 at 5:29 PM Christian König <christian.koenig@amd.com> wrote:
> >
> > Hi Tomasz,
> >
> > Am 05.12.22 um 07:41 schrieb Tomasz Figa:  
> > > [SNIP]  
> > >> In other words explicit ownership transfer is not something we would
> > >> want as requirement in the framework, cause otherwise we break tons of
> > >> use cases which require concurrent access to the underlying buffer.
> > >>
> > >> When a device driver needs explicit ownership transfer it's perfectly
> > >> possible to implement this using the dma_fence objects mentioned above.
> > >> E.g. drivers can already look at who is accessing a buffer currently and
> > >> can even grab explicit ownership of it by adding their own dma_fence
> > >> objects.
> > >>
> > >> The only exception is CPU based access, e.g. when something is written
> > >> with the CPU a cache flush might be necessary and when something is read
> > >> with the CPU a cache invalidation might be necessary.
> > >>  
> > > Okay, that's much clearer now, thanks for clarifying this. So we
> > > should be covered for the cache maintenance needs originating from CPU
> > > accesses already, +/- the broken cases which don't call the begin/end
> > > CPU access routines that I mentioned above.
> > >
> > > Similarly, for any ownership transfer between different DMA engines,
> > > we should be covered either by the userspace explicitly flushing the
> > > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > >
> > > But then, what's left to be solved? :) (Besides the cases of missing
> > > begin/end CPU access calls.)  
> >
> > Well there are multiple problems here:
> >
> > 1. A lot of userspace applications/frameworks assume that they can
> > allocate the buffer anywhere and it just works.
> >
> > This isn't true at all, we have tons of cases where devices can only
> > access their special memory for certain use cases.
> > Just look at scanout for displaying on a dGPU, neither AMD nor NVidia
> > supports system memory here. Similar cases exist for audio/video codecs
> > where intermediate memory is only accessible by certain devices because
> > of content protection.  
> 
> Ack.
> 
> Although I think the most common case on mainstream Linux today is
> properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> GPU) and hoping that other devices would accept the buffers just fine,
> which isn't a given on most platforms (although often it's just about
> pixel format, width/height/stride alignment, tiling, etc. rather than
> the memory itself). That's why ChromiumOS has minigbm and Android has
> gralloc that act as the central point of knowledge on buffer
> allocation.

Hi,

as an anecdote, when I was improving Mutter's cross-DRM-device handling
(for DisplayLink uses) a few years ago, I implemented several different
approaches of where to allocate, to try until going for the slowest but
guaranteed to work case of copying every update into and out of sysram.

It seems there are two different approaches in general for allocation
and sharing:

1. Try different things until it works or you run out of options

pro:
- no need for a single software component to know everything about
  every device in the system

con:
- who bothers with fallbacks, if the first try works on my system for
  my use case I test with? I.e. cost of code in users.
- trial-and-error can be very laborious (allocate, share with all
  devices, populate, test)
- the search space might be huge


2. Have a central component that knows what to do

pro:
- It might work on the first attempt, so no fallbacks in users.
- It might be optimal.

con:
- You need a software component that knows everything about every
  single combination of hardware in existence, multiplied by use cases.


Neither seems good, which brings us back to https://github.com/cubanismo/allocator .


> > 2. We don't properly communicate allocation requirements to userspace.
> >
> > E.g. even if you allocate from DMA-Heaps userspace can currently only
> > guess if normal, CMA or even device specific memory is needed.  
> 
> DMA-buf heaps actually make it even more difficult for the userspace,
> because now it needs to pick the right heap. With allocation built
> into the specific UAPI (like V4L2), it's at least possible to allocate
> for one specific device without having any knowledge about allocation
> constraints in the userspace.
> 
> >
> > 3. We seem to lack some essential parts of those restrictions in the
> > documentation.
> >  
> 
> Ack.
> 
> > >>>> So if a device driver uses cached system memory on an architecture which
> > >>>> devices which can't access it the right approach is clearly to reject
> > >>>> the access.  
> > >>> I'd like to accent the fact that "requires cache maintenance" != "can't access".  
> > >> Well that depends. As said above the exporter exports the buffer as it
> > >> was allocated.
> > >>
> > >> If that means the the exporter provides a piece of memory which requires
> > >> CPU cache snooping to access correctly then the best thing we can do is
> > >> to prevent an importer which can't do this from attaching.  
> > > Could you elaborate more about this case? Does it exist in practice?
> > > Do I assume correctly that it's about sharing a buffer between one DMA
> > > engine that is cache-coherent and another that is non-coherent, where
> > > the first one ends up having its accesses always go through some kind
> > > of a cache (CPU cache, L2/L3/... cache, etc.)?  
> >
> > Yes, exactly that. What happens in this particular use case is that one
> > device driver wrote to it's internal buffer with the CPU (so some cache
> > lines where dirty) and then a device which couldn't deal with that tried
> > to access it.  
> 
> If so, shouldn't that driver surround its CPU accesses with
> begin/end_cpu_access() in the first place?
> 
> The case that I was suggesting was of a hardware block that actually
> sits behind the CPU cache and thus dirties it on writes, not the
> driver doing that. (I haven't personally encountered such a system,
> though.)
> 
> >
> > We could say that all device drivers must always look at the coherency
> > of the devices which want to access their buffers. But that would
> > horrible complicate things for maintaining the drivers because then
> > drivers would need to take into account requirements from other drivers
> > while allocating their internal buffers.  
> 
> I think it's partially why we have the allocation part of the DMA
> mapping API, but currently it's only handling requirements of one
> device. And we don't have any information from the userspace what
> other devices the buffer would be used with...
> 
> Actually, do we even have such information in the userspace today?
> Let's say I do a video call in a web browser on a typical Linux
> system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
> V4L2 camera fills in buffers with video frames and both encoder and
> display consume them. Do we have a central place which would know that
> a buffer needs to be allocated that works with the producer and all
> consumers?

I have a vague belief that many, many years ago, in the early days of
dmabuf development, there was the idea of the sequence:
- create a dmabuf handle
- share the handle with all devices that would need access
- *then* do the allocation with kernel-internal negotiation to fill all
  devices' needs, if at all possible

Obviously that didn't happen. I think today's dmabuf Wayland protocol
would support this though.

Anyway, Wayland can tell the app which DRM devices a buffer
needs to work with as a GPU texture and potentially on same/another
DRM device as a KMS framebuffer, so theoretically the app could know.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-09  8:26                                       ` Tomasz Figa
@ 2022-12-09 10:27                                         ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-12-09 10:27 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Lucas Stach, Nicolas Dufresne, Daniel Stone, ppaalanen,
	sumit.semwal, daniel, robdclark, dri-devel, linaro-mm-sig,
	linux-media

On 09.12.22 at 09:26, Tomasz Figa wrote:
> [SNIP]
> Although I think the most common case on mainstream Linux today is
> properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> GPU) and hoping that other devices would accept the buffers just fine,
> which isn't a given on most platforms (although often it's just about
> pixel format, width/height/stride alignment, tiling, etc. rather than
> the memory itself). That's why ChromiumOS has minigbm and Android has
> gralloc that act as the central point of knowledge on buffer
> allocation.

Yeah, completely agree. The crux is really that we need to improve those 
user space allocators by providing more information from the kernel.

>> 2. We don't properly communicate allocation requirements to userspace.
>>
>> E.g. even if you allocate from DMA-Heaps userspace can currently only
>> guess if normal, CMA or even device specific memory is needed.
> DMA-buf heaps actually make it even more difficult for the userspace,
> because now it needs to pick the right heap. With allocation built
> into the specific UAPI (like V4L2), it's at least possible to allocate
> for one specific device without having any knowledge about allocation
> constraints in the userspace.

Yes, same as what Daniel said as well. We need to provide some more 
hints from the kernel about which allocator to use.

>>>>>> So if a device driver uses cached system memory on an architecture which
>>>>>> devices which can't access it the right approach is clearly to reject
>>>>>> the access.
>>>>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
>>>> Well that depends. As said above the exporter exports the buffer as it
>>>> was allocated.
>>>>
>>>> If that means the the exporter provides a piece of memory which requires
>>>> CPU cache snooping to access correctly then the best thing we can do is
>>>> to prevent an importer which can't do this from attaching.
>>> Could you elaborate more about this case? Does it exist in practice?
>>> Do I assume correctly that it's about sharing a buffer between one DMA
>>> engine that is cache-coherent and another that is non-coherent, where
>>> the first one ends up having its accesses always go through some kind
>>> of a cache (CPU cache, L2/L3/... cache, etc.)?
>> Yes, exactly that. What happens in this particular use case is that one
>> device driver wrote to it's internal buffer with the CPU (so some cache
>> lines where dirty) and then a device which couldn't deal with that tried
>> to access it.
> If so, shouldn't that driver surround its CPU accesses with
> begin/end_cpu_access() in the first place?

The problem is that the roles are reversed. The callbacks let the 
exporter know that it needs to flush the caches when the importer is 
done accessing the buffer with the CPU.

But here the exporter is the one accessing the buffer with the CPU and 
the importer then accesses stale data because it doesn't snoop the caches.

What we could do is to force all exporters to use begin/end_cpu_access() 
even on their own buffers and look at all the importers when the access 
is completed. But that would increase the complexity of the handling in 
the exporter.
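
For illustration, a minimal sketch of what such exporter-side
bracketing could look like, assuming a hypothetical exporter which
already has a kernel mapping of its own buffer (only
dma_buf_begin_cpu_access()/dma_buf_end_cpu_access() are the existing
entry points here, everything else is made up):

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/string.h>

/* Hypothetical helper: the exporter brackets its own CPU write with
 * the generic DMA-buf CPU access calls, so that cache maintenance for
 * non-snooping importers could be done in its end_cpu_access(). */
static int example_exporter_fill(struct dma_buf *dmabuf, void *kvaddr,
				 const void *src, size_t size)
{
	int ret;

	ret = dma_buf_begin_cpu_access(dmabuf, DMA_TO_DEVICE);
	if (ret)
		return ret;

	/* The CPU write which dirties cache lines. */
	memcpy(kvaddr, src, size);

	/* The point where the caches would have to be flushed for
	 * importers which can't snoop them. */
	return dma_buf_end_cpu_access(dmabuf, DMA_TO_DEVICE);
}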

In other words, we would then have code in the exporters which is only 
written for handling the constraints of the importers. This has a wide 
variety of consequences, especially that this functionality of the 
exporter can't be tested without a proper importer.

I was also thinking about reversing the role of exporter and importer in 
the kernel, but came to the conclusion that doing this under the hood 
without userspace knowing about it is probably not going to work either.

> The case that I was suggesting was of a hardware block that actually
> sits behind the CPU cache and thus dirties it on writes, not the
> driver doing that. (I haven't personally encountered such a system,
> though.)

Never heard of anything like that either, but who knows.

>> We could say that all device drivers must always look at the coherency
>> of the devices which want to access their buffers. But that would
>> horrible complicate things for maintaining the drivers because then
>> drivers would need to take into account requirements from other drivers
>> while allocating their internal buffers.
> I think it's partially why we have the allocation part of the DMA
> mapping API, but currently it's only handling requirements of one
> device. And we don't have any information from the userspace what
> other devices the buffer would be used with...

Exactly that, yes.

> Actually, do we even have such information in the userspace today?
> Let's say I do a video call in a web browser on a typical Linux
> system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
> V4L2 camera fills in buffers with video frames and both encoder and
> display consume them. Do we have a central place which would know that
> a buffer needs to be allocated that works with the producer and all
> consumers?

Both X11 and Wayland have protocols which can be used to display a 
certain DMA-buf handle; their feedback messages contain information 
about how ideal your configuration is, e.g. whether the DMA-buf handle 
could be used directly or an extra copy was needed etc...

Something similar exists between VAAPI and V4L2 as far as I know, but 
as you noted here as well, it's indeed more about negotiating pixel 
format, stride, padding, alignment etc...

The best we can do is to reject combinations which won't work in the 
kernel and then userspace could react accordingly.
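
As a rough sketch of what such a rejection could look like at attach
time (illustrative only, not the actual patches; dev_is_dma_coherent()
is the existing helper from <linux/dma-map-ops.h>, the exporter code
around it is made up and assumes the exported memory needs CPU cache
snooping to be accessed correctly):

#include <linux/dma-buf.h>
#include <linux/dma-map-ops.h>
#include <linux/errno.h>

/* Illustrative attach callback: refuse importers which can't snoop
 * the CPU cache, so userspace can react and pick another path. */
static int example_attach(struct dma_buf *dmabuf,
			  struct dma_buf_attachment *attach)
{
	if (!dev_is_dma_coherent(attach->dev))
		return -EINVAL;

	return 0;
}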

>> That's essentially the reason why we have DMA-buf heaps. Those heaps
>> expose system memory buffers with certain properties (size, CMA, DMA bit
>> restrictions etc...) and also implement the callback functions for CPU
>> cache maintenance.
>>
> How do DMA-buf heaps change anything here? We already have CPU cache
> maintenance callbacks in the DMA-buf API - the begin/end_cpu_access()
> for CPU accesses and dmabuf_map/unmap_attachment() for device accesses
> (which arguably shouldn't actually do CPU cache maintenance, unless
> that's implied by how either of the involved DMA engines work).

DMA-buf heaps are the neutral man in the middle.

The implementation keeps track of all the attached importers and should 
make sure that the allocated memory fits the needs of everyone. In 
addition to that, calls to the DMA API cache management functions are 
inserted whenever CPU access begins/ends.

That's the best we can do for system memory sharing; only 
device-specific memory can't be allocated like this.
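
For reference, allocating such a neutral system memory buffer from
userspace goes through the DMA-buf heap character devices; a minimal
sketch, assuming the stock "system" heap is present and with error
handling trimmed:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Allocate len bytes from the system DMA-buf heap and return the
 * DMA-buf fd (or -1 on error).  The fd can then be handed to any
 * importer (DRM, V4L2, ...) which accepts DMA-bufs. */
static int alloc_from_system_heap(size_t len)
{
	struct dma_heap_allocation_data data;
	int heap_fd, ret;

	heap_fd = open("/dev/dma_heap/system", O_RDONLY | O_CLOEXEC);
	if (heap_fd < 0)
		return -1;

	memset(&data, 0, sizeof(data));
	data.len = len;
	data.fd_flags = O_RDWR | O_CLOEXEC;

	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
	close(heap_fd);

	return ret < 0 ? -1 : (int)data.fd;
}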

>>>> We do have the system and CMA dma-buf heap for cases where a device
>>>> driver which wants to access the buffer with caches enabled. So this is
>>>> not a limitation in functionality, it's just a matter of correctly using it.
>>>>
>>> V4L2 also has the ability to allocate buffers and map them with caches enabled.
>> Yeah, when that's a requirement for the V4L2 device it also makes
>> perfect sense.
>>
>> It's just that we shouldn't force any specific allocation behavior on a
>> device driver because of the requirements of a different device.
>>
>> Giving an example a V4L2 device shouldn't be forced to use
>> videobuf2-dma-contig because some other device needs CMA. Instead we
>> should use the common DMA-buf heaps which implement this as neutral
>> supplier of system memory buffers.
> Agreed, but that's only possible if we have a central entity that
> understands what devices the requested buffer would be used with. My
> understanding is that we're nowhere close to that with mainstream
> Linux today.
>
> // As a side note, videobuf2-dma-contig can actually allocate
> discontiguous memory, if the device is behind an IOMMU. Actually with
> the recent DMA API improvements, we could probably coalesce
> vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.

That would probably make life a little bit simpler, yes.

>>>> The problem is that in this particular case the exporter provides the
>>>> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
>>>> with that.
>>> Why does the exporter leave the buffer with dirty CPU caches?
>> Because exporters always export the buffers as they would use it. And in
>> this case that means writing with the CPU to it.
>>
> Sorry for the question not being very clear. I meant: How do the CPU
> caches get dirty in that case?

The exporter wrote to it. As far as I understand, the exporter just 
copies things from A to B with memcpy() to construct the buffer content.

> [SNIP]
>> Yes, totally agree. The problem is really that we moved bunch of MM and
>> DMA functions in one API.
>>
>> The bounce buffers are something we should really have in a separate
>> component.
>>
>> Then the functionality of allocating system memory for a specific device
>> or devices should be something provided by the MM.
>>
>> And finally making this memory or any other CPU address accessible to a
>> device (IOMMU programming etc..) should then be part of an DMA API.
>>
> Remember that actually making the memory accessible to a device often
> needs to be handled already as a part of the allocation (e.g. dma_mask
> in the non-IOMMU case). So I tend to think that the current division
> of responsibilities is mostly fine - the dma_alloc family should be
> seen as a part of MM already, especially with all the recent
> improvements from Christoph, like dma_alloc_pages().

Yes, that's indeed a very interesting development which, as far as I 
can see, goes in the right direction.

> That said, it may indeed make sense to move things like IOMMU mapping
> management out of the dma_alloc() and just reduce those functions to
> simply returning a set of pages that satisfy the allocation
> constraints. One would need to call dma_map() after the allocation,
> but that's actually a fair expectation. Sometimes it might be
> preferred to pre-allocate the memory, but only map it into the device
> address space when it's really necessary.

What I'm still missing is the functionality to allocate pages for 
multiple devices and proper error codes when dma_map() can't make the 
page accessible to a device.
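
To make that concrete, the split allocate-then-map flow with today's
API looks roughly like this when a second device is involved (a sketch
only, not from a real driver; note that dma_mapping_error() only
reports that the mapping failed, not why):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Allocate pages for dev_a, then try to make the same pages visible
 * to dev_b.  The second step can currently only fail with a generic
 * mapping error; unmapping/cleanup on the success path is omitted. */
static int example_share_pages(struct device *dev_a, struct device *dev_b,
			       size_t size)
{
	dma_addr_t addr_a, addr_b;
	struct page *page;

	page = dma_alloc_pages(dev_a, size, &addr_a, DMA_BIDIRECTIONAL,
			       GFP_KERNEL);
	if (!page)
		return -ENOMEM;

	addr_b = dma_map_page(dev_b, page, 0, size, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev_b, addr_b)) {
		dma_free_pages(dev_a, size, page, addr_a, DMA_BIDIRECTIONAL);
		return -EIO;	/* no information about the actual cause */
	}

	/* ... hand (page, addr_a, addr_b) to the two devices ... */
	return 0;
}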

Regards,
Christian.

>>>>>>> It's a use-case that is working fine today with many devices (e.g. network
>>>>>>> adapters) in the ARM world, exactly because the architecture specific
>>>>>>> implementation of the DMA API inserts the cache maintenance operations
>>>>>>> on buffer ownership transfer.
>>>>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
>>>>>> design totally breaks GPUs on Xen DOM0 for example.
>>>>>>
>>>>>> And Xen is just one example, I can certainly say from experience that
>>>>>> this design was a really really bad idea because it favors just one use
>>>>>> case while making other use cases practically impossible if not really
>>>>>> hard to implement.
>>>>> Sorry, I haven't worked with Xen. Could you elaborate what's the
>>>>> problem that this introduces for it?
>>>> That's a bit longer topic. The AMD XEN devs are already working on this
>>>> as far as I know. I can ping internally how far they got with sending
>>>> the patches out to avoid this problem.
>>> Hmm, I see. It might be a good data point to understand in which
>>> direction we should be going, but I guess we can wait until they send
>>> some patches.
>> There was just recently a longer thread on the amd-gfx mailing list
>> about that. I think looking in there as well might be beneficial.
> Okay, let me check. Thanks,
>
> Best regards,
> Tomasz


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-09  9:32                                         ` Pekka Paalanen
@ 2022-12-09 17:07                                           ` Alex Deucher
  -1 siblings, 0 replies; 110+ messages in thread
From: Alex Deucher @ 2022-12-09 17:07 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: dri-devel, Nicolas Dufresne, linaro-mm-sig, Tomasz Figa,
	sumit.semwal, Christian König, linux-media

On Fri, Dec 9, 2022 at 4:32 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Fri, 9 Dec 2022 17:26:06 +0900
> Tomasz Figa <tfiga@chromium.org> wrote:
>
> > On Mon, Dec 5, 2022 at 5:29 PM Christian König <christian.koenig@amd.com> wrote:
> > >
> > > Hi Tomasz,
> > >
> > > Am 05.12.22 um 07:41 schrieb Tomasz Figa:
> > > > [SNIP]
> > > >> In other words explicit ownership transfer is not something we would
> > > >> want as requirement in the framework, cause otherwise we break tons of
> > > >> use cases which require concurrent access to the underlying buffer.
> > > >>
> > > >> When a device driver needs explicit ownership transfer it's perfectly
> > > >> possible to implement this using the dma_fence objects mentioned above.
> > > >> E.g. drivers can already look at who is accessing a buffer currently and
> > > >> can even grab explicit ownership of it by adding their own dma_fence
> > > >> objects.
> > > >>
> > > >> The only exception is CPU based access, e.g. when something is written
> > > >> with the CPU a cache flush might be necessary and when something is read
> > > >> with the CPU a cache invalidation might be necessary.
> > > >>
> > > > Okay, that's much clearer now, thanks for clarifying this. So we
> > > > should be covered for the cache maintenance needs originating from CPU
> > > > accesses already, +/- the broken cases which don't call the begin/end
> > > > CPU access routines that I mentioned above.
> > > >
> > > > Similarly, for any ownership transfer between different DMA engines,
> > > > we should be covered either by the userspace explicitly flushing the
> > > > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > > >
> > > > But then, what's left to be solved? :) (Besides the cases of missing
> > > > begin/end CPU access calls.)
> > >
> > > Well there are multiple problems here:
> > >
> > > 1. A lot of userspace applications/frameworks assume that it can
> > > allocate the buffer anywhere and it just works.
> > >
> > > This isn't true at all, we have tons of cases where device can only
> > > access their special memory for certain use cases.
> > > Just look at scanout for displaying on dGPU, neither AMD nor NVidia
> > > supports system memory here. Similar cases exists for audio/video codecs
> > > where intermediate memory is only accessible by certain devices because
> > > of content protection.
> >
> > Ack.
> >
> > Although I think the most common case on mainstream Linux today is
> > properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> > GPU) and hoping that other devices would accept the buffers just fine,
> > which isn't a given on most platforms (although often it's just about
> > pixel format, width/height/stride alignment, tiling, etc. rather than
> > the memory itself). That's why ChromiumOS has minigbm and Android has
> > gralloc that act as the central point of knowledge on buffer
> > allocation.
>
> Hi,
>
> as an anecdote, when I was improving Mutter's cross-DRM-device handling
> (for DisplayLink uses) a few years ago, I implemented several different
> approaches of where to allocate, to try until going for the slowest but
> guaranteed to work case of copying every update into and out of sysram.
>
> It seems there are two different approaches in general for allocation
> and sharing:
>
> 1. Try different things until it works or you run out of options
>
> pro:
> - no need for a single software component to know everything about
>   every device in the system
>
> con:
> - who bothers with fallbacks, if the first try works on my system for
>   my use case I test with? I.e. cost of code in users.
> - trial-and-error can be very laborious (allocate, share with all
>   devices, populate, test)
> - the search space might be huge
>

Even that is fraught with difficulty.  We had a ton of bug reports
over the years claiming amdgpu was broken when users tried to use
DisplayLink devices in combination with AMD dGPUs because the
performance was so slow.  The problem was that rather than using
dma-buf, the compositor was just mmapping the dGPU BAR, which happens
to be uncached write-combined and across the PCI bus, and copying the
data to the DisplayLink device.  Read access to a PCI BAR with the CPU
is on the order of tens of MB per second.

Alex

>
> 2. Have a central component that knows what to do
>
> pro:
> - It might work on the first attempt, so no fallbacks in users.
> - It might be optimal.
>
> con:
> - You need a software component that knows everything about every
>   single combination of hardware in existence, multiplied by use cases.
>
>
> Neither seems good, which brings us back to https://github.com/cubanismo/allocator .
>
>
> > > 2. We don't properly communicate allocation requirements to userspace.
> > >
> > > E.g. even if you allocate from DMA-Heaps userspace can currently only
> > > guess if normal, CMA or even device specific memory is needed.
> >
> > DMA-buf heaps actually make it even more difficult for the userspace,
> > because now it needs to pick the right heap. With allocation built
> > into the specific UAPI (like V4L2), it's at least possible to allocate
> > for one specific device without having any knowledge about allocation
> > constraints in the userspace.
> >
> > >
> > > 3. We seem to lack some essential parts of those restrictions in the
> > > documentation.
> > >
> >
> > Ack.
> >
> > > >>>> So if a device driver uses cached system memory on an architecture which
> > > >>>> devices which can't access it the right approach is clearly to reject
> > > >>>> the access.
> > > >>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
> > > >> Well that depends. As said above the exporter exports the buffer as it
> > > >> was allocated.
> > > >>
> > > >> If that means the the exporter provides a piece of memory which requires
> > > >> CPU cache snooping to access correctly then the best thing we can do is
> > > >> to prevent an importer which can't do this from attaching.
> > > > Could you elaborate more about this case? Does it exist in practice?
> > > > Do I assume correctly that it's about sharing a buffer between one DMA
> > > > engine that is cache-coherent and another that is non-coherent, where
> > > > the first one ends up having its accesses always go through some kind
> > > > of a cache (CPU cache, L2/L3/... cache, etc.)?
> > >
> > > Yes, exactly that. What happens in this particular use case is that one
> > > device driver wrote to it's internal buffer with the CPU (so some cache
> > > lines where dirty) and then a device which couldn't deal with that tried
> > > to access it.
> >
> > If so, shouldn't that driver surround its CPU accesses with
> > begin/end_cpu_access() in the first place?
> >
> > The case that I was suggesting was of a hardware block that actually
> > sits behind the CPU cache and thus dirties it on writes, not the
> > driver doing that. (I haven't personally encountered such a system,
> > though.)
> >
> > >
> > > We could say that all device drivers must always look at the coherency
> > > of the devices which want to access their buffers. But that would
> > > horrible complicate things for maintaining the drivers because then
> > > drivers would need to take into account requirements from other drivers
> > > while allocating their internal buffers.
> >
> > I think it's partially why we have the allocation part of the DMA
> > mapping API, but currently it's only handling requirements of one
> > device. And we don't have any information from the userspace what
> > other devices the buffer would be used with...
> >
> > Actually, do we even have such information in the userspace today?
> > Let's say I do a video call in a web browser on a typical Linux
> > system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
> > V4L2 camera fills in buffers with video frames and both encoder and
> > display consume them. Do we have a central place which would know that
> > a buffer needs to be allocated that works with the producer and all
> > consumers?
>
> I have a vague belief that many, many years ago, in the early days of
> dmabuf development, there was the idea of the sequence:
> - create a dmabuf handle
> - share the handle with all devices that would need access
> - *then* do the allocation with kernel-internal negotiation to fill all
>   devices' needs, if at all possible
>
> Obviously that didn't happen. I think today's dmabuf Wayland protocol
> would support this though.
>
> Anyway, Wayland can tell the app which DRM devices a buffer
> needs to work with as a GPU texture and potentially on same/another
> DRM device as a KMS framebuffer, so theoretically the app could know.
>
>
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-09 10:27                                         ` Christian König
@ 2022-12-12  3:00                                           ` Tomasz Figa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tomasz Figa @ 2022-12-12  3:00 UTC (permalink / raw)
  To: Christian König
  Cc: Lucas Stach, Nicolas Dufresne, Daniel Stone, ppaalanen,
	sumit.semwal, daniel, robdclark, dri-devel, linaro-mm-sig,
	linux-media

On Fri, Dec 9, 2022 at 7:27 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 09.12.22 um 09:26 schrieb Tomasz Figa:
[snip]
> Yes, same what Daniel said as well. We need to provide some more hints
> which allocator to use from the kernel.
>
> >>>>>> So if a device driver uses cached system memory on an architecture which
> >>>>>> devices which can't access it the right approach is clearly to reject
> >>>>>> the access.
> >>>>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
> >>>> Well that depends. As said above the exporter exports the buffer as it
> >>>> was allocated.
> >>>>
> >>>> If that means the the exporter provides a piece of memory which requires
> >>>> CPU cache snooping to access correctly then the best thing we can do is
> >>>> to prevent an importer which can't do this from attaching.
> >>> Could you elaborate more about this case? Does it exist in practice?
> >>> Do I assume correctly that it's about sharing a buffer between one DMA
> >>> engine that is cache-coherent and another that is non-coherent, where
> >>> the first one ends up having its accesses always go through some kind
> >>> of a cache (CPU cache, L2/L3/... cache, etc.)?
> >> Yes, exactly that. What happens in this particular use case is that one
> >> device driver wrote to it's internal buffer with the CPU (so some cache
> >> lines where dirty) and then a device which couldn't deal with that tried
> >> to access it.
> > If so, shouldn't that driver surround its CPU accesses with
> > begin/end_cpu_access() in the first place?
>
> The problem is that the roles are reversed. The callbacks let the
> exporter knows that it needs to flush the caches when the importer is
> done accessing the buffer with the CPU.
>
> But here the exporter is the one accessing the buffer with the CPU and
> the importer then accesses stale data because it doesn't snoop the caches.
>
> What we could do is to force all exporters to use begin/end_cpu_access()
> even on it's own buffers and look at all the importers when the access
> is completed. But that would increase the complexity of the handling in
> the exporter.

I feel like they should be doing so anyway, because it often depends
on the SoC integration whether the DMA can do cache snooping or not.

Although arguably, there is a corner case today where if one uses
dma_alloc_coherent() to get a buffer with a coherent CPU mapping for
device X that is declared as cache-coherent, one also expects not to
need to call begin/end_cpu_access(), but those would be needed if the
buffer was to be imported by device Y that is not cache-coherent...

Sounds like it's a mess after all. I guess your patches move it one
step closer to something sensible; import would fail in such cases.
Although arguably, we should still be able to export from driver Y and
import into driver X just fine if Y allocated the buffer as coherent -
otherwise we would break existing users for whom things worked fine.
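
For reference, the corner case above is roughly this pattern (a sketch
only; dev_x is assumed to be cache-coherent and the non-coherent
importer only appears in the comment):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/string.h>

/* dev_x is cache-coherent, so the driver can freely write to vaddr
 * with the CPU without explicit cache maintenance as far as dev_x is
 * concerned. */
static void *example_alloc_for_coherent_dev(struct device *dev_x,
					    size_t size,
					    dma_addr_t *dma_addr)
{
	void *vaddr;

	vaddr = dma_alloc_coherent(dev_x, size, dma_addr, GFP_KERNEL);
	if (!vaddr)
		return NULL;

	/* Fine for dev_x.  But if this memory is later exported as a
	 * DMA-buf and imported by a non-coherent dev_y, dev_y may see
	 * stale data unless begin/end_cpu_access() style maintenance
	 * is added somewhere. */
	memset(vaddr, 0, size);

	return vaddr;
}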

>
> In other words we would then have code in the exporters which is only
> written for handling the constrains of the importers. This has a wide
> variety of consequences, especially that this functionality of the
> exporter can't be tested without a proper importer.
>
> I was also thinking about reversing the role of exporter and importer in
> the kernel, but came to the conclusion that doing this under the hood
> without userspace knowing about it is probably not going to work either.
>
> > The case that I was suggesting was of a hardware block that actually
> > sits behind the CPU cache and thus dirties it on writes, not the
> > driver doing that. (I haven't personally encountered such a system,
> > though.)
>
> Never heard of anything like that either, but who knows.
>
> >> We could say that all device drivers must always look at the coherency
> >> of the devices which want to access their buffers. But that would
> >> horrible complicate things for maintaining the drivers because then
> >> drivers would need to take into account requirements from other drivers
> >> while allocating their internal buffers.
> > I think it's partially why we have the allocation part of the DMA
> > mapping API, but currently it's only handling requirements of one
> > device. And we don't have any information from the userspace what
> > other devices the buffer would be used with...
>
> Exactly that, yes.
>
> > Actually, do we even have such information in the userspace today?
> > Let's say I do a video call in a web browser on a typical Linux
> > system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
> > V4L2 camera fills in buffers with video frames and both encoder and
> > display consume them. Do we have a central place which would know that
> > a buffer needs to be allocated that works with the producer and all
> > consumers?
>
> Both X11 and Wayland have protocols which can be used to display a
> certain DMA-buf handle, their feedback packages contain information how
> ideal your configuration is, e.g. if the DMA-buf handle could be used or
> if an extra copy was needed etc...
>
> Similar exists between VAAPI and V4L2 as far as I know, but as you noted
> as well here it's indeed more about negotiating pixel format, stride,
> padding, alignment etc...
>
> The best we can do is to reject combinations which won't work in the
> kernel and then userspace could react accordingly.
>

The question is whether userspace is able to deal with it, given the
limited amount of information it gets from the kernel. Sure, there is
always the ultimate fallback of memcpy(), but in many cases that would
be totally unusable due to the severe performance impact. If we were to
remove the existing extent of implicit handling from the kernel, I
think we first need to give userspace the information necessary to
explicitly handle the fallback to the same extent.

We also need to think about backwards compatibility. Simply removing
the implicit fallback cases would probably break a lot of userspace,
so an opt-in behavior is likely needed initially...

> >> That's essentially the reason why we have DMA-buf heaps. Those heaps
> >> expose system memory buffers with certain properties (size, CMA, DMA bit
> >> restrictions etc...) and also implement the callback functions for CPU
> >> cache maintenance.
> >>
> > How do DMA-buf heaps change anything here? We already have CPU cache
> > maintenance callbacks in the DMA-buf API - the begin/end_cpu_access()
> > for CPU accesses and dmabuf_map/unmap_attachment() for device accesses
> > (which arguably shouldn't actually do CPU cache maintenance, unless
> > that's implied by how either of the involved DMA engines work).
>
> DMA-buf heaps are the neutral man in the middle.
>
> The implementation keeps track of all the attached importers and should
> make sure that the allocated memory fits the need of everyone.
> Additional to that calls to the cache DMA-api cache management functions
> are inserted whenever CPU access begins/ends.
>

I think in the current design, it only knows all the importers after
the buffer is already allocated, so it doesn't necessarily have a way
to handle the allocation constraints. Something would have to be done
to get all the importers attached before the allocation actually takes
place.
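
One thing the DMA-buf interface allows in principle is deferring the
actual backing storage allocation until the first map_dma_buf() call,
at which point the importers attached so far are known.  A rough
sketch of that pattern (hypothetical exporter; the constraint handling
is reduced to a single DMA mask intersection and
example_alloc_backing_storage() is a made-up helper):

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

struct example_buffer {
	struct sg_table *sgt;	/* backing storage, allocated lazily */
	u64 dma_mask;		/* intersection of all importers' masks */
};

/* Made-up helper: would allocate pages satisfying dma_mask. */
static struct sg_table *example_alloc_backing_storage(u64 dma_mask);

/* Record the importer's constraints; nothing is allocated yet. */
static int example_deferred_attach(struct dma_buf *dmabuf,
				   struct dma_buf_attachment *attach)
{
	struct example_buffer *buf = dmabuf->priv;

	buf->dma_mask &= dma_get_mask(attach->dev);
	return 0;
}

/* The first mapping triggers the allocation, now that the constraints
 * of every importer attached so far are known.  Locking, the actual
 * dma_map_sgtable() for attach->dev and the case of a later, more
 * restrictive attach are all omitted. */
static struct sg_table *example_deferred_map(struct dma_buf_attachment *attach,
					     enum dma_data_direction dir)
{
	struct example_buffer *buf = attach->dmabuf->priv;

	if (!buf->sgt)
		buf->sgt = example_alloc_backing_storage(buf->dma_mask);
	if (!buf->sgt)
		return ERR_PTR(-ENOMEM);

	return buf->sgt;
}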

> That's the best we can do for system memory sharing, only device
> specific memory can't be allocated like this.
>
> >>>> We do have the system and CMA dma-buf heap for cases where a device
> >>>> driver which wants to access the buffer with caches enabled. So this is
> >>>> not a limitation in functionality, it's just a matter of correctly using it.
> >>>>
> >>> V4L2 also has the ability to allocate buffers and map them with caches enabled.
> >> Yeah, when that's a requirement for the V4L2 device it also makes
> >> perfect sense.
> >>
> >> It's just that we shouldn't force any specific allocation behavior on a
> >> device driver because of the requirements of a different device.
> >>
> >> Giving an example a V4L2 device shouldn't be forced to use
> >> videobuf2-dma-contig because some other device needs CMA. Instead we
> >> should use the common DMA-buf heaps which implement this as neutral
> >> supplier of system memory buffers.
> > Agreed, but that's only possible if we have a central entity that
> > understands what devices the requested buffer would be used with. My
> > understanding is that we're nowhere close to that with mainstream
> > Linux today.
> >
> > // As a side note, videobuf2-dma-contig can actually allocate
> > discontiguous memory, if the device is behind an IOMMU. Actually with
> > the recent DMA API improvements, we could probably coalesce
> > vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.
>
> That would probably make live a little bit simpler, yes.
>
> >>>> The problem is that in this particular case the exporter provides the
> >>>> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
> >>>> with that.
> >>> Why does the exporter leave the buffer with dirty CPU caches?
> >> Because exporters always export the buffers as they would use them. And in
> >> this case that means writing with the CPU to it.
> >>
> > Sorry for the question not being very clear. I meant: How do the CPU
> > caches get dirty in that case?
>
> The exporter wrote to it. As far as I understand the exporter just
> copies things from A to B with memcpy to construct the buffer content.
>

Okay, so it's just due to CPU access and basically what we touched a
few paragraphs above.
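
(As an aside, this is exactly the kind of CPU access that
begin/end_cpu_access() bracketing is meant to cover. A minimal in-kernel
sketch, with locking left out and the helper name made up:)

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/iosys-map.h>

/* Sketch only: write a DMA-buf with the CPU, with the access bracketed so
 * that any required cache maintenance can be inserted by the exporter. */
static int cpu_fill_dmabuf(struct dma_buf *dmabuf, const void *src, size_t len)
{
        struct iosys_map map;
        int ret;

        ret = dma_buf_begin_cpu_access(dmabuf, DMA_TO_DEVICE);
        if (ret)
                return ret;

        ret = dma_buf_vmap(dmabuf, &map);
        if (!ret) {
                iosys_map_memcpy_to(&map, 0, src, len);
                dma_buf_vunmap(dmabuf, &map);
        }

        dma_buf_end_cpu_access(dmabuf, DMA_TO_DEVICE);
        return ret;
}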

> > [SNIP]
> >> Yes, totally agree. The problem is really that we moved a bunch of MM and
> >> DMA functions into one API.
> >>
> >> The bounce buffers are something we should really have in a separate
> >> component.
> >>
> >> Then the functionality of allocating system memory for a specific device
> >> or devices should be something provided by the MM.
> >>
> >> And finally making this memory or any other CPU address accessible to a
> >> device (IOMMU programming etc.) should then be part of a DMA API.
> >>
> > Remember that actually making the memory accessible to a device often
> > needs to be handled already as a part of the allocation (e.g. dma_mask
> > in the non-IOMMU case). So I tend to think that the current division
> > of responsibilities is mostly fine - the dma_alloc family should be
> > seen as a part of MM already, especially with all the recent
> > improvements from Christoph, like dma_alloc_pages().
>
> Yes, that's indeed a very interesting development which as far as I can
> see goes into the right direction.
>
> > That said, it may indeed make sense to move things like IOMMU mapping
> > management out of the dma_alloc() and just reduce those functions to
> > simply returning a set of pages that satisfy the allocation
> > constraints. One would need to call dma_map() after the allocation,
> > but that's actually a fair expectation. Sometimes it might be
> > preferred to pre-allocate the memory, but only map it into the device
> > address space when it's really necessary.
>
> What I'm still missing is the functionality to allocate pages for
> multiple devices and proper error codes when dma_map() can't make the
> page accessible to a device.

Agreed. Although again, I think the more challenging part would be to
get the complete list of devices involved before the allocation
happens.
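
(For illustration, the "allocate once, map per device" flow with today's
API could look like the sketch below; it also shows the gap mentioned
above, since the allocation is still tied to a single device and
dma_mapping_error() is all the feedback the second device gets. The names
are made up, this is not existing driver code.)

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Sketch only: allocate pages against dev_x, then additionally map them for
 * dev_y. dev_y's constraints are not considered at allocation time. */
static int share_pages_between(struct device *dev_x, struct device *dev_y,
                               size_t size)
{
        dma_addr_t addr_x, addr_y;
        struct page *page;

        page = dma_alloc_pages(dev_x, size, &addr_x, DMA_BIDIRECTIONAL,
                               GFP_KERNEL);
        if (!page)
                return -ENOMEM;

        addr_y = dma_map_page(dev_y, page, 0, size, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev_y, addr_y)) {
                dma_free_pages(dev_x, size, page, addr_x, DMA_BIDIRECTIONAL);
                return -ENXIO;
        }

        /* ... use the buffer, then unmap and free it again ... */
        return 0;
}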

Best regards,
Tomasz

>
> Regards,
> Christian.
>
> >>>>>>> It's a use-case that is working fine today with many devices (e.g. network
> >>>>>>> adapters) in the ARM world, exactly because the architecture specific
> >>>>>>> implementation of the DMA API inserts the cache maintenance operations
> >>>>>>> on buffer ownership transfer.
> >>>>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
> >>>>>> design totally breaks GPUs on Xen DOM0 for example.
> >>>>>>
> >>>>>> And Xen is just one example, I can certainly say from experience that
> >>>>>> this design was a really really bad idea because it favors just one use
> >>>>>> case while making other use cases practically impossible if not really
> >>>>>> hard to implement.
> >>>>> Sorry, I haven't worked with Xen. Could you elaborate what's the
> >>>>> problem that this introduces for it?
> >>>> That's a bit of a longer topic. The AMD Xen devs are already working on this
> >>>> as far as I know. I can ping internally how far they got with sending
> >>>> the patches out to avoid this problem.
> >>> Hmm, I see. It might be a good data point to understand in which
> >>> direction we should be going, but I guess we can wait until they send
> >>> some patches.
> >> There was just recently a longer thread on the amd-gfx mailing list
> >> about that. I think looking in there as well might be beneficial.
> > Okay, let me check. Thanks,
> >
> > Best regards,
> > Tomasz
>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-09  9:32                                         ` Pekka Paalanen
@ 2022-12-12  3:13                                           ` Tomasz Figa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tomasz Figa @ 2022-12-12  3:13 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Christian König, Lucas Stach, Nicolas Dufresne,
	Daniel Stone, sumit.semwal, daniel, robdclark, dri-devel,
	linaro-mm-sig, linux-media

On Fri, Dec 9, 2022 at 6:32 PM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Fri, 9 Dec 2022 17:26:06 +0900
> Tomasz Figa <tfiga@chromium.org> wrote:
>
> > On Mon, Dec 5, 2022 at 5:29 PM Christian König <christian.koenig@amd.com> wrote:
> > >
> > > Hi Tomasz,
> > >
> > > On 05.12.22 at 07:41, Tomasz Figa wrote:
> > > > [SNIP]
> > > >> In other words explicit ownership transfer is not something we would
> > > >> want as a requirement in the framework, because otherwise we break tons of
> > > >> use cases which require concurrent access to the underlying buffer.
> > > >>
> > > >> When a device driver needs explicit ownership transfer it's perfectly
> > > >> possible to implement this using the dma_fence objects mentioned above.
> > > >> E.g. drivers can already look at who is accessing a buffer currently and
> > > >> can even grab explicit ownership of it by adding their own dma_fence
> > > >> objects.
> > > >>
> > > >> The only exception is CPU based access, e.g. when something is written
> > > >> with the CPU a cache flush might be necessary and when something is read
> > > >> with the CPU a cache invalidation might be necessary.
> > > >>
> > > > Okay, that's much clearer now, thanks for clarifying this. So we
> > > > should be covered for the cache maintenance needs originating from CPU
> > > > accesses already, +/- the broken cases which don't call the begin/end
> > > > CPU access routines that I mentioned above.
> > > >
> > > > Similarly, for any ownership transfer between different DMA engines,
> > > > we should be covered either by the userspace explicitly flushing the
> > > > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > > >
> > > > But then, what's left to be solved? :) (Besides the cases of missing
> > > > begin/end CPU access calls.)
> > >
> > > Well there are multiple problems here:
> > >
> > > 1. A lot of userspace applications/frameworks assume that they can
> > > allocate the buffer anywhere and it just works.
> > >
> > > This isn't true at all, we have tons of cases where devices can only
> > > access their special memory for certain use cases.
> > > Just look at scanout for display on a dGPU: neither AMD nor NVidia
> > > supports system memory here. Similar cases exist for audio/video codecs
> > > where intermediate memory is only accessible by certain devices because
> > > of content protection.
> >
> > Ack.
> >
> > Although I think the most common case on mainstream Linux today is
> > properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> > GPU) and hoping that other devices would accept the buffers just fine,
> > which isn't a given on most platforms (although often it's just about
> > pixel format, width/height/stride alignment, tiling, etc. rather than
> > the memory itself). That's why ChromiumOS has minigbm and Android has
> > gralloc that act as the central point of knowledge on buffer
> > allocation.
>
> Hi,
>
> as an anecdote, when I was improving Mutter's cross-DRM-device handling
> (for DisplayLink uses) a few years ago, I implemented several different
> approaches of where to allocate, to try until going for the slowest but
> guaranteed to work case of copying every update into and out of sysram.
>
> It seems there are two different approaches in general for allocation
> and sharing:
>
> 1. Try different things until it works or you run out of options
>
> pro:
> - no need for a single software component to know everything about
>   every device in the system
>
> con:
> - who bothers with fallbacks, if the first try works on my system for
>   my use case I test with? I.e. cost of code in users.
> - trial-and-error can be very laborious (allocate, share with all
>   devices, populate, test)
> - the search space might be huge
>
>
> 2. Have a central component that knows what to do
>
> pro:
> - It might work on the first attempt, so no fallbacks in users.
> - It might be optimal.
>
> con:
> - You need a software component that knows everything about every
>   single combination of hardware in existence, multiplied by use cases.
>
>
> Neither seems good, which brings us back to https://github.com/cubanismo/allocator .
>

I need to refresh my memory on how far we went with that and what the
stoppers were, but it really sounds like we need it to make things
really work on a mainstream Linux system.

When I was still participating in the discussions, I remember the idea
was to have an API exposed by various components, like EGL, Vulkan,
etc. to describe the constraints. Maybe to simplify the problem,
instead of having this complex negotiation between different APIs, we
could have a plugin-based library and plugins would be installed
together with the various API implementations (e.g. even proprietary
Nvidia drivers could provide an open source allocation plugin to tell
the central allocator library about its requirements)?

I also recall we were stopped by lack of a generic DMA-buf allocation
functionality exposed to the userspace, but that now exists as DMA-buf
heaps.
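
(The allocation side of that now looks roughly like the sketch below from
userspace; which heap to open for a given set of devices is exactly the
part that is still guesswork.)

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/dma-heap.h>     /* DMA_HEAP_IOCTL_ALLOC */

/* Sketch only: allocate a DMA-buf fd from a named heap, e.g.
 * "/dev/dma_heap/system". Picking the right heap is the unsolved part. */
static int alloc_from_heap(const char *heap_path, size_t len)
{
        struct dma_heap_allocation_data data = {
                .len = len,
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        int heap_fd, ret;

        heap_fd = open(heap_path, O_RDONLY | O_CLOEXEC);
        if (heap_fd < 0)
                return -1;

        ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
        close(heap_fd);

        return ret < 0 ? -1 : (int)data.fd;
}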

>
> > > 2. We don't properly communicate allocation requirements to userspace.
> > >
> > > E.g. even if you allocate from DMA-Heaps userspace can currently only
> > > guess if normal, CMA or even device specific memory is needed.
> >
> > DMA-buf heaps actually make it even more difficult for the userspace,
> > because now it needs to pick the right heap. With allocation built
> > into the specific UAPI (like V4L2), it's at least possible to allocate
> > for one specific device without having any knowledge about allocation
> > constraints in the userspace.
> >
> > >
> > > 3. We seem to lack some essential parts of those restrictions in the
> > > documentation.
> > >
> >
> > Ack.
> >
> > > >>>> So if a device driver uses cached system memory on an architecture where
> > > >>>> devices can't access it, the right approach is clearly to reject
> > > >>>> the access.
> > > >>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
> > > >> Well that depends. As said above the exporter exports the buffer as it
> > > >> was allocated.
> > > >>
> > > >> If that means that the exporter provides a piece of memory which requires
> > > >> CPU cache snooping to access correctly then the best thing we can do is
> > > >> to prevent an importer which can't do this from attaching.
> > > > Could you elaborate more about this case? Does it exist in practice?
> > > > Do I assume correctly that it's about sharing a buffer between one DMA
> > > > engine that is cache-coherent and another that is non-coherent, where
> > > > the first one ends up having its accesses always go through some kind
> > > > of a cache (CPU cache, L2/L3/... cache, etc.)?
> > >
> > > Yes, exactly that. What happens in this particular use case is that one
> > > device driver wrote to its internal buffer with the CPU (so some cache
> > > lines were dirty) and then a device which couldn't deal with that tried
> > > to access it.
> >
> > If so, shouldn't that driver surround its CPU accesses with
> > begin/end_cpu_access() in the first place?
> >
> > The case that I was suggesting was of a hardware block that actually
> > sits behind the CPU cache and thus dirties it on writes, not the
> > driver doing that. (I haven't personally encountered such a system,
> > though.)
> >
> > >
> > > We could say that all device drivers must always look at the coherency
> > > of the devices which want to access their buffers. But that would
> > > horribly complicate things for maintaining the drivers because then
> > > drivers would need to take into account requirements from other drivers
> > > while allocating their internal buffers.
> >
> > I think it's partially why we have the allocation part of the DMA
> > mapping API, but currently it's only handling requirements of one
> > device. And we don't have any information from the userspace what
> > other devices the buffer would be used with...
> >
> > Actually, do we even have such information in the userspace today?
> > Let's say I do a video call in a web browser on a typical Linux
> > system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
> > V4L2 camera fills in buffers with video frames and both encoder and
> > display consume them. Do we have a central place which would know that
> > a buffer needs to be allocated that works with the producer and all
> > consumers?
>
> I have a vague belief that many, many years ago, in the early days of
> dmabuf development, there was the idea of the sequence:
> - create a dmabuf handle
> - share the handle with all devices that would need access
> - *then* do the allocation with kernel-internal negotiation to fill all
>   devices' needs, if at all possible

Yeah, I had such a recollection as well. I think the difficulty is to
get the userspace to tell the kernel about all the devices before the
buffer needs to be allocated.

>
> Obviously that didn't happen. I think today's dmabuf Wayland protocol
> would support this though.
>
> Anyway, Wayland can tell the app which DRM devices a buffer
> needs to work with as a GPU texture and potentially on same/another
> DRM device as a KMS framebuffer, so theoretically the app could know.
>

Good to know, thanks.

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-12  3:00                                           ` Tomasz Figa
@ 2022-12-12 11:15                                             ` Christian König
  -1 siblings, 0 replies; 110+ messages in thread
From: Christian König @ 2022-12-12 11:15 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Nicolas Dufresne, linaro-mm-sig, ppaalanen, dri-devel,
	linux-media, sumit.semwal

On 12.12.22 at 04:00, Tomasz Figa wrote:
> [SNIP]
>> What we could do is to force all exporters to use begin/end_cpu_access()
>> even on it's own buffers and look at all the importers when the access
>> is completed. But that would increase the complexity of the handling in
>> the exporter.
> I feel like they should be doing so anyway, because it often depends
> on the SoC integration whether the DMA can do cache snooping or not.

Yeah, but that wouldn't help without the exporter taking care of the
importers' cache flushing needs. And that's exactly what we try hard
to avoid, because it creates code in every exporter which is never tested
except for the case where you combine this exporter with a non-coherent
importer.

That's a testing nightmare, because then you have code everywhere which
is only actually used in very few combinations of exporter and importer.

> Although arguably, there is a corner case today where if one uses
> dma_alloc_coherent() to get a buffer with a coherent CPU mapping for
> device X that is declared as cache-coherent, one also expects not to
> need to call begin/end_cpu_access(), but those would be needed if the
> buffer was to be imported by device Y that is not cache-coherent...
>
> Sounds like after all it's a mess. I guess your patches make it one
> step closer to something sensible, import would fail in such cases.
> Although arguably we should be able to still export from driver Y and
> import to driver X just fine if Y allocated the buffer as coherent -
> otherwise we would break existing users for whom things worked fine.

Allocating the buffer as coherent won't help in this case because we 
actually do want CPU caching for performant access. It's just that some 
device needs a cache flush before it sees all the changes.

As far as I can see, without adding additional complexity to the exporter
this can only be achieved in two ways:

1. Switch the role of the exporter and importer. This way the device 
with the need for the cache flush is the exporter and in control of the 
operations on its buffers.

2. We use DMA-buf as a neutral mediator. Since DMA-buf keeps track of who
has mapped the buffers, it inserts the appropriate
dma_sync_*_for_device() calls.
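
(A very rough sketch of what option 2 could look like inside the
framework; nothing of this exists today, and locking as well as the
lifetime of the cached sg_table are hand-waved away.)

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/dma-map-ops.h>

/* Sketch only: when CPU access ends, flush on behalf of every attached
 * importer which can't snoop the CPU caches. Assumes the dma_resv lock is
 * held and that attach->sgt points at the cached mapping, if any. */
static void flush_for_noncoherent_importers(struct dma_buf *dmabuf,
                                            enum dma_data_direction dir)
{
        struct dma_buf_attachment *attach;

        list_for_each_entry(attach, &dmabuf->attachments, node) {
                if (dev_is_dma_coherent(attach->dev))
                        continue;

                if (attach->sgt)
                        dma_sync_sgtable_for_device(attach->dev, attach->sgt,
                                                    dir);
        }
}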

>> [SNIP]
>> The best we can do is to reject combinations which won't work in the
>> kernel and then userspace could react accordingly.
>>
> The question is whether userspace is able to deal with it, given the
> limited amount of information it gets from the kernel. Sure, there is
> always the ultimate fallback of memcpy(), but in many cases that would
> be totally unusable due to severe performance impact. If we were to
> remove the existing extent of implicit handling from the kernel, I
> think we need to first give the userspace the information necessary to
> explicitly handle the fallback to the same extent.

Good point.

> We also need to think about backwards compatibility. Simply removing
> the implicit fallback cases would probably break a lot of userspace,
> so an opt-in behavior is likely needed initially...

Yes, I'm completely aware of that as well.

We can't hard-break userspace, even if the previous approach didn't
work 100% correctly.

>>>> That's essentially the reason why we have DMA-buf heaps. Those heaps
>>>> expose system memory buffers with certain properties (size, CMA, DMA bit
>>>> restrictions etc...) and also implement the callback functions for CPU
>>>> cache maintenance.
>>>>
>>> How do DMA-buf heaps change anything here? We already have CPU cache
>>> maintenance callbacks in the DMA-buf API - the begin/end_cpu_access()
>>> for CPU accesses and dmabuf_map/unmap_attachment() for device accesses
>>> (which arguably shouldn't actually do CPU cache maintenance, unless
>>> that's implied by how either of the involved DMA engines work).
>> DMA-buf heaps are the neutral man in the middle.
>>
>> The implementation keeps track of all the attached importers and should
>> make sure that the allocated memory fits the needs of everyone.
>> In addition to that, calls to the DMA API cache management functions
>> are inserted whenever CPU access begins/ends.
>>
> I think in the current design, it only knows all the importers after the
> buffer is already allocated, so it doesn't necessarily have a way to
> handle the allocation constraints. Something would have to be done to
> get all the importers attached before the allocation actually takes
> place.

That's already in place. See the attach and map callbacks.

I'm just not sure if the heaps fully implement it like this.
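
(To make that concrete, a sketch of what a deferred-allocation heap's
attach callback could record before the backing storage exists; none of
this is current heap code, the structure and its fields are made up.)

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/dma-map-ops.h>

/* Hypothetical deferred-allocation heap: attach only records constraints,
 * the actual allocation would happen on the first map_dma_buf() call. */
struct deferred_heap_buffer {
        size_t size;
        u64 dma_mask;           /* intersection of all importers' DMA masks */
        bool need_cpu_flush;    /* any non-coherent importer attached? */
        struct page **pages;    /* allocated lazily on first mapping */
};

static int deferred_heap_attach(struct dma_buf *dmabuf,
                                struct dma_buf_attachment *attach)
{
        struct deferred_heap_buffer *buf = dmabuf->priv;

        if (buf->pages)
                return -EBUSY;  /* too late, the constraints are already fixed */

        buf->dma_mask &= dma_get_mask(attach->dev);
        if (!dev_is_dma_coherent(attach->dev))
                buf->need_cpu_flush = true;

        return 0;
}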

>>>>>> The problem is that in this particular case the exporter provides the
>>>>>> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
>>>>>> with that.
>>>>> Why does the exporter leave the buffer with dirty CPU caches?
>>>> Because exporters always export the buffers as they would use them. And in
>>>> this case that means writing with the CPU to it.
>>>>
>>> Sorry for the question not being very clear. I meant: How do the CPU
>>> caches get dirty in that case?
>> The exporter wrote to it. As far as I understand the exporter just
>> copies things from A to B with memcpy to construct the buffer content.
>>
> Okay, so it's just due to CPU access and basically what we touched a
> few paragraphs above.

Yes, I've never seen a device which actually dirties the CPU cache. But 
I would never rule out that such a device exists.

Regards,
Christian.

>>> [SNIP]
>>>> Yes, totally agree. The problem is really that we moved a bunch of MM and
>>>> DMA functions into one API.
>>>>
>>>> The bounce buffers are something we should really have in a separate
>>>> component.
>>>>
>>>> Then the functionality of allocating system memory for a specific device
>>>> or devices should be something provided by the MM.
>>>>
>>>> And finally making this memory or any other CPU address accessible to a
>>>> device (IOMMU programming etc.) should then be part of a DMA API.
>>>>
>>> Remember that actually making the memory accessible to a device often
>>> needs to be handled already as a part of the allocation (e.g. dma_mask
>>> in the non-IOMMU case). So I tend to think that the current division
>>> of responsibilities is mostly fine - the dma_alloc family should be
>>> seen as a part of MM already, especially with all the recent
>>> improvements from Christoph, like dma_alloc_pages().
>> Yes, that's indeed a very interesting development which as far as I can
>> see goes into the right direction.
>>
>>> That said, it may indeed make sense to move things like IOMMU mapping
>>> management out of the dma_alloc() and just reduce those functions to
>>> simply returning a set of pages that satisfy the allocation
>>> constraints. One would need to call dma_map() after the allocation,
>>> but that's actually a fair expectation. Sometimes it might be
>>> preferred to pre-allocate the memory, but only map it into the device
>>> address space when it's really necessary.
>> What I'm still missing is the functionality to allocate pages for
>> multiple devices and proper error codes when dma_map() can't make the
>> page accessible to a device.
> Agreed. Although again, I think the more challenging part would be to
> get the complete list of devices involved before the allocation
> happens.
>
> Best regards,
> Tomasz
>
>> Regards,
>> Christian.
>>
>>>>>>>>> It's a use-case that is working fine today with many devices (e.g. network
>>>>>>>>> adapters) in the ARM world, exactly because the architecture specific
>>>>>>>>> implementation of the DMA API inserts the cache maintenance operations
>>>>>>>>> on buffer ownership transfer.
>>>>>>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
>>>>>>>> design totally breaks GPUs on Xen DOM0 for example.
>>>>>>>>
>>>>>>>> And Xen is just one example, I can certainly say from experience that
>>>>>>>> this design was a really really bad idea because it favors just one use
>>>>>>>> case while making other use cases practically impossible if not really
>>>>>>>> hard to implement.
>>>>>>> Sorry, I haven't worked with Xen. Could you elaborate what's the
>>>>>>> problem that this introduces for it?
>>>>>> That's a bit of a longer topic. The AMD Xen devs are already working on this
>>>>>> as far as I know. I can ping internally how far they got with sending
>>>>>> the patches out to avoid this problem.
>>>>> Hmm, I see. It might be a good data point to understand in which
>>>>> direction we should be going, but I guess we can wait until they send
>>>>> some patches.
>>>> There was just recently a longer thread on the amd-gfx mailing list
>>>> about that. I think looking in there as well might be beneficial.
>>> Okay, let me check. Thanks,
>>>
>>> Best regards,
>>> Tomasz


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-02 15:27                                                   ` Christian König
@ 2023-01-05 11:46                                                     ` Daniel Vetter
  -1 siblings, 0 replies; 110+ messages in thread
From: Daniel Vetter @ 2023-01-05 11:46 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, Nicolas Dufresne, Pekka Paalanen, Rob Clark,
	Tomasz Figa, Lucas Stach, Daniel Stone, sumit.semwal, dri-devel,
	linaro-mm-sig, linux-media

On Fri, Dec 02, 2022 at 04:27:08PM +0100, Christian König wrote:
> On 30.11.22 at 11:30, Daniel Vetter wrote:
> > On Fri, Nov 25, 2022 at 11:40:04AM -0500, Nicolas Dufresne wrote:
> > > On Wednesday, 23 November 2022 at 17:33 +0100, Daniel Vetter wrote:
> > > > On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:
> > > > > On Tue, 22 Nov 2022 18:33:59 +0100
> > > > > Christian König <christian.koenig@amd.com> wrote:
> > > > > 
> > > > > > We should have come up with dma-heaps earlier and made it clear that
> > > > > > exporting a DMA-buf from a device gives you something device specific
> > > > > > which might or might not work with others.
> > > > > > 
> > > > > > Apart from that I agree, DMA-buf should be capable of handling this.
> > > > > > Question left is what documentation is missing to make it clear how
> > > > > > things are supposed to work?
> > > > > Perhaps somewhat related from Daniel Stone that seems to have been
> > > > > forgotten:
> > > > > https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.com/
> > > > > 
> > > > > It aimed mostly at userspace, but sounds to me like the coherency stuff
> > > > > could use a section of its own there?
> > > > Hm yeah it would be great to land that and then eventually extend. Daniel?
> > > There are a lot of things documented in this document that have been called
> > > completely wrong user-space behaviour in this thread. But it seems to pre-date
> > > the DMA Heaps. The document also assumes that DMA Heaps completely solve the CMA
> > > vs system memory issue. But it also underlines a very important aspect, that
> > > userland is not aware which one to use. What this document suggests, though, seems
> > > more realistic than what has been said here.
> > > 
> > > It's overall a great document; it's unfortunate that it only made it to the DRM
> > > mailing list.
> > The doc is more about documenting the current status quo/best practices,
> > which is very much not using dma-heaps.
> > 
> > The issue there is that currently userspace has no idea which dma-heap to
> > use for shared buffers, plus not all allocators are exposed through heaps
> > to begin with. We had this noted as a todo item (adding some device->heap
> > sysfs links was the idea); until that's done all you can do is hardcode
> > the right heaps for the right usage in userspace, which is what Android
> > does. Plus Android doesn't have dGPU, so it doesn't need the missing TTM
> > heap.
> 
> Hui? Why do you think we should have a TTM heap in the first place?

[sorry for the really late reply]

Extremely late reply, but p2p buffer sharing when you want the buffer in
VRAM is something only TTM can provide. And if the goal is to have all
buffer-sharing needs served by dma-heaps, then eventually that will be
needed too.

But probably totally fine to just not have it for another few years
(decades?).
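
In the meantime, P2P sharing of a VRAM-backed buffer already goes through the
dynamic attachment path rather than through a heap. A rough sketch of the
importer side is below; the callback and helper names are made up for
illustration, and locking and error handling are omitted:

#include <linux/dma-buf.h>

/* Illustrative only: the importer opts in to peer-to-peer with
 * allow_peer2peer, and the exporter may then keep the buffer in VRAM and
 * hand out PCIe P2P addresses instead of migrating it to system memory.
 */
static void my_move_notify(struct dma_buf_attachment *attach)
{
	/* tear down any cached mapping here; re-map on next use */
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,
	.move_notify = my_move_notify,
};

static struct dma_buf_attachment *my_attach(struct dma_buf *dmabuf,
					    struct device *dev, void *priv)
{
	return dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, priv);
}

Whether the exporter actually granted P2P access then shows up in
attach->peer2peer when mapping.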

> As far as I can see we need three different ways of allocation:
> 1. Normal system memory.
> 2. CMA based.
> 3. Device specific.
> 
> When any of the participating devices needs CMA then user space must use the
> CMA heap, and when any of the participating devices needs device specific
> memory then user space must use device specific memory (from that device).
> 
> It can still be that two devices can't talk with each other because both
> need the buffer to be allocated from device specific memory for a
> particular use case, but at least all the cases for both non-CPU-cache-coherent
> ARM devices as well as device specific memory for scanout should be handled
> with this.

Yeah on ARM this should be enough.
-Daniel
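
To make the "user space must pick the right heap" point concrete: allocating
from a named heap is just an ioctl on its node under /dev/dma_heap. A minimal
sketch of the UAPI usage (with the heap name hardcoded, which is exactly the
problem discussed above, and with error reporting kept trivial):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Returns a DMA-buf fd allocated from the given heap, or -1 on failure. */
static int alloc_from_heap(const char *heap, size_t size)
{
	struct dma_heap_allocation_data data = {
		.len = size,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	char path[64];
	int heap_fd, ret;

	snprintf(path, sizeof(path), "/dev/dma_heap/%s", heap);
	heap_fd = open(path, O_RDONLY | O_CLOEXEC);
	if (heap_fd < 0)
		return -1;

	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
	close(heap_fd);

	return ret < 0 ? -1 : (int)data.fd;
}

The resulting fd can then be handed to every importer, but nothing tells the
caller whether "system", "linux,cma" or something device specific was the
right choice for the devices involved, which is the gap discussed here.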

> 
> Regards,
> Christian.
> 
> > 
> > But yeah the long-term aspiration also hasn't changed, because the
> > dma-heap todo list is also very, very old by now :-/
> > -Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [Linaro-mm-sig] Re: Try to address the DMA-buf coherency problem
  2022-12-09 17:07                                           ` Alex Deucher
@ 2023-01-05 11:50                                             ` Daniel Vetter
  -1 siblings, 0 replies; 110+ messages in thread
From: Daniel Vetter @ 2023-01-05 11:50 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Pekka Paalanen, Tomasz Figa, Christian König,
	Nicolas Dufresne, linaro-mm-sig, dri-devel, linux-media,
	sumit.semwal

On Fri, Dec 09, 2022 at 12:07:49PM -0500, Alex Deucher wrote:
> On Fri, Dec 9, 2022 at 4:32 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Fri, 9 Dec 2022 17:26:06 +0900
> > Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > > On Mon, Dec 5, 2022 at 5:29 PM Christian König <christian.koenig@amd.com> wrote:
> > > >
> > > > Hi Tomasz,
> > > >
> > > > Am 05.12.22 um 07:41 schrieb Tomasz Figa:
> > > > > [SNIP]
> > > > >> In other words explicit ownership transfer is not something we would
> > > > >> want as requirement in the framework, cause otherwise we break tons of
> > > > >> use cases which require concurrent access to the underlying buffer.
> > > > >>
> > > > >> When a device driver needs explicit ownership transfer it's perfectly
> > > > >> possible to implement this using the dma_fence objects mentioned above.
> > > > >> E.g. drivers can already look at who is accessing a buffer currently and
> > > > >> can even grab explicit ownership of it by adding their own dma_fence
> > > > >> objects.
> > > > >>
> > > > >> The only exception is CPU based access, e.g. when something is written
> > > > >> with the CPU a cache flush might be necessary and when something is read
> > > > >> with the CPU a cache invalidation might be necessary.
> > > > >>
> > > > > Okay, that's much clearer now, thanks for clarifying this. So we
> > > > > should be covered for the cache maintenance needs originating from CPU
> > > > > accesses already, +/- the broken cases which don't call the begin/end
> > > > > CPU access routines that I mentioned above.
> > > > >
> > > > > Similarly, for any ownership transfer between different DMA engines,
> > > > > we should be covered either by the userspace explicitly flushing the
> > > > > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > > > >
> > > > > But then, what's left to be solved? :) (Besides the cases of missing
> > > > > begin/end CPU access calls.)
> > > >
> > > > Well there are multiple problems here:
> > > >
> > > > 1. A lot of userspace applications/frameworks assume that they can
> > > > allocate the buffer anywhere and it just works.
> > > >
> > > > This isn't true at all; we have tons of cases where devices can only
> > > > access their special memory for certain use cases.
> > > > Just look at scanout for displaying on a dGPU: neither AMD nor NVidia
> > > > supports system memory here. Similar cases exist for audio/video codecs
> > > > where intermediate memory is only accessible by certain devices because
> > > > of content protection.
> > >
> > > Ack.
> > >
> > > Although I think the most common case on mainstream Linux today is
> > > properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> > > GPU) and hoping that other devices would accept the buffers just fine,
> > > which isn't a given on most platforms (although often it's just about
> > > pixel format, width/height/stride alignment, tiling, etc. rather than
> > > the memory itself). That's why ChromiumOS has minigbm and Android has
> > > gralloc that act as the central point of knowledge on buffer
> > > allocation.
> >
> > Hi,
> >
> > as an anecdote, when I was improving Mutter's cross-DRM-device handling
> > (for DisplayLink uses) a few years ago, I implemented several different
> > approaches of where to allocate, to try until going for the slowest but
> > guaranteed to work case of copying every update into and out of sysram.
> >
> > It seems there are two different approaches in general for allocation
> > and sharing:
> >
> > 1. Try different things until it works or you run out of options
> >
> > pro:
> > - no need for a single software component to know everything about
> >   every device in the system
> >
> > con:
> > - who bothers with fallbacks, if the first try works on my system for
> >   my use case I test with? I.e. cost of code in users.
> > - trial-and-error can be very laborious (allocate, share with all
> >   devices, populate, test)
> > - the search space might be huge
> >
> 
> Even that is fraught with difficulty.  We had a ton of bug reports
> over the years claiming amdgpu was broken when users tried to use
> displaylink devices in combination with AMD dGPUs because the
> performance was so slow.  The problem was that rather than using
> dma-buf, the compositor was just mmapping the dGPU BAR, which
> happens to be uncached write combined and across the PCI bus, and
> copying the data to the displaylink device.  Read access to a PCI BAR
> with the CPU is on the order of 10s of MB per second.

Yeah, you shouldn't read back dma-bufs directly when you get them from a
GL/VK driver, but instead use the tools these APIs provide for CPU access.

But that just puts us even more firmly into the "who even bothers"
territory because people just hack up something that happens to work on
their igpu (where at least on x86 you nowadays tend to get fully cpu
cached buffers).
-Daniel
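
For reference, the generic dma-buf CPU-access protocol that userspace is
expected to follow when it does map a buffer directly is the sync ioctl from
<linux/dma-buf.h>. A minimal read-back sketch (error handling omitted) is
below; note that it only gets the cache maintenance right, it does nothing
about the speed of reading an uncached BAR mapping, which is why the GL/VK
read-back paths are still the right tool:

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/dma-buf.h>

/* Copy the contents of a dma-buf to dst, bracketing the CPU access with
 * DMA_BUF_IOCTL_SYNC so the exporter can do any needed cache maintenance.
 */
static int read_back(int dmabuf_fd, void *dst, size_t size)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ,
	};
	void *map = mmap(NULL, size, PROT_READ, MAP_SHARED, dmabuf_fd, 0);

	if (map == MAP_FAILED)
		return -1;

	ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
	memcpy(dst, map, size);

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
	ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

	munmap(map, size);
	return 0;
}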

> 
> Alex
> 
> >
> > 2. Have a central component that knows what to do
> >
> > pro:
> > - It might work on the first attempt, so no fallbacks in users.
> > - It might be optimal.
> >
> > con:
> > - You need a software component that knows everything about every
> >   single combination of hardware in existence, multiplied by use cases.
> >
> >
> > Neither seems good, which brings us back to https://github.com/cubanismo/allocator .
> >
> >
> > > > 2. We don't properly communicate allocation requirements to userspace.
> > > >
> > > > E.g. even if you allocate from DMA-Heaps userspace can currently only
> > > > guess if normal, CMA or even device specific memory is needed.
> > >
> > > DMA-buf heaps actually make it even more difficult for the userspace,
> > > because now it needs to pick the right heap. With allocation built
> > > into the specific UAPI (like V4L2), it's at least possible to allocate
> > > for one specific device without having any knowledge about allocation
> > > constraints in the userspace.
> > >
> > > >
> > > > 3. We seem to lack some essential parts of those restrictions in the
> > > > documentation.
> > > >
> > >
> > > Ack.
> > >
> > > > >>>> So if a device driver uses cached system memory on an architecture with
> > > > >>>> devices which can't access it, the right approach is clearly to reject
> > > > >>>> the access.
> > > > >>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
> > > > >> Well that depends. As said above the exporter exports the buffer as it
> > > > >> was allocated.
> > > > >>
> > > > >> If that means that the exporter provides a piece of memory which requires
> > > > >> CPU cache snooping to access correctly then the best thing we can do is
> > > > >> to prevent an importer which can't do this from attaching.
> > > > > Could you elaborate more about this case? Does it exist in practice?
> > > > > Do I assume correctly that it's about sharing a buffer between one DMA
> > > > > engine that is cache-coherent and another that is non-coherent, where
> > > > > the first one ends up having its accesses always go through some kind
> > > > > of a cache (CPU cache, L2/L3/... cache, etc.)?
> > > >
> > > > Yes, exactly that. What happens in this particular use case is that one
> > > > device driver wrote to its internal buffer with the CPU (so some cache
> > > > lines were dirty) and then a device which couldn't deal with that tried
> > > > to access it.
> > >
> > > If so, shouldn't that driver surround its CPU accesses with
> > > begin/end_cpu_access() in the first place?
> > >
> > > The case that I was suggesting was of a hardware block that actually
> > > sits behind the CPU cache and thus dirties it on writes, not the
> > > driver doing that. (I haven't personally encountered such a system,
> > > though.)
> > >
> > > >
> > > > We could say that all device drivers must always look at the coherency
> > > > of the devices which want to access their buffers. But that would
> > > > horribly complicate things for maintaining the drivers because then
> > > > drivers would need to take into account requirements from other drivers
> > > > while allocating their internal buffers.
> > >
> > > I think it's partially why we have the allocation part of the DMA
> > > mapping API, but currently it's only handling requirements of one
> > > device. And we don't have any information from the userspace what
> > > other devices the buffer would be used with...
> > >
> > > Actually, do we even have such information in the userspace today?
> > > Let's say I do a video call in a web browser on a typical Linux
> > > system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
> > > V4L2 camera fills in buffers with video frames and both encoder and
> > > display consume them. Do we have a central place which would know that
> > > a buffer needs to be allocated that works with the producer and all
> > > consumers?
> >
> > I have a vague belief that many, many years ago, in the early days of
> > dmabuf development, there was the idea of the sequence:
> > - create a dmabuf handle
> > - share the handle with all devices that would need access
> > - *then* do the allocation with kernel-internal negotiation to fill all
> >   devices' needs, if at all possible
> >
> > Obviously that didn't happen. I think today's dmabuf Wayland protocol
> > would support this though.
> >
> > Anyway, Wayland can tell the app which DRM devices a buffer
> > needs to work with as a GPU texture and potentially on same/another
> > DRM device as a KMS framebuffer, so theoretically the app could know.
> >
> >
> > Thanks,
> > pq
> _______________________________________________
> Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org
> To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: Try to address the DMA-buf coherency problem
  2022-12-09 10:27                                         ` Christian König
@ 2023-01-05 20:39                                           ` Sebastian Wick
  -1 siblings, 0 replies; 110+ messages in thread
From: Sebastian Wick @ 2023-01-05 20:39 UTC (permalink / raw)
  To: Christian König
  Cc: Tomasz Figa, Nicolas Dufresne, linaro-mm-sig, ppaalanen,
	dri-devel, linux-media, sumit.semwal

On Fri, Dec 9, 2022 at 11:28 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 09.12.22 um 09:26 schrieb Tomasz Figa:
> > [SNIP]
> > Although I think the most common case on mainstream Linux today is
> > properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> > GPU) and hoping that other devices would accept the buffers just fine,
> > which isn't a given on most platforms (although often it's just about
> > pixel format, width/height/stride alignment, tiling, etc. rather than
> > the memory itself). That's why ChromiumOS has minigbm and Android has
> > gralloc that act as the central point of knowledge on buffer
> > allocation.
>
> Yeah, completely agree. The crux is really that we need to improve those
> user space allocators by providing more information from the kernel.
>
> >> 2. We don't properly communicate allocation requirements to userspace.
> >>
> >> E.g. even if you allocate from DMA-Heaps userspace can currently only
> >> guess if normal, CMA or even device specific memory is needed.
> > DMA-buf heaps actually make it even more difficult for the userspace,
> > because now it needs to pick the right heap. With allocation built
> > into the specific UAPI (like V4L2), it's at least possible to allocate
> > for one specific device without having any knowledge about allocation
> > constraints in the userspace.
>
> Yes, same as what Daniel said. We need to provide some more hints from the
> kernel about which allocator to use.

This information doesn't necessarily have to come from the kernel. For
KMS it makes sense for the kernel to provide it but Vulkan drivers for
example could have the information entirely in the UMD. The important
point is that user space can somehow query which heap can be used with
which constraints for a specific operation. At that point it's
entirely a user space problem.

>
> >>>>>> So if a device driver uses cached system memory on an architecture with
> >>>>>> devices which can't access it, the right approach is clearly to reject
> >>>>>> the access.
> >>>>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
> >>>> Well that depends. As said above the exporter exports the buffer as it
> >>>> was allocated.
> >>>>
> >>>> If that means that the exporter provides a piece of memory which requires
> >>>> CPU cache snooping to access correctly then the best thing we can do is
> >>>> to prevent an importer which can't do this from attaching.
> >>> Could you elaborate more about this case? Does it exist in practice?
> >>> Do I assume correctly that it's about sharing a buffer between one DMA
> >>> engine that is cache-coherent and another that is non-coherent, where
> >>> the first one ends up having its accesses always go through some kind
> >>> of a cache (CPU cache, L2/L3/... cache, etc.)?
> >> Yes, exactly that. What happens in this particular use case is that one
> >> device driver wrote to its internal buffer with the CPU (so some cache
> >> lines were dirty) and then a device which couldn't deal with that tried
> >> to access it.
> > If so, shouldn't that driver surround its CPU accesses with
> > begin/end_cpu_access() in the first place?
>
> The problem is that the roles are reversed. The callbacks let the
> exporter know that it needs to flush the caches when the importer is
> done accessing the buffer with the CPU.
>
> But here the exporter is the one accessing the buffer with the CPU and
> the importer then accesses stale data because it doesn't snoop the caches.
>
> What we could do is to force all exporters to use begin/end_cpu_access()
> even on their own buffers and look at all the importers when the access
> is completed. But that would increase the complexity of the handling in
> the exporter.
>
> In other words we would then have code in the exporters which is only
> written for handling the constraints of the importers. This has a wide
> variety of consequences, especially that this functionality of the
> exporter can't be tested without a proper importer.
>
> I was also thinking about reversing the role of exporter and importer in
> the kernel, but came to the conclusion that doing this under the hood
> without userspace knowing about it is probably not going to work either.
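
For reference, the bracketing those callbacks were designed for is the
importer touching the buffer with the CPU through the DMA-buf API, roughly as
in the sketch below (function names made up for illustration, locking rules
and most error handling omitted). That is exactly why it does not help when
the exporter itself is the one writing with the CPU:

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/iosys-map.h>

/* Importer-side CPU write, bracketed so the exporter's begin/end_cpu_access()
 * hooks get a chance to do whatever cache maintenance the buffer needs.
 */
static int importer_cpu_fill(struct dma_buf *dmabuf, const void *src,
			     size_t len)
{
	struct iosys_map map;
	int ret;

	ret = dma_buf_vmap(dmabuf, &map);
	if (ret)
		return ret;

	dma_buf_begin_cpu_access(dmabuf, DMA_TO_DEVICE);
	iosys_map_memcpy_to(&map, 0, src, len);
	dma_buf_end_cpu_access(dmabuf, DMA_TO_DEVICE);

	dma_buf_vunmap(dmabuf, &map);
	return 0;
}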
>
> > The case that I was suggesting was of a hardware block that actually
> > sits behind the CPU cache and thus dirties it on writes, not the
> > driver doing that. (I haven't personally encountered such a system,
> > though.)
>
> Never heard of anything like that either, but who knows.
>
> >> We could say that all device drivers must always look at the coherency
> >> of the devices which want to access their buffers. But that would
> >> horribly complicate things for maintaining the drivers because then
> >> drivers would need to take into account requirements from other drivers
> >> while allocating their internal buffers.
> > I think it's partially why we have the allocation part of the DMA
> > mapping API, but currently it's only handling requirements of one
> > device. And we don't have any information from the userspace what
> > other devices the buffer would be used with...
>
> Exactly that, yes.
>
> > Actually, do we even have such information in the userspace today?
> > Let's say I do a video call in a web browser on a typical Linux
> > system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
> > V4L2 camera fills in buffers with video frames and both encoder and
> > display consume them. Do we have a central place which would know that
> > a buffer needs to be allocated that works with the producer and all
> > consumers?
>
> Both X11 and Wayland have protocols which can be used to display a
> certain DMA-buf handle, and their feedback packages contain information on how
> ideal your configuration is, e.g. if the DMA-buf handle could be used or
> if an extra copy was needed etc...
>
> Similar exists between VAAPI and V4L2 as far as I know, but as you noted
> as well here it's indeed more about negotiating pixel format, stride,
> padding, alignment etc...
>
> The best we can do is to reject combinations which won't work in the
> kernel and then userspace could react accordingly.
>
> >> That's essentially the reason why we have DMA-buf heaps. Those heaps
> >> expose system memory buffers with certain properties (size, CMA, DMA bit
> >> restrictions etc...) and also implement the callback functions for CPU
> >> cache maintenance.
> >>
> > How do DMA-buf heaps change anything here? We already have CPU cache
> > maintenance callbacks in the DMA-buf API - the begin/end_cpu_access()
> > for CPU accesses and dmabuf_map/unmap_attachment() for device accesses
> > (which arguably shouldn't actually do CPU cache maintenance, unless
> > that's implied by how either of the involved DMA engines work).
>
> DMA-buf heaps are the neutral man in the middle.
>
> The implementation keeps track of all the attached importers and should
> make sure that the allocated memory fits the needs of everyone.
> In addition to that, calls to the DMA API cache management functions
> are inserted whenever CPU access begins/ends.
>
> That's the best we can do for system memory sharing, only device
> specific memory can't be allocated like this.
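
What "inserting the cache management calls" boils down to in a heap exporter
is roughly the sketch below, modelled loosely on the system heap; the private
struct and function names are made up for illustration:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/scatterlist.h>

struct my_heap_attachment {
	struct device *dev;
	struct sg_table *table;
	struct list_head list;
};

struct my_heap_buffer {
	struct mutex lock;
	struct list_head attachments;
};

/* Walk all attached devices and let the DMA API do the per-device cache
 * maintenance before the CPU starts touching the buffer.
 */
static int my_heap_begin_cpu_access(struct dma_buf *dmabuf,
				    enum dma_data_direction dir)
{
	struct my_heap_buffer *buf = dmabuf->priv;
	struct my_heap_attachment *a;

	mutex_lock(&buf->lock);
	list_for_each_entry(a, &buf->attachments, list)
		dma_sync_sgtable_for_cpu(a->dev, a->table, dir);
	mutex_unlock(&buf->lock);

	return 0;
}

The matching end_cpu_access() does the same with dma_sync_sgtable_for_device().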
>
> >>>> We do have the system and CMA dma-buf heap for cases where a device
> >>>> driver wants to access the buffer with caches enabled. So this is
> >>>> not a limitation in functionality, it's just a matter of correctly using it.
> >>>>
> >>> V4L2 also has the ability to allocate buffers and map them with caches enabled.
> >> Yeah, when that's a requirement for the V4L2 device it also makes
> >> perfect sense.
> >>
> >> It's just that we shouldn't force any specific allocation behavior on a
> >> device driver because of the requirements of a different device.
> >>
> >> To give an example, a V4L2 device shouldn't be forced to use
> >> videobuf2-dma-contig because some other device needs CMA. Instead we
> >> should use the common DMA-buf heaps which implement this as neutral
> >> supplier of system memory buffers.
> > Agreed, but that's only possible if we have a central entity that
> > understands what devices the requested buffer would be used with. My
> > understanding is that we're nowhere close to that with mainstream
> > Linux today.
> >
> > // As a side note, videobuf2-dma-contig can actually allocate
> > discontiguous memory, if the device is behind an IOMMU. Actually with
> > the recent DMA API improvements, we could probably coalesce
> > vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.
>
> That would probably make life a little bit simpler, yes.
>
> >>>> The problem is that in this particular case the exporter provides the
> >>>> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
> >>>> with that.
> >>> Why does the exporter leave the buffer with dirty CPU caches?
> >> Because exporters always export the buffers as they would use them. And in
> >> this case that means writing with the CPU to it.
> >>
> > Sorry for the question not being very clear. I meant: How do the CPU
> > caches get dirty in that case?
>
> The exporter wrote to it. As far as I understand the exporter just
> copies things from A to B with memcpy to construct the buffer content.
>
> > [SNIP]
> >> Yes, totally agree. The problem is really that we moved a bunch of MM and
> >> DMA functions into one API.
> >>
> >> The bounce buffers are something we should really have in a separate
> >> component.
> >>
> >> Then the functionality of allocating system memory for a specific device
> >> or devices should be something provided by the MM.
> >>
> >> And finally making this memory or any other CPU address accessible to a
> >> device (IOMMU programming etc.) should then be part of a DMA API.
> >>
> > Remember that actually making the memory accessible to a device often
> > needs to be handled already as a part of the allocation (e.g. dma_mask
> > in the non-IOMMU case). So I tend to think that the current division
> > of responsibilities is mostly fine - the dma_alloc family should be
> > seen as a part of MM already, especially with all the recent
> > improvements from Christoph, like dma_alloc_pages().
>
> Yes, that's indeed a very interesting development which as far as I can
> see goes into the right direction.
>
> > That said, it may indeed make sense to move things like IOMMU mapping
> > management out of the dma_alloc() and just reduce those functions to
> > simply returning a set of pages that satisfy the allocation
> > constraints. One would need to call dma_map() after the allocation,
> > but that's actually a fair expectation. Sometimes it might be
> > preferred to pre-allocate the memory, but only map it into the device
> > address space when it's really necessary.
>
> What I'm still missing is the functionality to allocate pages for
> multiple devices and proper error codes when dma_map() can't make the
> page accessible to a device.
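
A very rough sketch of the split being asked for here: allocate plain pages
first, then try to map them for every device that wants to use the buffer and
treat a mapping failure as "this combination does not work". Purely
illustrative, using today's per-device dma_map_page():

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/mm.h>

static int map_for_devices(struct page *page, size_t size,
			   struct device **devs, int ndev, dma_addr_t *addrs)
{
	int i;

	for (i = 0; i < ndev; i++) {
		addrs[i] = dma_map_page(devs[i], page, 0, size,
					DMA_BIDIRECTIONAL);
		if (dma_mapping_error(devs[i], addrs[i])) {
			/* unwind and report which device can't reach the pages */
			while (--i >= 0)
				dma_unmap_page(devs[i], addrs[i], size,
					       DMA_BIDIRECTIONAL);
			return -ENXIO;
		}
	}

	return 0;
}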
>
> Regards,
> Christian.
>
> >>>>>>> It's a use-case that is working fine today with many devices (e.g. network
> >>>>>>> adapters) in the ARM world, exactly because the architecture specific
> >>>>>>> implementation of the DMA API inserts the cache maintenance operations
> >>>>>>> on buffer ownership transfer.
> >>>>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
> >>>>>> design totally breaks GPUs on Xen DOM0 for example.
> >>>>>>
> >>>>>> And Xen is just one example, I can certainly say from experience that
> >>>>>> this design was a really really bad idea because it favors just one use
> >>>>>> case while making other use cases practically impossible if not really
> >>>>>> hard to implement.
> >>>>> Sorry, I haven't worked with Xen. Could you elaborate what's the
> >>>>> problem that this introduces for it?
> >>>> That's a bit longer topic. The AMD XEN devs are already working on this
> >>>> as far as I know. I can ping internally how far they got with sending
> >>>> the patches out to avoid this problem.
> >>> Hmm, I see. It might be a good data point to understand in which
> >>> direction we should be going, but I guess we can wait until they send
> >>> some patches.
> >> There was just recently a longer thread on the amd-gfx mailing list
> >> about that. I think looking in there as well might be beneficial.
> > Okay, let me check. Thanks,
> >
> > Best regards,
> > Tomasz
>


^ permalink raw reply	[flat|nested] 110+ messages in thread

end of thread, other threads:[~2023-01-05 20:40 UTC | newest]

Thread overview: 110+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-20 12:13 Try to address the DMA-buf coherency problem Christian König
2022-10-20 12:13 ` [PATCH 1/3] dma-buf: add dma_coherent flag Christian König
2022-10-20 12:13 ` [PATCH 2/3] drm/prime: set the dma_coherent flag for export Christian König
2022-10-20 14:43   ` Rob Clark
2022-10-20 14:43     ` Rob Clark
2022-10-20 14:56     ` Christian König
2022-10-20 14:56       ` Christian König
2022-10-28  1:11       ` Rob Clark
2022-10-28  1:11         ` Rob Clark
2022-10-20 12:13 ` [PATCH 3/3] media: videobuf2: set dma_coherent flag for DMA-buf Christian König
2022-11-04 10:42   ` Hans Verkuil
2022-10-27 12:14 ` Try to address the DMA-buf coherency problem Christian König
2022-10-28  8:09 ` Lucas Stach
2022-10-28  8:40   ` Christian König
2022-10-28 11:42     ` Lucas Stach
2022-10-28 14:26       ` Christian König
2022-10-28 15:46         ` Nicolas Dufresne
2022-10-28 17:50           ` Christian König
2022-10-28 18:47             ` Daniel Stone
2022-10-28 18:47               ` Daniel Stone
2022-11-01 17:40               ` Christian König
2022-11-01 17:40                 ` Christian König
2022-11-01 21:09                 ` Nicolas Dufresne
2022-11-01 21:09                   ` Nicolas Dufresne
2022-11-02 11:18                   ` Christian König
2022-11-02 11:18                     ` Christian König
2022-11-02 11:39                     ` Lucas Stach
2022-11-02 11:39                       ` Lucas Stach
2022-11-02 12:21                       ` Christian König
2022-11-02 12:21                         ` Christian König
2022-11-02 17:10                         ` Lucas Stach
2022-11-02 17:10                           ` Lucas Stach
2022-11-02 19:13                           ` Christian König
2022-11-02 19:13                             ` Christian König
2022-11-02 19:48                             ` Lucas Stach
2022-11-02 19:48                               ` Lucas Stach
2022-11-17  9:35                             ` Tomasz Figa
2022-11-17  9:35                               ` Tomasz Figa
2022-11-17 12:10                               ` Christian König
2022-11-17 12:10                                 ` Christian König
2022-11-17 15:38                                 ` Nicolas Dufresne
2022-11-17 15:38                                   ` Nicolas Dufresne
2022-11-18 19:32                                   ` Rob Clark
2022-11-18 19:32                                     ` Rob Clark
2022-11-19 20:35                                     ` Nicolas Dufresne
2022-11-19 20:35                                       ` Nicolas Dufresne
2022-11-22 14:36                                     ` Daniel Vetter
2022-11-22 14:36                                       ` Daniel Vetter
2022-11-22 17:33                                       ` Christian König
2022-11-22 17:33                                         ` Christian König
2022-11-22 18:26                                         ` Daniel Vetter
2022-11-23  8:33                                         ` Pekka Paalanen
2022-11-23 16:33                                           ` Daniel Vetter
2022-11-25 16:40                                             ` Nicolas Dufresne
2022-11-30 10:30                                               ` Daniel Vetter
2022-12-02 15:27                                                 ` Christian König
2023-01-05 11:46                                                   ` Daniel Vetter
2022-12-05  6:41                                 ` Tomasz Figa
2022-12-05  8:28                                   ` Christian König
2022-12-06 18:26                                     ` Nicolas Dufresne
2022-12-06 18:37                                       ` Christian König
2022-12-09  8:26                                     ` Tomasz Figa
2022-12-09  9:32                                       ` Pekka Paalanen
2022-12-09 17:07                                         ` Alex Deucher
2023-01-05 11:50                                           ` [Linaro-mm-sig] " Daniel Vetter
2022-12-12  3:13                                         ` Tomasz Figa
2022-12-09 10:27                                       ` Christian König
2022-12-12  3:00                                         ` Tomasz Figa
2022-12-12 11:15                                           ` Christian König
2023-01-05 20:39                                         ` Sebastian Wick
2022-11-04 14:58                         ` Rob Clark
2022-11-04 15:00                           ` Christian König
2022-11-02 12:19                     ` Pekka Paalanen
2022-11-02 12:27                       ` Christian König
2022-11-02 12:50                         ` Pekka Paalanen
2022-11-02 12:56                           ` Christian König
2022-11-03 22:16                     ` Nicolas Dufresne
2022-11-04  9:03                       ` Christian König
2022-11-04 13:38                         ` Nicolas Dufresne
2022-11-04 14:51                           ` Christian König
