On Mon, 2020-05-18 at 16:48 +0100, Dave Stevenson wrote:
> Hi Nicolas
> > > +     exp_info.ops = &dma_buf_import_ops;
> > > +     exp_info.size = import.size;
> > > +     exp_info.flags = O_RDWR;
> > > +     exp_info.priv = buffer;
> > > +
> > > +     buffer->dma_buf = dma_buf_export(&exp_info);
> > 
> > Could you comment on the need for this second dma_buf? I've only reviewed
> > code
> > related to mmal-vchiq imports, but it seems to me that it would be saner to
> > do
> > the unmapping and unattaching explicitly as opposed to having this second
> > buffer refcount hit 0. Although, I can imagine this being needed for the
> > userspace interface.
> 
> Indeed, as it is needed for the userspace interface it seemed to make
> more sense to have common handling rather than two code paths doing
> nearly the same thing but in different ways.
> Downstream we need a userspace import at least to allow MMAL to set up
> zero copy, so unless it raises any real objections then it would be
> useful to keep it.
> 
> > When you talk about moving to dmabuf heaps, I've pictured a specific dmabuf
> > heap for vc4 that takes care of all the importing and unimporting (aside
> > from
> > cma allocations). Am I right? If so, I'm pretty confident we can do away
> > with
> > this.
> 
> (Note I'm talking about the VideoCore4 VPU and other blocks, and not
> the vc4 DRM/KMS and V3D drivers)
> 
> No, I'm looking at using the existing cma_heap driver to do the
> allocations, and then this driver will import them and handle the
> lifetime on behalf of the VPU. There's no need for VPU allocations to
> be split off into yet another heap.

Fair enough.

> One of the things we are trying to get away from is having the gpu_mem
> reserved lump that Linux can't get access to at all, so allocating
> from the CMA heap and importing to the VPU avoids that.

That's great! Will this also apply at some point to the GPU side of things? I
vaguely recall having to reserve up to 300M on rpi3 to get mesa to work on a
desktop environment.

> I'll give some history here, which also hopefully covers your query
> over switching mmal-vchiq to zero copy.
> 
> Almost all the VC4 blocks need contiguous memory, so fragmentation was
> an issue. To resolve that we (back in Broadcom days) had the
> "relocatable heap" - allocations that needed to be locked before
> access and unlocked after. Unlocked blocks could be copied and moved
> around to free up larger contiguous blocks. These allocations use a
> handle instead of a pointer, and have internal refcounting etc.
> Basically providing some of the features of an MMU when you don't have
> one.
> 
> The original VCSM driver allowed userspace to make a relocatable heap
> allocation, lock it, and the kernel to map the relevant pages into the
> ARM memory space. Now you have a shared buffer, and VCHIQ no longer
> has to copy the data back and forth. (Cache flushing was also
> handled).
> So MMAL in zero copy mode passes the VPU relocatable heap handle
> across in the VCHIQ message, not a pointer to the actual data. VCSM
> did the allocation on behalf of the MMAL client, and provides the
> mapping and VPU handle to the buffer. This still leaves the allocation
> being made from gpu_mem though.
> 
> The rewrite (vc-sm-cma) was to make use of an import feature into the
> relocatable heap, termed internally as mem wrapping. Take a CMA
> allocation made by something, pass the DMA address and size across to
> the VPU, and it can insert it as a relocatable heap object that can be
> used in exactly the same way gpu_mem allocations. gpu_mem can now be
> shrunk in size :-) It was using a dma-buf as a convenient object to
> manage the allocation, and handle importing buffers allocated by other
> subsystems
> Note that we still have refcounting internally to the relocatable
> heap, so at the point the client says it has finished with it, the VPU
> may not have done. When the last relocatable heap reference is
> released, the kernel gets a callback (VC_SM_MSG_TYPE_RELEASED), and it
> is only at that point that it is safe to drop the reference to the
> imported dmabuf.
> 
> V4L2 can do the relevant import and wrapping to a relocatable heap
> handle as part of the buffer passing. MMAL needs to do it manually
> from userspace as VCHIQ is the only in-kernel service that it uses,
> hence we need an import ioctl and free mechanism (if the handle is a
> dmabuf, then that's close).
> 
> 
> From a platform level it would be nice to have the userspace ioctl for
> importing a dmabuf in mainline, however it isn't necessary for the
> V4L2 use cases that we're trying to upstream here. The driver without
> userspace API would look pretty much like the one in [1]. I'll try and
> update that to include the basic import userspace API to give a
> comparison.
> I don't mind which way this goes as to whether the userspace ioctl
> remains as downstream patches, but losing the dmabuf as the handle
> within vc-sm-cma will make that patch huge, and they're almost
> guaranteed to diverge.
> Ignore the caching ioctls - they're irrelevant.
> 
> I hope that makes the situation a little clearer.

Thanks a lot for the in-depth explanation, it does indeed. Actually, it
wouldn't hurt to add a subset of this into a text file alongside with the new
driver.

Regards,
Nicolas