On Mon, 2020-05-18 at 16:48 +0100, Dave Stevenson wrote: > Hi Nicolas > > > + exp_info.ops = &dma_buf_import_ops; > > > + exp_info.size = import.size; > > > + exp_info.flags = O_RDWR; > > > + exp_info.priv = buffer; > > > + > > > + buffer->dma_buf = dma_buf_export(&exp_info); > > > > Could you comment on the need for this second dma_buf? I've only reviewed > > code > > related to mmal-vchiq imports, but it seems to me that it would be saner to > > do > > the unmapping and unattaching explicitly as opposed to having this second > > buffer refcount hit 0. Although, I can imagine this being needed for the > > userspace interface. > > Indeed, as it is needed for the userspace interface it seemed to make > more sense to have common handling rather than two code paths doing > nearly the same thing but in different ways. > Downstream we need a userspace import at least to allow MMAL to set up > zero copy, so unless it raises any real objections then it would be > useful to keep it. > > > When you talk about moving to dmabuf heaps, I've pictured a specific dmabuf > > heap for vc4 that takes care of all the importing and unimporting (aside > > from > > cma allocations). Am I right? If so, I'm pretty confident we can do away > > with > > this. > > (Note I'm talking about the VideoCore4 VPU and other blocks, and not > the vc4 DRM/KMS and V3D drivers) > > No, I'm looking at using the existing cma_heap driver to do the > allocations, and then this driver will import them and handle the > lifetime on behalf of the VPU. There's no need for VPU allocations to > be split off into yet another heap. Fair enough. > One of the things we are trying to get away from is having the gpu_mem > reserved lump that Linux can't get access to at all, so allocating > from the CMA heap and importing to the VPU avoids that. That's great! Will this also apply at some point to the GPU side of things? I vaguely recall having to reserve up to 300M on rpi3 to get mesa to work on a desktop environment. > I'll give some history here, which also hopefully covers your query > over switching mmal-vchiq to zero copy. > > Almost all the VC4 blocks need contiguous memory, so fragmentation was > an issue. To resolve that we (back in Broadcom days) had the > "relocatable heap" - allocations that needed to be locked before > access and unlocked after. Unlocked blocks could be copied and moved > around to free up larger contiguous blocks. These allocations use a > handle instead of a pointer, and have internal refcounting etc. > Basically providing some of the features of an MMU when you don't have > one. > > The original VCSM driver allowed userspace to make a relocatable heap > allocation, lock it, and the kernel to map the relevant pages into the > ARM memory space. Now you have a shared buffer, and VCHIQ no longer > has to copy the data back and forth. (Cache flushing was also > handled). > So MMAL in zero copy mode passes the VPU relocatable heap handle > across in the VCHIQ message, not a pointer to the actual data. VCSM > did the allocation on behalf of the MMAL client, and provides the > mapping and VPU handle to the buffer. This still leaves the allocation > being made from gpu_mem though. > > The rewrite (vc-sm-cma) was to make use of an import feature into the > relocatable heap, termed internally as mem wrapping. Take a CMA > allocation made by something, pass the DMA address and size across to > the VPU, and it can insert it as a relocatable heap object that can be > used in exactly the same way gpu_mem allocations. gpu_mem can now be > shrunk in size :-) It was using a dma-buf as a convenient object to > manage the allocation, and handle importing buffers allocated by other > subsystems > Note that we still have refcounting internally to the relocatable > heap, so at the point the client says it has finished with it, the VPU > may not have done. When the last relocatable heap reference is > released, the kernel gets a callback (VC_SM_MSG_TYPE_RELEASED), and it > is only at that point that it is safe to drop the reference to the > imported dmabuf. > > V4L2 can do the relevant import and wrapping to a relocatable heap > handle as part of the buffer passing. MMAL needs to do it manually > from userspace as VCHIQ is the only in-kernel service that it uses, > hence we need an import ioctl and free mechanism (if the handle is a > dmabuf, then that's close). > > > From a platform level it would be nice to have the userspace ioctl for > importing a dmabuf in mainline, however it isn't necessary for the > V4L2 use cases that we're trying to upstream here. The driver without > userspace API would look pretty much like the one in [1]. I'll try and > update that to include the basic import userspace API to give a > comparison. > I don't mind which way this goes as to whether the userspace ioctl > remains as downstream patches, but losing the dmabuf as the handle > within vc-sm-cma will make that patch huge, and they're almost > guaranteed to diverge. > Ignore the caching ioctls - they're irrelevant. > > I hope that makes the situation a little clearer. Thanks a lot for the in-depth explanation, it does indeed. Actually, it wouldn't hurt to add a subset of this into a text file alongside with the new driver. Regards, Nicolas