From: Daniel Vetter <daniel@ffwll.ch>
To: "Xiong, Jianxin" <jianxin.xiong@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>, Leon Romanovsky <leon@kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	Doug Ledford <dledford@redhat.com>,
	"Vetter, Daniel" <daniel.vetter@intel.com>,
	Christian Koenig <christian.koenig@amd.com>
Subject: Re: [RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region
Date: Tue, 6 Oct 2020 11:22:14 +0200
Message-ID: <20201006092214.GX438822@phenom.ffwll.local>
In-Reply-To: <MW3PR11MB455572267489B3F6B1C5F8C5E50C0@MW3PR11MB4555.namprd11.prod.outlook.com>

On Mon, Oct 05, 2020 at 04:18:11PM +0000, Xiong, Jianxin wrote:
> > -----Original Message-----
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > Sent: Monday, October 05, 2020 6:13 AM
> > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > Cc: linux-rdma@vger.kernel.org; dri-devel@lists.freedesktop.org; Doug Ledford <dledford@redhat.com>; Leon Romanovsky
> > <leon@kernel.org>; Sumit Semwal <sumit.semwal@linaro.org>; Christian Koenig <christian.koenig@amd.com>; Vetter, Daniel
> > <daniel.vetter@intel.com>
> > Subject: Re: [RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region
> > 
> > On Sun, Oct 04, 2020 at 12:12:28PM -0700, Jianxin Xiong wrote:
> > > Dma-buf is a standard cross-driver buffer sharing mechanism that can
> > > be used to support peer-to-peer access from RDMA devices.
> > >
> > > Device memory exported via dma-buf is associated with a file descriptor.
> > > This is passed to the user space as a property associated with the
> > > buffer allocation. When the buffer is registered as a memory region,
> > > the file descriptor is passed to the RDMA driver along with other
> > > parameters.
> > >
> > > Implement the common code for importing dma-buf object and mapping
> > > dma-buf pages.
> > >
> > > Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
> > > Reviewed-by: Sean Hefty <sean.hefty@intel.com>
> > > Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> > > ---
> > >  drivers/infiniband/core/Makefile      |   2 +-
> > >  drivers/infiniband/core/umem.c        |   4 +
> > >  drivers/infiniband/core/umem_dmabuf.c | 291 ++++++++++++++++++++++++++++++++++
> > >  drivers/infiniband/core/umem_dmabuf.h |  14 ++
> > >  drivers/infiniband/core/umem_odp.c    |  12 ++
> > >  include/rdma/ib_umem.h                |  19 ++-
> > >  6 files changed, 340 insertions(+), 2 deletions(-)
> > >  create mode 100644 drivers/infiniband/core/umem_dmabuf.c
> > >  create mode 100644 drivers/infiniband/core/umem_dmabuf.h
> > 
> > I think this is using ODP too literally; dmabuf isn't going to need fine-grained page faults, and I'm not sure this locking scheme is OK - ODP is
> > horrifically complicated.
> > 
> 
> > If this is the approach, then I think we should make dmabuf its own stand-alone API, reg_user_mr_dmabuf().
> 
> That's the original approach in the first version. We can go back there.
> 
> > 
> > The implementation in mlx5 will be much more understandable; it would just do dma_buf_dynamic_attach() and program the XLT exactly
> > the same as a normal umem.
> > 
> > The move_notify() simply zaps the XLT and triggers a work item to reload it after the move. Locking is provided by dma_resv_lock. Only a
> > small disruption to the page fault handler is needed.
> > 
> 
> We considered such a scheme but didn't go that way due to the lack of
> notification when the move is done; the work wouldn't know when it
> could reload.
> 
> Now that I think about it again, we could probably signal the reload in the page fault handler.

For reinstating the pages you need the following (a rough sketch in code
follows the list):

- dma_resv_lock; this prevents anyone else from issuing new moves or
  anything like that
- dma_resv_get_excl + dma_fence_wait to wait for any pending moves to
  finish. gpus generally don't wait on the cpu, but block the dependent
  dma operations from being scheduled until that fence has fired. But for
  rdma odp I think you need the cpu wait in your worker here.
- get the new sg list and write it into your ptes
- dma_resv_unlock to make sure you're not racing with a concurrent
  move_notify
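
That sequence, sketched and hedged: the dma-buf/dma_resv calls below are
the real APIs, but rdma_write_ptes() and the surrounding function are
hypothetical stand-ins for the driver-specific XLT/PTE programming, not
the actual patch.

static int revalidate_umem_dmabuf(struct ib_umem_dmabuf *umem_dmabuf)
{
	struct dma_buf *dmabuf = umem_dmabuf->attach->dmabuf;
	struct dma_fence *fence;
	struct sg_table *sgt;
	int ret = 0;

	/* 1. Exclude new moves while we revalidate. */
	dma_resv_lock(dmabuf->resv, NULL);

	/* 2. CPU-wait for any pending move to finish. */
	fence = dma_resv_get_excl(dmabuf->resv);
	if (fence) {
		ret = dma_fence_wait(fence, false);
		if (ret)
			goto out_unlock;
	}

	/* 3. Get the new sg list and write it into the ptes. */
	sgt = dma_buf_map_attachment(umem_dmabuf->attach,
				     DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		ret = PTR_ERR(sgt);
		goto out_unlock;
	}
	ret = rdma_write_ptes(umem_dmabuf, sgt);	/* hypothetical */

out_unlock:
	/* 4. Unlock only after the HW is programmed, so a concurrent
	 * move_notify cannot race the update.
	 */
	dma_resv_unlock(dmabuf->resv);
	return ret;
}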

You can also grab multiple dma_resv_locks atomically, but I think the
odp rdma model doesn't require that (gpus need it).

Note that you're allowed to allocate memory with GFP_KERNEL while holding
dma_resv_lock, so this shouldn't pose any issues. You are otoh not
allowed to cause userspace faults (so no gup/pup or copy_*_user with
faulting enabled). So all in all this shouldn't be any worse than calling
pup for a normal umem.

Unlike with mmu notifiers, the caller already holds dma_resv_lock for you
around the move_notify callback, so you shouldn't need any additional
locking in there (aside from what you need to zap the ptes and flush hw
tlbs).
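
As an illustration, a minimal move_notify might look like this; the
dma_buf_attachment plumbing and dma_resv_assert_held are real, while
rdma_zap_ptes(), rdma_flush_hw_tlb() and the revalidate_work item are
hypothetical driver-side pieces:

/* Sketch only: the exporter calls this with dmabuf->resv already held. */
static void umem_dmabuf_move_notify(struct dma_buf_attachment *attach)
{
	struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;

	dma_resv_assert_held(attach->dmabuf->resv);

	/* Invalidate the HW mapping; no extra locking needed here. */
	rdma_zap_ptes(umem_dmabuf);		/* hypothetical */
	rdma_flush_hw_tlb(umem_dmabuf);		/* hypothetical */

	/* Defer the reload; the worker does the dma_fence_wait. */
	queue_work(system_unbound_wq, &umem_dmabuf->revalidate_work);
}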

Cheers, Daniel

> 
> > > +	dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
> > > +	sgt = dma_buf_map_attachment(umem_dmabuf->attach,
> > > +				     DMA_BIDIRECTIONAL);
> > > +	dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
> > 
> > This doesn't look right; this lock has to be held until the HW is programmed.
> 
> The mapping remains valid until it is invalidated again. There is a sequence number check before programming the HW.
> 
> > 
> > The use of atomic probably looks wrong as well.
> 
> Do you mean umem_dmabuf->notifier_seq? Could you elaborate on the concern?
> 
> > 
> > > +	k = 0;
> > > +	total_pages = ib_umem_odp_num_pages(umem_odp);
> > > +	for_each_sg(umem->sg_head.sgl, sg, umem->sg_head.nents, j) {
> > > +		addr = sg_dma_address(sg);
> > > +		pages = sg_dma_len(sg) >> page_shift;
> > > +		while (pages > 0 && k < total_pages) {
> > > +			umem_odp->dma_list[k++] = addr | access_mask;
> > > +			umem_odp->npages++;
> > > +			addr += page_size;
> > > +			pages--;
> > 
> > This isn't fragmenting the sg into a page list properly; it won't work for unaligned things.
> 
> I thought the addresses were aligned, but I will add explicit alignment handling here.
> 
> > 
> > And really we don't need the dma_list for this case; with a fixed whole-mapping DMA SGL, a normal umem sgl is OK and the normal umem
> > XLT programming in mlx5 is fine.
> 
> The dma_list is used by both "populate_mtt()" and "mlx5_ib_invalidate_range", which are used for XLT programming and invalidation (zapping), respectively.
> 
> > 
> > Jason
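
On the fragmentation point above, a hypothetical rework of the quoted
loop that rounds each sg entry out to whole pages; the names follow the
quoted hunk, ALIGN/ALIGN_DOWN are the real kernel macros, and this is an
untested illustration, not the actual patch:

	unsigned long page_size = BIT(page_shift);
	u64 addr, end;

	k = 0;
	total_pages = ib_umem_odp_num_pages(umem_odp);
	for_each_sg(umem->sg_head.sgl, sg, umem->sg_head.nents, j) {
		/* Don't assume sg_dma_address()/sg_dma_len() are
		 * page aligned; round out to page boundaries.
		 */
		addr = ALIGN_DOWN(sg_dma_address(sg), page_size);
		end = ALIGN(sg_dma_address(sg) + sg_dma_len(sg), page_size);
		while (addr < end && k < total_pages) {
			umem_odp->dma_list[k++] = addr | access_mask;
			umem_odp->npages++;
			addr += page_size;
		}
	}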

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
