Re: [PATCH rdma-next v5 0/4] Dynamicaly allocate SG table from the pages

From: Daniel Vetter <daniel@ffwll.ch>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>,
	Doug Ledford <dledford@redhat.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Christoph Hellwig <hch@lst.de>, David Airlie <airlied@linux.ie>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	intel-gfx <intel-gfx@lists.freedesktop.org>,
	Jani Nikula <jani.nikula@linux.intel.com>,
	Joonas Lahtinen <joonas.lahtinen@linux.intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	Maor Gottlieb <maorg@nvidia.com>,
	Rodrigo Vivi <rodrigo.vivi@intel.com>,
	Roland Scheidegger <sroland@vmware.com>,
	Tvrtko Ursulin <tvrtko.ursulin@intel.com>,
	VMware Graphics <linux-graphics-maintainer@vmware.com>
Subject: Re: [PATCH rdma-next v5 0/4] Dynamicaly allocate SG table from the pages
Date: Wed, 7 Oct 2020 10:15:31 +0200	[thread overview]
Message-ID: <CAKMK7uG5UOS5360_HjJyroLE8b+6wrhT291PaqjFbii+BT7+Hg@mail.gmail.com> (raw)
In-Reply-To: <20201006114627.GE5177@ziepe.ca>

On Wed, Oct 7, 2020 at 9:22 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Oct 06, 2020 at 12:41:22PM +0200, Daniel Vetter wrote:
> > On Mon, Oct 05, 2020 at 08:56:50PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Oct 04, 2020 at 06:43:36PM +0300, Leon Romanovsky wrote:
> > > > This series extends __sg_alloc_table_from_pages to allow chaining of
> > > > new pages to already initialized SG table.
> > > >
> > > > This allows for the drivers to utilize the optimization of merging contiguous
> > > > pages without a need to pre allocate all the pages and hold them in
> > > > a very large temporary buffer prior to the call to SG table initialization.
> > > >
> > > > The second patch changes the Infiniband driver to use the new API. It
> > > > removes duplicate functionality from the code and benefits the
> > > > optimization of allocating dynamic SG table from pages.
> > > >
> > > > In huge pages system of 2MB page size, without this change, the SG table
> > > > would contain x512 SG entries.
> > > > E.g. for 100GB memory registration:
> > > >
> > > >              Number of entries      Size
> > > >     Before        26214400          600.0MB
> > > >     After            51200            1.2MB
> > > >
> > > > Thanks
> > > >
> > > > Maor Gottlieb (2):
> > > >   lib/scatterlist: Add support in dynamic allocation of SG table from
> > > >     pages
> > > >   RDMA/umem: Move to allocate SG table from pages
> > > >
> > > > Tvrtko Ursulin (2):
> > > >   tools/testing/scatterlist: Rejuvenate bit-rotten test
> > > >   tools/testing/scatterlist: Show errors in human readable form
> > >
> > > This looks OK, I'm going to send it into linux-next on the hmm tree
> > > for awhile to see if anything gets broken. If there is more
> > > remarks/tags/etc please continue
> >
> > An idea that just crossed my mind: A pin_user_pages_sgt might be useful
> > for both rdma and drm, since this would avoid the possible huge interim
> > struct pages array for thp pages. Or anything else that could be coalesced
> > down into a single sg entry.
> >
> > Not sure it's worth it, but would at least give a slightly neater
> > interface I think.
>
> We've talked about it. Christoph wants to see this area move to a biovec
> interface instead of sgl, but it might still be worthwhile to have an
> interm step at least as an API consolidation.

Hm but then we'd need a new struct for the mapped side of things
(which would still be what you get from dma-buf). That would be quite
a bit of work to roll out everywhere, and sgt isn't such a huge misfit
for passing buffer object mappings and system memory backing storage
around, and hence what we (very slowly) converging drivers/gpu towards
over the past 10 years or so.

And moving the dma_map step out of dma-buf doesn't work, because some
of the use-cases we have is for very special iommus which are managed
by the gpu driver directly. Stuff that e.g. rotates/retiles/compresses
on the fly, and is accessible by other (gfx related like video code,
camera, ..) devices. Not something I expect to ever be relevant for
rdma since this exist mostly on some small soc, but it's a thing.
Without that dma-buf could hand out biovec for struct_page backed
stuff, or some pfn_vec for the p2p stuff.

Anyway was just an idea, I guess we'll have to live with some
impedance mismatch since rolling out the one an only iovec structure
which suits everyone is I think impossible :-)

> Avoiding the page list would be complicated as we'd somehow have to
> code share the page table iterator scheme.

We're (slowly) getting towards thp for vram mappings and everything so
I guess for drivers/gpu we might make that happen. But yeah it'd be
not so pretty I think.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch