From: Jason Gunthorpe <jgg@nvidia.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Joao Martins <joao.m.martins@oracle.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Logan Gunthorpe <logang@deltatee.com>,
	Ming Lei <ming.lei@redhat.com>,
	linux-block@vger.kernel.org, netdev@vger.kernel.org,
	linux-mm@kvack.org, linux-rdma@vger.kernel.org,
	dri-devel@lists.freedesktop.org, nvdimm@lists.linux.dev
Subject: Re: Phyr Starter
Date: Tue, 11 Jan 2022 18:53:06 -0400
Message-ID: <20220111225306.GR2328285@nvidia.com>
In-Reply-To: <Yd311C45gpQ3LqaW@casper.infradead.org>

On Tue, Jan 11, 2022 at 09:25:40PM +0000, Matthew Wilcox wrote:
> > I don't need the sgt at all. I just need another list of physical
> > addresses for DMA. I see no issue with a physr_list storing either CPU
> > Physical Address or DMA Physical Addresses, same data structure.
> 
> There's a difference between a phys_addr_t and a dma_addr_t.  They
> can even be different sizes; some architectures use a 32-bit dma_addr_t
> and a 64-bit phys_addr_t or vice-versa.  phyr cannot store DMA addresses.

I know, but I'm not sure optimizing for a 32 bit phys_addr_t is
worthwhile. So I imagine the phyr address forced to 64 bits so it can
always hold a dma_addr_t, and then we can re-use all the machinery that
supports it for the DMA list as well.

Even on 32 bit phys_addr_t platforms a scatterlist entry is still 24
bytes, so forcing 8 byte addresses in the physr CPU list is still a net
space win.
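
Very roughly, the entry I have in mind looks like this (names invented
just to show the shape, not a real kernel type):

	/*
	 * One physically contiguous range.  addr is always 64 bits so the
	 * same entry can carry either a CPU physical address or a
	 * dma_addr_t, zero-extended on 32 bit platforms.
	 */
	struct physr_entry {
		u64 addr;
		u64 len;
	};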

> > Mode 01 (Up to 2^48 bytes of memory on a 4k alignment)
> >   31:0 - # of order pages
> > 
> > Mode 10 (Up to 2^25 bytes of memory on a 1 byte alignment)
> >   11:0 - starting byte offset in the 4k
> >   31:12 - 20 bits, plus the 5 bit order from the first 8 bytes:
> >           length in bytes
> 
> Honestly, this looks awful to operate on.  Mandatory 8-bytes per entry
> with an optional 4 byte extension?

I expect it is. If we don't value memory efficiency that highly then
make it simpler: a fixed 12 bytes per entry means the worst case (CPU
list plus DMA list) is still only 24 bytes/page, so it isn't a
degradation from scatterlist.

Unfortunately 16 bytes is a degradation.

My point is that the structure can hold everything scatterlist holds,
and we can trade some CPU power to achieve memory compression. I don't
know where the right balance is, but it suggests to me that a general
flexible array holding 64 bit addr/length intervals is a useful generic
data structure for this problem.
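
To make that concrete, this is the kind of generic container I mean,
with the merge step that buys the compression (again only a sketch,
every name here is invented):

	struct physr_list {
		unsigned int nr;		/* used entries */
		unsigned int max;		/* allocated entries */
		struct physr_entry *entries;	/* one flat allocation */
	};

	/*
	 * Append a range, merging into the previous entry when it is
	 * physically contiguous - the merging is where the memory win
	 * over one scatterlist entry per page comes from.
	 */
	static int physr_list_append(struct physr_list *l, u64 addr, u64 len)
	{
		if (l->nr) {
			struct physr_entry *prev = &l->entries[l->nr - 1];

			if (prev->addr + prev->len == addr) {
				prev->len += len;
				return 0;
			}
		}
		if (l->nr == l->max)
			return -ENOMEM;	/* real code would grow the array */
		l->entries[l->nr].addr = addr;
		l->entries[l->nr].len = len;
		l->nr++;
		return 0;
	}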

> > Well, I'm not comfortable with the idea above where RDMA would have to
> > take a memory penalty to use the new interface. To avoid that memory
> > penalty we need to get rid of scatterlist entirely.
> > 
> > If we do the 16 byte struct from the first email then a umem for MRs
> > will increase in memory consumption by 160% compared today's 24
> > bytes/page. I think the HPC workloads will veto this.
> 
> Huh?  We do 16 bytes per physically contiguous range.  Then, if your
> HPC workloads use an IOMMU that can map a virtually contiguous range
> into a single sg entry, it uses 24 bytes for the entire mapping.  It
> should shrink.

An IOMMU is not common in those cases; it is slow.

So you end up with 16 bytes per entry plus another 24 bytes in the
entirely redundant scatterlist. That is now 40 bytes/page for the
typical HPC case, and I can't see that being OK.
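
Spelling the arithmetic out, assuming 16 byte entries and the fully
fragmented case of one entry per 4k page:

	  16 bytes/page  (physr CPU list)
	+ 24 bytes/page  (struct scatterlist carrying the DMA mapping)
	= 40 bytes/page, versus 24 bytes/page with scatterlist alone,
	  about a 1.67x increase.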

> > > I just want to delete page_link, offset and length from struct
> > > scatterlist.  Given the above sequence of calls, we're going to get
> > > sg lists that aren't chained.  They may have to be vmalloced, but
> > > they should be contiguous.
> > 
> > I don't understand that? Why would the SGL out of the iommu suddenly
> > not be chained?
> 
> Because it's being given a single set of ranges to map, instead of
> being given 512 pages at a time.

I still don't understand what you are describing here. I don't know of
any case where a struct scatterlist would be vmalloc'd rather than page
chained - we don't even support that, do we?

> It would only be slow for degenerate cases where the pinned memory
> is fragmented and not contiguous.

Degenerate? This is the normal case today, isn't it? It was for RDMA
the last time I looked; even small allocations like < 64k were
fragmented...

> > IMHO, the scatterlist has to go away. The interface should be physr
> > list in, physr list out.
> 
> That's reproducing the bad decision of the scatterlist, only with
> a different encoding.  You end up with something like:
> 
> struct neoscat {
> 	dma_addr_t dma_addr;
> 	phys_addr_t phys_addr;
> 	size_t dma_len;
> 	size_t phys_len;
> };

This isn't what I mean at all!

I imagine a generic data structure that can hold an array of 64 bit
intervals.

The DMA map operation takes in this array holding CPU addresses,
allocates a new array, fills it with DMA addresses and returns
that. The caller ends up with two arrays in two memory allocations.

No scatterlist required.

It undoes the bad design of scatterlist by forcing the CPU and DMA
lists to be in different memory allocations.
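
Shape-wise the entry points would look something like this (function
names invented for illustration, this is not an existing API):

	/*
	 * Take the CPU physical interval list, allocate and return a new
	 * interval list holding the DMA addresses.  Two lists, two
	 * allocations, no scatterlist anywhere.
	 */
	struct physr_list *physr_dma_map(struct device *dev,
					 const struct physr_list *cpu,
					 enum dma_data_direction dir);

	void physr_dma_unmap(struct device *dev, struct physr_list *dma,
			     enum dma_data_direction dir);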

I just want to share the whole API that will have to exist to
reasonably support this flexible array of intervals data structure.

Jason

