From: Jason Gunthorpe <jgg@nvidia.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Joao Martins <joao.m.martins@oracle.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Logan Gunthorpe <logang@deltatee.com>,
	Ming Lei <ming.lei@redhat.com>,
	linux-block@vger.kernel.org, netdev@vger.kernel.org,
	linux-mm@kvack.org, linux-rdma@vger.kernel.org,
	dri-devel@lists.freedesktop.org, nvdimm@lists.linux.dev
Subject: Re: Phyr Starter
Date: Tue, 11 Jan 2022 18:53:06 -0400
Message-ID: <20220111225306.GR2328285@nvidia.com>
In-Reply-To: <Yd311C45gpQ3LqaW@casper.infradead.org>

On Tue, Jan 11, 2022 at 09:25:40PM +0000, Matthew Wilcox wrote:
> > I don't need the sgt at all. I just need another list of physical
> > addresses for DMA. I see no issue with a physr_list storing either CPU
> > Physical Address or DMA Physical Addresses, same data structure.
> 
> There's a difference between a phys_addr_t and a dma_addr_t.  They
> can even be different sizes; some architectures use a 32-bit dma_addr_t
> and a 64-bit phys_addr_t or vice-versa.  phyr cannot store DMA addresses.

I know, but I'm not sure optimizing for a 32 bit phys_addr_t is
worthwhile. So I imagine the phyr address forced to 64 bits so it can
always hold a dma_addr_t, and then we can re-use all the machinery that
supports it for the DMA list as well.

Even on 32 bit phys_addr_t platforms a scatterlist entry is still 24
bytes, so forcing 8 byte addresses in the physr CPU list is still a net
space win.
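
Very roughly, the entry I have in mind looks like this (names invented
just to show the shape, not a real kernel type):

	/*
	 * One physically contiguous range.  addr is always 64 bits so the
	 * same entry can carry either a CPU physical address or a
	 * dma_addr_t, zero-extended on 32 bit platforms.
	 */
	struct physr_entry {
		u64 addr;
		u64 len;
	};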

> > Mode 01 (Up to 2^48 bytes of memory on a 4k alignment)
> >   31:0 - # of order pages
> > 
> > Mode 10 (Up to 2^25 bytes of memory on a 1 byte alignment)
> >   11:0 - starting byte offset in the 4k
> >   31:12 - 20 bits, plus the 5 bit order from the first 8 bytes:
> >           length in bytes
> 
> Honestly, this looks awful to operate on.  Mandatory 8-bytes per entry
> with an optional 4 byte extension?

I expect it is. If we don't value memory efficiency that highly then
make it simpler: a fixed 12 bytes per entry means the worst case (CPU
list plus DMA list) is still only 24 bytes/page, so it isn't a
degradation from scatterlist.

Unfortunately 16 bytes is a degradation.

My point is that the structure can hold everything scatterlist holds,
and we can trade some CPU power to achieve memory compression. I don't
know where the right balance is, but it suggests to me that a general
flexible array holding 64 bit addr/length intervals is a useful generic
data structure for this problem.
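
To make that concrete, this is the kind of generic container I mean,
with the merge step that buys the compression (again only a sketch,
every name here is invented):

	struct physr_list {
		unsigned int nr;		/* used entries */
		unsigned int max;		/* allocated entries */
		struct physr_entry *entries;	/* one flat allocation */
	};

	/*
	 * Append a range, merging into the previous entry when it is
	 * physically contiguous - the merging is where the memory win
	 * over one scatterlist entry per page comes from.
	 */
	static int physr_list_append(struct physr_list *l, u64 addr, u64 len)
	{
		if (l->nr) {
			struct physr_entry *prev = &l->entries[l->nr - 1];

			if (prev->addr + prev->len == addr) {
				prev->len += len;
				return 0;
			}
		}
		if (l->nr == l->max)
			return -ENOMEM;	/* real code would grow the array */
		l->entries[l->nr].addr = addr;
		l->entries[l->nr].len = len;
		l->nr++;
		return 0;
	}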

> > Well, I'm not comfortable with the idea above where RDMA would have to
> > take a memory penalty to use the new interface. To avoid that memory
> > penalty we need to get rid of scatterlist entirely.
> > 
> > If we do the 16 byte struct from the first email then a umem for MRs
> > will increase in memory consumption by 160% compared today's 24
> > bytes/page. I think the HPC workloads will veto this.
> 
> Huh?  We do 16 bytes per physically contiguous range.  Then, if your
> HPC workloads use an IOMMU that can map a virtually contiguous range
> into a single sg entry, it uses 24 bytes for the entire mapping.  It
> should shrink.

An IOMMU is not common in those cases; it is slow.

So you end up with 16 bytes per entry plus another 24 bytes in the
entirely redundant scatterlist. That is now 40 bytes/page for the
typical HPC case, and I can't see that being OK.
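
Spelling the arithmetic out, assuming 16 byte entries and the fully
fragmented case of one entry per 4k page:

	  16 bytes/page  (physr CPU list)
	+ 24 bytes/page  (struct scatterlist carrying the DMA mapping)
	= 40 bytes/page, versus 24 bytes/page with scatterlist alone,
	  about a 1.67x increase.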

> > > I just want to delete page_link, offset and length from struct
> > > scatterlist.  Given the above sequence of calls, we're going to get
> > > sg lists that aren't chained.  They may have to be vmalloced, but
> > > they should be contiguous.
> > 
> > I don't understand that? Why would the SGL out of the iommu suddenly
> > not be chained?
> 
> Because it's being given a single set of ranges to map, instead of
> being given 512 pages at a time.

I still don't understand what you are describing here. I don't know of
any case where a struct scatterlist would be vmalloc'd rather than page
chained - we don't even support that, do we?

> It would only be slow for degenerate cases where the pinned memory
> is fragmented and not contiguous.

Degenerate? This is the normal case today, isn't it? It was for RDMA
the last time I looked; even small allocations like < 64k were
fragmented...

> > IMHO, the scatterlist has to go away. The interface should be physr
> > list in, physr list out.
> 
> That's reproducing the bad decision of the scatterlist, only with
> a different encoding.  You end up with something like:
> 
> struct neoscat {
> 	dma_addr_t dma_addr;
> 	phys_addr_t phys_addr;
> 	size_t dma_len;
> 	size_t phys_len;
> };

This isn't what I mean at all!

I imagine a generic data structure that can hold an array of 64 bit
intervals.

The DMA map operation takes in this array holding CPU addresses,
allocates a new array, fills it with DMA addresses and returns
that. The caller ends up with two arrays in two memory allocations.

No scatterlist required.

It undoes the bad design of scatterlist by forcing the CPU and DMA
lists to be in different memory allocations.
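
Shape-wise the entry points would look something like this (function
names invented for illustration, this is not an existing API):

	/*
	 * Take the CPU physical interval list, allocate and return a new
	 * interval list holding the DMA addresses.  Two lists, two
	 * allocations, no scatterlist anywhere.
	 */
	struct physr_list *physr_dma_map(struct device *dev,
					 const struct physr_list *cpu,
					 enum dma_data_direction dir);

	void physr_dma_unmap(struct device *dev, struct physr_list *dma,
			     enum dma_data_direction dir);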

I just want to share the whole API that will have to exist to
reasonably support this flexible array of intervals data structure.

Jason

