From: Jason Gunthorpe <jgg@ziepe.ca>
To: Christoph Hellwig <hch@lst.de>
Cc: "Leon Romanovsky" <leon@kernel.org>,
	"Robin Murphy" <robin.murphy@arm.com>,
	"Marek Szyprowski" <m.szyprowski@samsung.com>,
	"Joerg Roedel" <joro@8bytes.org>, "Will Deacon" <will@kernel.org>,
	"Chaitanya Kulkarni" <chaitanyak@nvidia.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Jens Axboe" <axboe@kernel.dk>, "Keith Busch" <kbusch@kernel.org>,
	"Sagi Grimberg" <sagi@grimberg.me>,
	"Yishai Hadas" <yishaih@nvidia.com>,
	"Shameer Kolothum" <shameerali.kolothum.thodi@huawei.com>,
	"Kevin Tian" <kevin.tian@intel.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
	iommu@lists.linux.dev, linux-nvme@lists.infradead.org,
	kvm@vger.kernel.org, linux-mm@kvack.org,
	"Bart Van Assche" <bvanassche@acm.org>,
	"Damien Le Moal" <damien.lemoal@opensource.wdc.com>,
	"Amir Goldstein" <amir73il@gmail.com>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"jack@suse.com" <jack@suse.com>,
	"Zhu Yanjun" <zyjzyj2000@gmail.com>
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Date: Thu, 7 Mar 2024 17:01:16 -0400
Message-ID: <20240307210116.GQ9225@ziepe.ca>
In-Reply-To: <20240307150505.GA28978@lst.de>

On Thu, Mar 07, 2024 at 04:05:05PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> > > 
> > > I don't think you can do without dma_addr_t storage.  In most cases
> > > you can just store the dma_addr_t in the LE/BE encoded hardware
> > > SGL, so no extra storage should be needed though.
> > 
> > RDMA (and often DRM too) generally doesn't work like that: the driver
> > copies the page table into the device, and then the only reason to keep
> > dma_addr_t storage around is to pass it to the dma unmap API. Optionally
> > eliminating long-term dma_addr_t storage would be a worthwhile memory
> > saving for large, long-lived user space memory registrations.
> 
> It's just kinda hard to do.  For aligned IOMMU mapping you'd only
> have one dma_addr_t mapping (or maybe a few if P2P regions are
> involved), so this probably doesn't matter.  For direct mappings
> you'd have a few, but maybe the better answer is to use THP
> more aggressively and reduce the number of segments.

Right, those things have all been done. 100GB of huge pages is still
using a fair amount of memory for storing dma_addr_t's.
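
As a rough illustration only (assuming 8-byte dma_addr_t entries, and
depending on whether the registration is tracked at huge page or 4k
granularity):

   100 GB / 2 MB =     51,200 entries * 8 bytes ~= 400 KiB
   100 GB / 4 KB = 26,214,400 entries * 8 bytes ~= 200 MiB

per memory registration, on top of whatever the HW SGL itself needs.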

It is hard to do perfectly, but I think it is not so bad if we focus
on the direct-only case and simple systems that can exclude swiotlb
early on.

> > > > So are you thinking something more like a driver flow of:
> > > > 
> > > >   .. extent IO and get # aligned pages and know if there is P2P ..
> > > >   dma_init_io(state, num_pages, p2p_flag)
> > > >   if (dma_io_single_range(state)) {
> > > >        // #2, #4
> > > >        for each io()
> > > > 	    dma_link_aligned_pages(state, io range)
> > > >        hw_sgl = (state->iova, state->len)
> > > >   } else {
> > > 
> > > I think what you have as dma_io_single_range should come before
> > > the dma_init_io.  If we know we can't coalesce, it really is just a
> > > dma_map_{single,page,bvec} loop, no need for any extra state.
> > 
> > I imagine dma_io_single_range() would just check a flag in state.
> > 
> > I still want to call dma_init_io() for the non-coalescing cases
> > because all the flows, regardless of composition, should be about as
> > fast as dma_map_sg is today.
> 
> If all flows include multiple non-coalesced regions, that just makes
> things very complicated, and that's exactly what I'd want to avoid.

I don't see how to avoid it unless we say RDMA shouldn't use this API,
and RDMA using it is kind of the whole point from my perspective.

I want an API that can handle all the same complexity as dma_map_sg()
without forcing the use of scatterlist; instead, "bring your own
data structure". That is the essence of what we discussed.

An API that is inferior to dma_map_sg() is really problematic to use
with RDMA.

> > That means we need to always pre-allocate the IOVA in any case where
> > the IOMMU might be active - even on a non-coalescing flow.
> > 
> > IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> > be used and we can't just call today's dma_map_page() in a loop on the
> > non-coalescing side and pay the overhead of Nx IOVA allocations.
> > 
> > In large part this is for RDMA, where a single P2P page in a large
> > multi-gigabyte user memory registration shouldn't drastically harm
> > registration performance by falling back to doing dma_map_page(), and
> > an IOVA allocation, on a 4k page-by-page basis.
> 
> But that P2P page needs to be handled very differently, as with it
> we can't actually use a single iova range.  So I'm not sure how that
> is even supposed to work.  If you have
> 
>  +-------+-----+-------+
>  | local | P2P | local |
>  +-------+-----+-------+
> 
> you need at least 3 hw SGL entries, as the IOVA won't be contiguous.

Sure, 3 SGL entries is fine; that isn't what I'm pointing at.

I'm saying that today, if you give such a scatterlist to dma_map_sg(),
it scans it and computes the IOVA space needed, allocates one IOVA
space, and then subdivides that single space into the 3 HW SGLs you
show.

If you don't preserve that, then we end up calling dma_map_page() 4k
at a time, which is nowhere close to the same outcome as what
dma_map_sg() did: I may not get contiguous IOVA, I may not get 3 SGLs,
and we call into the IOVA allocator a huge number of times.
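
As a rough contrast, using only today's APIs (the hw_sgl / make_hw_sge
bits are made up here just for illustration):

	/* Per-page path: one IOVA allocation per 4k page, no coalescing,
	 * no contiguous IOVA guarantee. Error handling omitted. */
	for (i = 0; i != npages; i++)
		dma_addr[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
					   DMA_TO_DEVICE);

	/* dma_map_sg() path: the IOVA space is sized and allocated once,
	 * then subdivided into however many segments are actually needed
	 * (e.g. 3 around a P2P region). */
	count = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
	for_each_sg(sgl, sg, count, i)
		hw_sgl[i] = make_hw_sge(sg_dma_address(sg), sg_dma_len(sg));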

It needs to work following the same basic structure as dma_map_sg(),
unfolding that logic into helpers so that the driver can provide
the data structure:

 - Scan the io ranges and figure out how much IOVA space is needed
   (dma_io_summarize_range)
 - Allocate the IOVA (dma_init_io)
 - Scan the io ranges again and generate the final HW SGL
   (dma_io_link_page)
 - Finish the iommu batch (dma_io_done_mapping)

And you can make that pattern work for all the other cases too.
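
Roughly, the driver-side flow I have in mind (a sketch only: none of
these helpers exist yet, the names and signatures are just the proposal
above, and for_each_io_range()/hw_sgl_append() are made up):

	struct dma_io_summarize summary = {};
	struct dma_io_state state;

	/* Pass 1: figure out how much IOVA space is needed */
	for_each_io_range(io, ranges)
		dma_io_summarize_range(&summary, io);

	/* Allocate the IOVA once, if the IOMMU is in use */
	dma_init_io(dev, &state, &summary);

	/* Pass 2: map each range and emit the HW SGL entries */
	for_each_io_range(io, ranges)
		hw_sgl_append(dma_io_link_page(&state, io));

	/* Flush the IOTLB once at the end */
	dma_io_done_mapping(&state);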

So I don't see this as particularly worse: calling some other API
instead of dma_map_page() is not really added complexity for the
driver, and calling dma_init_io() every time is not either. The DMA
API side is a bit more work, but not substantively different logic
from what dma_map_sg() already does.

Otherwise what is the alternative? How do I keep these complex things
working in RDMA and remove scatterlist?

> > The other thing that got hand waved here is how does dma_init_io()
> > know which of the 6 states we are looking at? I imagine we probably
> > want to do something like:
> > 
> >    struct dma_io_summarize summary = {};
> >    for each io()
> >         dma_io_summarize_range(&summary, io range)
> >    dma_init_io(dev, &state, &summary);
> >    if (state->single_range) {
> >    } else {
> >    }
> >    dma_io_done_mapping(&state); <-- flush IOTLB once
> 
> That's why I really just want 2 cases.  If the caller guarantees the
> range is coalescable and there is an IOMMU, use the iommu-API-like
> API, else just iterate over map_single/page.

But how does the caller even know if it is coalescable? Other than the
trivial case of a single CPU range, that is a complicated detail based
on what pages are inside the range combined with the capability of the
device doing DMA. I don't see a simple way for the caller to figure
this out. You need to sweep every page and collect some information on
it. The above is to abstract that detail.
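
To make that concrete, the kind of per-page sweep the summarize step
would hide looks something like this (illustrative only; only
is_pci_p2pdma_page() is a real helper, the rest is made up):

	for (i = 0; i != npages; i++) {
		if (is_pci_p2pdma_page(pages[i]))
			summary.p2p_pages++;   /* may force extra HW SGL entries */
		else
			summary.local_pages++;
		/* plus whatever else is needed to decide if the result can
		 * be a single contiguous IOVA range for this device */
	}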

It was simpler before the confidential compute stuff :(

Jason


Thread overview: 59+ messages
2024-03-05 11:18 [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 01/16] mm/hmm: let users to tag specific PFNs Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 02/16] dma-mapping: provide an interface to allocate IOVA Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 03/16] dma-mapping: provide callbacks to link/unlink pages to specific IOVA Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 04/16] iommu/dma: Provide an interface to allow preallocate IOVA Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 05/16] iommu/dma: Prepare map/unmap page functions to receive IOVA Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 06/16] iommu/dma: Implement link/unlink page callbacks Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 07/16] RDMA/umem: Preallocate and cache IOVA for UMEM ODP Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 08/16] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 09/16] RDMA/core: Separate DMA mapping to caching IOVA and page linkage Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 10/16] RDMA/umem: Prevent UMEM ODP creation with SWIOTLB Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 11/16] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 12/16] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 13/16] vfio/mlx5: Explicitly store page list Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 14/16] vfio/mlx5: Convert vfio to use DMA link API Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 15/16] block: add dma_link_range() based API Leon Romanovsky
2024-03-05 11:18 ` [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL Leon Romanovsky
2024-03-05 15:51   ` Keith Busch
2024-03-05 16:08     ` Jens Axboe
2024-03-05 16:39       ` Chaitanya Kulkarni
2024-03-05 16:46         ` Chaitanya Kulkarni
2024-03-06 14:33     ` Christoph Hellwig
2024-03-06 15:05       ` Jason Gunthorpe
2024-03-06 16:14         ` Christoph Hellwig
2024-05-03 14:41   ` Zhu Yanjun
2024-05-05 13:23     ` Leon Romanovsky
2024-05-06  7:25       ` Zhu Yanjun
2024-03-05 12:05 ` [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps Robin Murphy
2024-03-05 12:29   ` Leon Romanovsky
2024-03-06 14:44     ` Christoph Hellwig
2024-03-06 15:43       ` Jason Gunthorpe
2024-03-06 16:20         ` Christoph Hellwig
2024-03-06 17:44           ` Jason Gunthorpe
2024-03-06 22:14             ` Christoph Hellwig
2024-03-07  0:00               ` Jason Gunthorpe
2024-03-07 15:05                 ` Christoph Hellwig
2024-03-07 21:01                   ` Jason Gunthorpe [this message]
2024-03-08 16:49                     ` Christoph Hellwig
2024-03-08 20:23                       ` Jason Gunthorpe
2024-03-09 16:14                         ` Christoph Hellwig
2024-03-10  9:35                           ` Leon Romanovsky
2024-03-12 21:28                             ` Christoph Hellwig
2024-03-13  7:46                               ` Leon Romanovsky
2024-03-13 21:44                                 ` Christoph Hellwig
2024-03-19 15:36                           ` Jason Gunthorpe
2024-03-20  8:55                             ` Leon Romanovsky
2024-03-21 22:40                               ` Christoph Hellwig
2024-03-22 17:46                                 ` Leon Romanovsky
2024-03-24 23:16                                   ` Christoph Hellwig
2024-03-21 22:39                             ` Christoph Hellwig
2024-03-22 18:43                               ` Jason Gunthorpe
2024-03-24 23:22                                 ` Christoph Hellwig
2024-03-27 17:14                                   ` Jason Gunthorpe
2024-03-07  6:01 ` Zhu Yanjun
2024-04-09 20:39   ` Zhu Yanjun
2024-05-02 23:32 ` Zeng, Oak
2024-05-03 11:57   ` Zhu Yanjun
2024-05-03 16:42   ` Jason Gunthorpe
2024-05-03 20:59     ` Zeng, Oak
