linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@ziepe.ca>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-pci@vger.kernel.org,
	linux-rdma <linux-rdma@vger.kernel.org>,
	Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@lst.de>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Sagi Grimberg <sagi@grimberg.me>, Keith Busch <kbusch@kernel.org>,
	Stephen Bates <sbates@raithlin.com>
Subject: Re: [RFC PATCH 00/28] Removing struct page from P2PDMA
Date: Fri, 21 Jun 2019 14:47:24 -0300	[thread overview]
Message-ID: <20190621174724.GV19891@ziepe.ca> (raw)
In-Reply-To: <CAPcyv4jyNRBvtWhr9+aHbzWP6=D4qAME+=hWMtOYJ17BVHdy2w@mail.gmail.com>

On Thu, Jun 20, 2019 at 01:18:13PM -0700, Dan Williams wrote:

> > This P2P is quite distinct from DAX as the struct page* would point to
> > non-cacheable weird memory that few struct page users would even be
> > able to work with, while I understand DAX use cases focused on CPU
> > cache coherent memory, and filesystem involvement.
> 
> What I'm poking at is whether this block layer capability can pick up
> users outside of RDMA, more on this below...

The generic capability is to do a transfer through the block layer and
scatter/gather the resulting data to some PCIe BAR memory. Currently
the block layer can only scatter/gather data into CPU cache coherent
memory.

We know of several useful places to put PCIe BAR memory already:
 - On a GPU (or FPGA, acclerator, etc), ie the GB's of GPU private
   memory that is standard these days.
 - On a NVMe CMB. This lets the NVMe drive avoid DMA entirely
 - On a RDMA NIC. Mellanox NICs have a small amount of BAR memory that
   can be used like a CMB and avoids a DMA

RDMA doesn't really get so involved here, except that RDMA is often
the prefered way to source/sink the data buffers after the block layer has
scatter/gathered to them. (and of course RDMA is often for a block
driver, ie NMVe over fabrics)

> > > My primary concern with this is that ascribes a level of generality
> > > that just isn't there for peer-to-peer dma operations. "Peer"
> > > addresses are not "DMA" addresses, and the rules about what can and
> > > can't do peer-DMA are not generically known to the block layer.
> >
> > ?? The P2P infrastructure produces a DMA bus address for the
> > initiating device that is is absolutely a DMA address. There is some
> > intermediate CPU centric representation, but after mapping it is the
> > same as any other DMA bus address.
> 
> Right, this goes back to the confusion caused by the hardware / bus /
> address that a dma-engine would consume directly, and Linux "DMA"
> address as a device-specific translation of host memory.

I don't think there is a confusion :) Logan explained it, the
dma_addr_t is always the thing you program into the DMA engine of the
device it was created for, and this changes nothing about that.

Think of the dma vec as the same as a dma mapped SGL, just with no
available struct page.

> Is the block layer representation of this address going to go through
> a peer / "bus" address translation when it reaches the RDMA driver? 

No, it is just like any other dma mapped SGL, it is ready to go for
the device it was mapped for, and can be used for nothing other than
programming DMA on that device.

> > ie GPU people wouuld really like to do read() and have P2P
> > transparently happen to on-GPU pages. With GPUs having huge amounts of
> > memory loading file data into them is really a performance critical
> > thing.
> 
> A direct-i/o read(2) into a page-less GPU mapping? 

The interesting case is probably an O_DIRECT read into a
DEVICE_PRIVATE page owned by the GPU driver and mmaped into the
process calling read(). The GPU driver can dynamically arrange for
that DEVICE_PRIVATE page to linked to P2P targettable BAR memory so
the HW is capable of a direct CPU bypass transfer from the underlying
block device (ie NVMe or RDMA) to the GPU.

One way to approach this problem is to use this new dma_addr path in
the block layer.

Another way is to feed the DEVICE_PRIVATE pages into the block layer
and have it DMA map them to a P2P address.

In either case we have a situation where the block layer cannot touch
the target struct page buffers with the CPU because there is no cache
coherent CPU mapping for them, and we have to create a CPU clean path
in the block layer.

At best you could do memcpy to/from on these things, but if a GPU is
involved even that is incredibly inefficient. The GPU can do the
memcpy with DMA much faster than a memcpy_to/from_io.

Jason

  parent reply	other threads:[~2019-06-21 17:47 UTC|newest]

Thread overview: 89+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-20 16:12 [RFC PATCH 00/28] Removing struct page from P2PDMA Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 01/28] block: Introduce DMA direct request type Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 02/28] block: Add dma_vec structure Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 03/28] block: Warn on mis-use of dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 04/28] block: Never bounce " Logan Gunthorpe
2019-06-20 17:23   ` Jason Gunthorpe
2019-06-20 18:38     ` Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 05/28] block: Skip dma-direct bios in bio_integrity_prep() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 06/28] block: Support dma-direct bios in bio_advance_iter() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 07/28] block: Use dma_vec length in bio_cur_bytes() for dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 08/28] block: Introduce dmavec_phys_mergeable() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 09/28] block: Introduce vec_gap_to_prev() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 10/28] block: Create generic vec_split_segs() from bvec_split_segs() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 11/28] block: Create blk_segment_split_ctx Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 12/28] block: Create helper for bvec_should_split() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 13/28] block: Generalize bvec_should_split() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 14/28] block: Support splitting dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 15/28] block: Support counting dma-direct bio segments Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 16/28] block: Implement mapping dma-direct requests to SGs in blk_rq_map_sg() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 17/28] block: Introduce queue flag to indicate support for dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 18/28] block: Introduce bio_add_dma_addr() Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 19/28] nvme-pci: Support dma-direct bios Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 20/28] IB/core: Introduce API for initializing a RW ctx from a DMA address Logan Gunthorpe
2019-06-20 16:49   ` Jason Gunthorpe
2019-06-20 16:59     ` Logan Gunthorpe
2019-06-20 17:11       ` Jason Gunthorpe
2019-06-20 18:24         ` Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 21/28] nvmet: Split nvmet_bdev_execute_rw() into a helper function Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 22/28] nvmet: Use DMA addresses instead of struct pages for P2P Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 23/28] nvme-pci: Remove support for PCI_P2PDMA requests Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 24/28] block: Remove PCI_P2PDMA queue flag Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 25/28] IB/core: Remove P2PDMA mapping support in rdma_rw_ctx Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 26/28] PCI/P2PDMA: Remove SGL helpers Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 27/28] PCI/P2PDMA: Remove struct pages that back P2PDMA memory Logan Gunthorpe
2019-06-20 16:12 ` [RFC PATCH 28/28] memremap: Remove PCI P2PDMA page memory type Logan Gunthorpe
2019-06-20 18:45 ` [RFC PATCH 00/28] Removing struct page from P2PDMA Dan Williams
2019-06-20 19:33   ` Jason Gunthorpe
2019-06-20 20:18     ` Dan Williams
2019-06-20 20:51       ` Logan Gunthorpe
2019-06-21 17:47       ` Jason Gunthorpe [this message]
2019-06-21 17:54         ` Dan Williams
2019-06-24  7:31     ` Christoph Hellwig
2019-06-24 13:46       ` Jason Gunthorpe
2019-06-24 13:50         ` Christoph Hellwig
2019-06-24 13:55           ` Jason Gunthorpe
2019-06-24 16:53             ` Logan Gunthorpe
2019-06-24 18:16               ` Jason Gunthorpe
2019-06-24 18:28                 ` Logan Gunthorpe
2019-06-24 18:54                   ` Jason Gunthorpe
2019-06-24 19:37                     ` Logan Gunthorpe
2019-06-24 16:10         ` Logan Gunthorpe
2019-06-25  7:18           ` Christoph Hellwig
2019-06-20 19:34   ` Logan Gunthorpe
2019-06-20 23:40     ` Dan Williams
2019-06-20 23:42       ` Logan Gunthorpe
2019-06-24  7:27 ` Christoph Hellwig
2019-06-24 16:07   ` Logan Gunthorpe
2019-06-25  7:20     ` Christoph Hellwig
2019-06-25 15:57       ` Logan Gunthorpe
2019-06-25 17:01         ` Christoph Hellwig
2019-06-25 19:54           ` Logan Gunthorpe
2019-06-26  6:57             ` Christoph Hellwig
2019-06-26 18:31               ` Logan Gunthorpe
2019-06-26 20:21                 ` Jason Gunthorpe
2019-06-26 20:39                   ` Dan Williams
2019-06-26 20:54                     ` Jason Gunthorpe
2019-06-26 20:55                     ` Logan Gunthorpe
2019-06-26 20:45                   ` Logan Gunthorpe
2019-06-26 21:00                     ` Jason Gunthorpe
2019-06-26 21:18                       ` Logan Gunthorpe
2019-06-27  6:32                         ` Jason Gunthorpe
2019-06-27 16:09                           ` Logan Gunthorpe
2019-06-27 16:35                             ` Jason Gunthorpe
2019-06-27 16:49                               ` Logan Gunthorpe
2019-06-28  4:57                                 ` Jason Gunthorpe
2019-06-28 16:22                                   ` Logan Gunthorpe
2019-06-28 17:29                                     ` Jason Gunthorpe
2019-06-28 18:29                                       ` Logan Gunthorpe
2019-06-28 19:09                                         ` Jason Gunthorpe
2019-06-28 19:35                                           ` Logan Gunthorpe
2019-07-02 22:45                                             ` Jason Gunthorpe
2019-07-02 22:52                                               ` Logan Gunthorpe
2019-06-27  9:08                     ` Christoph Hellwig
2019-06-27 16:30                       ` Logan Gunthorpe
2019-06-27 17:00                         ` Christoph Hellwig
2019-06-27 18:00                           ` Logan Gunthorpe
2019-06-28 13:38                             ` Christoph Hellwig
2019-06-28 15:54                               ` Logan Gunthorpe
2019-06-27  9:01                 ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190621174724.GV19891@ziepe.ca \
    --to=jgg@ziepe.ca \
    --cc=axboe@kernel.dk \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=logang@deltatee.com \
    --cc=sagi@grimberg.me \
    --cc=sbates@raithlin.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).