From: Bjorn Helgaas <helgaas@kernel.org>
To: Logan Gunthorpe <logang@deltatee.com>
Cc: "Jens Axboe" <axboe@kernel.dk>,
"Alex Williamson" <alex.williamson@redhat.com>,
linux-nvdimm@lists.01.org, linux-rdma@vger.kernel.org,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
"Jérôme Glisse" <jglisse@redhat.com>,
"Jason Gunthorpe" <jgg@mellanox.com>,
"Christian König" <christian.koenig@amd.com>,
"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
"Bjorn Helgaas" <bhelgaas@google.com>,
"Max Gurtovoy" <maxg@mellanox.com>,
"Christoph Hellwig" <hch@lst.de>
Subject: Re: [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory
Date: Wed, 10 Oct 2018 15:19:18 -0500 [thread overview]
Message-ID: <20181010201918.GF5906@bhelgaas-glaptop.roam.corp.google.com> (raw)
In-Reply-To: <20181004212747.6301-1-logang@deltatee.com>
On Thu, Oct 04, 2018 at 03:27:34PM -0600, Logan Gunthorpe wrote:
> This is v9 for the p2pdma patch set. It has some substantial changes
> from the previous version. Essentially, the last patch has changed
> based on information from Sagi that the P2P memory can be assigned
> per namespace instead of globally. To that end, I've added a
> radix tree that stores the p2p devices to use for each namespace and
> a lookup will be done on that before allocating the memory.
>
> This has some knock-on effects that simplify the requirements
> of the p2pdma code, so we drop all the client list code since
> we no longer need to worry about selecting a P2P device that works
> with every namespace at once; only with each namespace independently.
> Thus, Patches 1, 6 and 13 have significant changes. However,
> I think the removal of the client list code will be generally
> appreciated based on the feedback I've gotten from Christian.
>
> As usual, a git repo of the branch is available here:
>
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v9
>
> Thanks,
>
> Logan
>
> --
>
> Changes in v9:
> * Rebased on v4.19-rc6
> * Dropped pci_p2pdma_add_client(), pci_p2pdma_remove_client() and
> pci_p2pdma_client_list_free(), and reworked pci_p2pdma_distance() and
> pci_p2pmem_find() to take simple arrays instead of list_heads.
> * Updated the documentation to match
> * Reworked the nvmet code to have a configfs attribute per namespace
> instead of per port. Each namespace then has its own P2P device
> stored in a radix tree in the controller (it can't be stored in
> the nvme_ns structure since it needs to be unique per controller).
> * Dropped the 'sq' argument in nvmet_req_alloc_sgl() since I noticed
> it is now available in the nvmet_req structure.
> * Collected Keith's reviewed-by tags (however, I have not applied his
> tag to the last patch since I think it has changed too much since
> his review).
>
> Changes in v8:
>
> * Added a couple of comments to address Bart's feedback and
> removed the bogus call to percpu_ref_switch_to_atomic_sync()
>
> Changes in v7:
>
> * Rebased on v4.19-rc5
>
> * Fixed up commit message of patch 7 that was no longer accurate. (as
> pointed out by Jens)
> * Change the configfs attribute to not use "auto" or "none" and instead
> just use 0/1/<pci_dev> (i.e., a boolean or a PCI device name). This
> matches the existing nvme-target configfs booleans. (Per Bjorn)
> * A handful of other minor changes and edits that were noticed by Bjorn
> * Collected Acks from Bjorn
>
> Changes in v6:
>
> * Rebased on v4.19-rc3
>
> * Remove the changes to the common submit_bio() path and instead
> set REQ_NOMERGE in the NVME target driver, when appropriate.
> Per discussions with Jens and Christoph.
>
> * Some minor grammar changes in the documentation as spotted by Randy.
>
> Changes in v5:
>
> * Rebased on v4.19-rc1
>
> * Drop changing ACS settings in this patchset. Now, the code
> will only allow P2P transactions between devices whose
> downstream ports do not restrict P2P TLPs.
>
> * Drop the REQ_PCI_P2PDMA block flag and instead use
> is_pci_p2pdma_page() to tell if a request is P2P or not. In that
> case we check for queue support and enforce using REQ_NOMERGE.
> Per feedback from Christoph.
>
> * Drop the pci_p2pdma_unmap_sg() function as it was empty and only
> there for symmetry and compatibility with dma_unmap_sg. Per feedback
> from Christoph.
>
> * Split off the logic to handle enabling P2P in NVMe fabrics' configfs
> into specific helpers in the p2pdma code. Per feedback from Christoph.
>
> * A number of other minor cleanups and fixes as pointed out by
> Christoph and others.
>
> Changes in v4:
>
> * Change the original upstream_bridges_match() function to
> upstream_bridge_distance(), which calculates the distance between two
> devices as long as they are behind the same root port. This should
> address Bjorn's concern that the code was too focused on
> being behind a single switch.
>
> * The disable ACS function now disables ACS for all bridge ports instead
> of just switch ports (i.e., those that had two upstream_bridge ports).
>
> * Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
> API to be more like sgl_alloc() in that the alloc function returns
> the allocated scatterlist and nents is not required by the free
> function.
>
> * Moved the new documentation into the driver-api tree as requested
> by Jonathan
>
> * Add SGL alloc and free helpers in the nvmet code so that the
> individual drivers can share the code that allocates P2P memory.
> As requested by Christoph.
>
> * Cleanup the nvmet_p2pmem_store() function as Christoph
> thought my first attempt was ugly.
>
> * Numerous commit message and comment fix-ups
>
> Changes in v3:
>
> * Many more fixes and minor cleanups that were spotted by Bjorn
>
> * Additional explanation of the ACS change in both the commit message
> and Kconfig doc. Also, the code that disables the ACS bits is surrounded
> explicitly by an #ifdef
>
> * Removed the flag we added to rdma_rw_ctx() in favour of using
> is_pci_p2pdma_page(), as suggested by Sagi.
>
> * Adjust pci_p2pmem_find() so that it prefers P2P providers that
> are closest to (or the same as) the clients using them. In cases
> of ties, the provider is randomly chosen.
>
> * Modify the NVMe Target code so that the PCI device name of the provider
> may be explicitly specified, bypassing the logic in pci_p2pmem_find().
> (Note: it's still enforced that the provider must be behind the
> same switch as the clients).
>
> * As requested by Bjorn, added documentation for driver writers.
>
>
> Changes in v2:
>
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> as a bunch of cleanup and spelling fixes he pointed out in the last
> series.
>
> * To address Alex's ACS concerns, we change to a simpler method of
> just disabling ACS behind switches for any kernel that has
> CONFIG_PCI_P2PDMA.
>
> * We also reject using devices that employ 'dma_virt_ops' which should
> fairly simply handle Jason's concerns that this work might break with
> the HFI, QIB and rxe drivers that use the virtual ops to implement
> their own special DMA operations.
>
> --
>
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in the kernel with initial support for the NVMe fabrics target
> subsystem. Many thanks go to Christoph Hellwig who provided valuable
> feedback to get these patches to where they are today.
>
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVMe target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU).
>
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch hierarchy. This will mean many setups that
> could likely work well will not be supported so that we can be more
> confident it will work and not place any responsibility on the user to
> understand their topology. (We chose to go this route based on feedback
> we received at the last LSF). Future work may enable these transfers
> using a whitelist of known-good root complexes. However, at this time,
> there is no reliable way to ensure that Peer-to-Peer transactions are
> permitted between PCI Root Ports.
>
> For PCI P2P DMA transfers to work in this situation the ACS bits
> must be disabled on the downstream ports (DSPs) for all devices
> involved in the transfer. This can be done using the "disable_acs_redir"
> PCI kernel command line option which was introduced in v4.19.
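>
> For example, with a hypothetical topology where the devices involved
> sit below downstream ports 0000:03:01.0 and 0000:03:02.0 (addresses
> chosen purely for illustration), the ACS redirect bits could be
> cleared at boot with:
>
> ```
> disable_acs_redir=0000:03:01.0;0000:03:02.0
> ```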
>
> In order to enable PCI P2P functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
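>
> A rough sketch of how a provider driver might use these helpers
> (pci_p2pdma_add_resource() and pci_p2pmem_publish() per patch 1 of
> this series; the BAR number and error handling are illustrative, and
> this is in-kernel code, not a standalone program):
>
> ```c
> /* In the provider's probe() routine: register all of BAR 4 as
>  * P2P DMA memory and publish it for other drivers to request. */
> static int example_setup_p2pmem(struct pci_dev *pdev)
> {
> 	int error;
>
> 	/* Creates struct pages via devm_memremap_pages() and records
> 	 * the PCI bus offset in the pagemap structure. */
> 	error = pci_p2pdma_add_resource(pdev, 4 /* BAR */, 0 /* whole BAR */, 0);
> 	if (error)
> 		return error;
>
> 	/* Make the memory visible to pci_p2pmem_find() users. */
> 	pci_p2pmem_publish(pdev, true);
> 	return 0;
> }
> ```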
>
> Another set of functions allows a client driver to create a list of
> client devices that will be used in a given P2P transaction and then
> use that list to find any P2P memory that is supported by all the
> client devices.
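>
> A client might use this roughly as follows (the array-based
> pci_p2pmem_find() per the v9 changelog above; exact signatures and
> the fallback behaviour here are illustrative only):
>
> ```c
> /* Find a P2P memory provider usable by both the NIC and the NVMe
>  * device, then carve an SGL out of its exported BAR memory. */
> struct device *clients[] = { &nic_pdev->dev, &nvme_pdev->dev };
> struct pci_dev *provider;
> struct scatterlist *sgl;
>
> provider = pci_p2pmem_find(clients, ARRAY_SIZE(clients));
> if (!provider)
> 	return -ENODEV;	/* fall back to regular system memory */
>
> sgl = pci_p2pmem_alloc_sgl(provider, transfer_len);
> if (!sgl)
> 	return -ENOMEM;
> ```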
>
> In the block layer, we also introduce a flag for a request queue
> to indicate that a given queue supports targeting P2P memory. The driver
> submitting bios must ensure that the queue supports P2P before
> attempting to submit BIOs backed by P2P memory. Also, P2P requests
> are marked to not be merged, since a non-homogeneous request would
> complicate the DMA mapping requirements.
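>
> Sketched with the flag and helper names from patch 7 (illustrative
> in-kernel fragments, not a complete driver):
>
> ```c
> /* Driver setting up a queue that can target P2P memory: */
> blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, q);
>
> /* Submitter checking support before issuing a P2P-backed bio: */
> if (is_pci_p2pdma_page(sg_page(sg)) && !blk_queue_pci_p2pdma(q))
> 	return -EINVAL;	/* queue can't target P2P memory */
> ```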
>
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the correct pci_bus_addr_t for the DMA transactions.
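>
> In sketch form (using the pci_p2pmem_map_sg() name from above; the
> surrounding iod/dev details are borrowed loosely from nvme-pci and
> are illustrative):
>
> ```c
> /* P2P pages need the bus-offset mapping; everything else goes
>  * through the usual DMA API. */
> if (is_pci_p2pdma_page(sg_page(iod->sg)))
> 	nr_mapped = pci_p2pmem_map_sg(dev->dev, iod->sg, iod->nents);
> else
> 	nr_mapped = dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir);
> ```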
>
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not. To avoid odd RDMA devices
> that don't use the proper DMA infrastructure, this code rejects using
> any device that employs the dma_virt_ops implementation.
>
> Finally, in the NVMe fabrics target port we introduce a new
> configuration attribute: 'p2pmem'. When set to a true boolean, the port
> will attempt to find P2P memory supported by the RDMA NIC and all namespaces.
> It may also be set to a PCI device name to select a specific P2P
> memory to use. If supported memory is found, it will be used in all IO
> transfers. And if a port is using P2P memory, adding new namespaces that
> are not supported by that memory will fail.
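>
> Assuming the standard nvmet configfs layout, enabling this might look
> like the following (the paths, port number, and PCI address are
> illustrative):
>
> ```
> # Let the kernel pick a suitable P2P memory provider:
> echo 1 > /sys/kernel/config/nvmet/ports/1/p2pmem
>
> # Or name a specific provider by PCI device:
> echo 0000:04:00.0 > /sys/kernel/config/nvmet/ports/1/p2pmem
> ```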
>
> These patches have been tested on a number of Intel based systems and
> for a variety of RDMA NICs (Mellanox, Broadcom, Chelsio) and NVMe
> SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
> Microsemi, Chelsio and Everspin) using switches from both Microsemi
> and Broadcom.
>
> --
>
> Logan Gunthorpe (13):
> PCI/P2PDMA: Support peer-to-peer memory
> PCI/P2PDMA: Add sysfs group to display p2pmem stats
> PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset
> PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers
> docs-rst: Add a new directory for PCI documentation
> PCI/P2PDMA: Add P2P DMA driver writer's documentation
> block: Add PCI P2P flag for request queue and check support for
> requests
> IB/core: Ensure we map P2P memory correctly in
> rdma_rw_ctx_[init|destroy]()
> nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> nvme-pci: Add support for P2P memory in requests
> nvme-pci: Add a quirk for a pseudo CMB
> nvmet: Introduce helper functions to allocate and free request SGLs
> nvmet: Optionally use PCI P2P memory
>
> Documentation/ABI/testing/sysfs-bus-pci | 24 +
> Documentation/driver-api/index.rst | 2 +-
> Documentation/driver-api/pci/index.rst | 21 +
> Documentation/driver-api/pci/p2pdma.rst | 145 ++++
> Documentation/driver-api/{ => pci}/pci.rst | 0
> drivers/infiniband/core/rw.c | 11 +-
> drivers/nvme/host/core.c | 4 +
> drivers/nvme/host/nvme.h | 8 +
> drivers/nvme/host/pci.c | 121 +--
> drivers/nvme/target/configfs.c | 45 ++
> drivers/nvme/target/core.c | 180 +++++
> drivers/nvme/target/io-cmd-bdev.c | 3 +
> drivers/nvme/target/nvmet.h | 17 +
> drivers/nvme/target/rdma.c | 22 +-
> drivers/pci/Kconfig | 17 +
> drivers/pci/Makefile | 1 +
> drivers/pci/p2pdma.c | 809 +++++++++++++++++++++
> include/linux/blkdev.h | 3 +
> include/linux/memremap.h | 6 +
> include/linux/mm.h | 18 +
> include/linux/pci-p2pdma.h | 114 +++
> include/linux/pci.h | 4 +
> 22 files changed, 1521 insertions(+), 54 deletions(-)
> create mode 100644 Documentation/driver-api/pci/index.rst
> create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> create mode 100644 drivers/pci/p2pdma.c
> create mode 100644 include/linux/pci-p2pdma.h
I added the reviewed-by tags from Christoph, Jens' ack on the blkdev.h
change, and applied these to pci/peer-to-peer with the intent of
merging these for v4.20.
I gave up on waiting for an ack for the memremap.h and mm.h changes.
I dropped the "nvme-pci: Add a quirk for a pseudo CMB" quirk because
of Christoph's objection. After this is all merged, I won't need to
be involved, and you and the NVMe folks can hash that out.
If there are updates to "nvmet: Optionally use PCI P2P memory" based
on Sagi's comments, send an incremental patch and I'll fold them in.
Bjorn
2018-10-04 21:27 [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory Logan Gunthorpe
2018-10-04 21:27 ` [PATCH v9 01/13] PCI/P2PDMA: Support peer-to-peer memory Logan Gunthorpe
2018-10-04 21:27 ` [PATCH v9 02/13] PCI/P2PDMA: Add sysfs group to display p2pmem stats Logan Gunthorpe
2018-10-05 7:08 ` Christoph Hellwig
2018-10-04 21:27 ` [PATCH v9 03/13] PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset Logan Gunthorpe
2018-10-05 7:08 ` Christoph Hellwig
2018-10-04 21:27 ` [PATCH v9 04/13] PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers Logan Gunthorpe
2018-10-05 7:08 ` Christoph Hellwig
2018-10-04 21:27 ` [PATCH v9 05/13] docs-rst: Add a new directory for PCI documentation Logan Gunthorpe
2018-10-04 21:27 ` [PATCH v9 06/13] PCI/P2PDMA: Add P2P DMA driver writer's documentation Logan Gunthorpe
2018-10-04 21:27 ` [PATCH v9 07/13] block: Add PCI P2P flag for request queue and check support for requests Logan Gunthorpe
2018-10-05 7:09 ` Christoph Hellwig
2018-10-06 1:16 ` Jens Axboe
2018-10-10 19:59 ` Bjorn Helgaas
2018-10-10 20:00 ` Jens Axboe
2018-10-04 21:27 ` [PATCH v9 08/13] IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]() Logan Gunthorpe
2018-10-05 7:09 ` Christoph Hellwig
2018-10-04 21:27 ` [PATCH v9 09/13] nvme-pci: Use PCI p2pmem subsystem to manage the CMB Logan Gunthorpe
2018-10-05 7:09 ` Christoph Hellwig
2018-10-04 21:27 ` [PATCH v9 10/13] nvme-pci: Add support for P2P memory in requests Logan Gunthorpe
2018-10-04 21:27 ` [PATCH v9 11/13] nvme-pci: Add a quirk for a pseudo CMB Logan Gunthorpe
2018-10-05 7:10 ` Christoph Hellwig
2018-10-05 15:39 ` Logan Gunthorpe
2018-10-04 21:27 ` [PATCH v9 12/13] nvmet: Introduce helper functions to allocate and free request SGLs Logan Gunthorpe
2018-10-05 7:11 ` Christoph Hellwig
2018-10-04 21:27 ` [PATCH v9 13/13] nvmet: Optionally use PCI P2P memory Logan Gunthorpe
2018-10-04 22:20 ` Sagi Grimberg
2018-10-04 22:29 ` Logan Gunthorpe
2018-10-05 7:07 ` Christoph Hellwig
2018-10-05 7:34 ` Sagi Grimberg
2018-10-05 15:42 ` Logan Gunthorpe
2018-10-10 20:19 ` Bjorn Helgaas [this message]
2018-10-10 23:03 ` [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory Logan Gunthorpe
2018-10-11 3:18 ` Bjorn Helgaas
2018-10-11 15:38 ` Logan Gunthorpe