* [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Hi Everyone,

Here's v4 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.17-rc2. A git repo
is here:

https://github.com/sbates130272/linux-p2pmem pci-p2p-v4

Thanks,

Logan

Changes in v4:

* Change the original upstream_bridges_match() function to
  upstream_bridge_distance() which calculates the distance between two
  devices as long as they are behind the same root port. This should
  address Bjorn's concerns that the code was too focused on
  being behind a single switch.

* The disable ACS function now disables ACS for all bridge ports instead
  of only switch ports (i.e. those that had two upstream bridge ports).

* Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
  API to be more like sgl_alloc() in that the alloc function returns
  the allocated scatterlist and nents is not required by the free
  function.

* Moved the new documentation into the driver-api tree as requested
  by Jonathan.

* Add SGL alloc and free helpers in the nvmet code so that the
  individual drivers can share the code that allocates P2P memory,
  as requested by Christoph.

* Clean up the nvmet_p2pmem_store() function, as Christoph
  thought my first attempt was ugly.

* Numerous commit message and comment fix-ups

Changes in v3:

* Many more fixes and minor cleanups that were spotted by Bjorn

* Additional explanation of the ACS change in both the commit message
  and Kconfig doc. Also, the code that disables the ACS bits is surrounded
  explicitly by an #ifdef

* Removed the flag we added to rdma_rw_ctx() in favour of using
  is_pci_p2pdma_page(), as suggested by Sagi.

* Adjust pci_p2pmem_find() so that it prefers P2P providers that
  are closest to (or the same as) the clients using them. In cases
  of ties, the provider is randomly chosen.

* Modify the NVMe Target code so that the PCI device name of the provider
  may be explicitly specified, bypassing the logic in pci_p2pmem_find().
  (Note: it's still enforced that the provider must be behind the
   same switch as the clients).

* As requested by Bjorn, added documentation for driver writers.


Changes in v2:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
  as a bunch of cleanup and spelling fixes he pointed out in the last
  series.

* To address Alex's ACS concerns, we change to a simpler method of
  just disabling ACS behind switches for any kernel that has
  CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
  fairly simply handle Jason's concerns that this work might break with
  the HFI, QIB and rxe drivers that use the virtual ops to implement
  their own special DMA operations.

--

This is a continuation of our work to enable using Peer-to-Peer PCI
memory in the kernel with initial support for the NVMe fabrics target
subsystem. Many thanks go to Christoph Hellwig who provided valuable
feedback to get these patches to where they are today.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device, avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU that are using system memory, and
lower PCI bandwidth required to the CPU (such that systems could be
designed with fewer lanes connected to the CPU).

Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch hierarchy. This means many setups that
could likely work well will not be supported, so that we can be more
confident the feature will work without placing any responsibility on
the user to understand their topology. (We chose to go this route based
on feedback we received at the last LSF.) Future work may enable these
transfers using a whitelist of known-good root complexes. However, at this time,
there is no reliable way to ensure that Peer-to-Peer transactions are
permitted between PCI Root Ports.

In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
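
For example, a provider driver would register a BAR (or part of one) roughly
as follows. This is only a simplified sketch: the probe function name and the
BAR number are placeholders, error handling is trimmed, and publishing the
memory so that clients can find it is a separate step not shown here.

static int example_probe(struct pci_dev *pdev,
			 const struct pci_device_id *id)
{
	int rc;

	/*
	 * Expose all of BAR 4 as p2pmem; a size of 0 means "the whole
	 * BAR". The ZONE_DEVICE struct pages are created internally via
	 * devm_memremap_pages().
	 */
	rc = pci_p2pdma_add_resource(pdev, 4, 0, 0);
	if (rc)
		return rc;

	/* ... the rest of the normal probe sequence ... */
	return 0;
}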

When the PCI P2PDMA config option is selected, the ACS bits in every
bridge port in the system are turned off to allow traffic to
pass freely behind the root port. At this time, the bit must be disabled
at boot so the IOMMU subsystem can correctly create the groups, though
this could be addressed in the future. There is no way to dynamically
disable the bit and alter the groups.

Another set of functions allows a client driver to create a list of
client devices that will be used in a given P2P transaction and then
use that list to find any P2P memory that is supported by all the
client devices.
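
As a rough sketch of the client side (locking and error handling trimmed;
the 'nic' and 'nvme_dev' pointers are just stand-ins for the real client
devices):

LIST_HEAD(p2p_clients);
struct pci_dev *p2p_dev;

/* Every device that will touch the P2P buffers goes on the list */
if (pci_p2pdma_add_client(&p2p_clients, &nic->dev))
	goto fallback;
if (pci_p2pdma_add_client(&p2p_clients, &nvme_dev->dev))
	goto fallback;

/* Pick a p2pmem provider usable by all clients (a reference is taken) */
p2p_dev = pci_p2pmem_find(&p2p_clients);
if (!p2p_dev)
	goto fallback;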

In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked to not be merged, since a non-homogeneous request would
complicate the DMA mapping requirements.
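
As a rough illustration only (the flag names below are placeholders for the
new request and queue flags described above, not necessarily the final
identifiers):

/* Only build a P2P-backed request for a queue that advertises support */
if (test_bit(QUEUE_FLAG_PCI_P2PDMA, &q->queue_flags)) {
	req->cmd_flags |= REQ_PCI_P2PDMA;
	/* such requests are also marked un-mergeable */
	req->cmd_flags |= REQ_NOMERGE;
}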

In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received, it uses the
pci_p2pmem_map_sg() function, which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.

In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to detect P2P-backed memory (via
is_pci_p2pdma_page()) and use the PCI P2P mapping functions when
appropriate. To avoid odd RDMA devices that don't use the proper DMA
infrastructure, this code rejects using any device that employs the
dma_virt_ops implementation.

Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.

These patches have been tested on a number of Intel-based systems and
with a variety of RDMA NICs (Mellanox, Broadcom, Chelsio) and NVMe
SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
Microsemi, Chelsio and Everspin) using switches from both Microsemi
and Broadcom.

Logan Gunthorpe (14):
  PCI/P2PDMA: Support peer-to-peer memory
  PCI/P2PDMA: Add sysfs group to display p2pmem stats
  PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
  PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  docs-rst: Add a new directory for PCI documentation
  PCI/P2PDMA: Add P2P DMA driver writer's documentation
  block: Introduce PCI P2P flags for request and request queue
  IB/core: Ensure we map P2P memory correctly in
    rdma_rw_ctx_[init|destroy]()
  nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  nvme-pci: Add support for P2P memory in requests
  nvme-pci: Add a quirk for a pseudo CMB
  nvmet: Introduce helper functions to allocate and free request SGLs
  nvmet-rdma: Use new SGL alloc/free helper for requests
  nvmet: Optionally use PCI P2P memory

 Documentation/ABI/testing/sysfs-bus-pci    |  25 +
 Documentation/PCI/index.rst                |  14 +
 Documentation/driver-api/index.rst         |   2 +-
 Documentation/driver-api/pci/index.rst     |  20 +
 Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
 Documentation/driver-api/{ => pci}/pci.rst |   0
 Documentation/index.rst                    |   3 +-
 block/blk-core.c                           |   3 +
 drivers/infiniband/core/rw.c               |  13 +-
 drivers/nvme/host/core.c                   |   4 +
 drivers/nvme/host/nvme.h                   |   8 +
 drivers/nvme/host/pci.c                    | 118 +++--
 drivers/nvme/target/configfs.c             |  67 +++
 drivers/nvme/target/core.c                 | 143 ++++-
 drivers/nvme/target/io-cmd.c               |   3 +
 drivers/nvme/target/nvmet.h                |  15 +
 drivers/nvme/target/rdma.c                 |  22 +-
 drivers/pci/Kconfig                        |  26 +
 drivers/pci/Makefile                       |   1 +
 drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
 drivers/pci/pci.c                          |   6 +
 include/linux/blk_types.h                  |  18 +-
 include/linux/blkdev.h                     |   3 +
 include/linux/memremap.h                   |  19 +
 include/linux/pci-p2pdma.h                 | 118 +++++
 include/linux/pci.h                        |   4 +
 26 files changed, 1579 insertions(+), 56 deletions(-)
 create mode 100644 Documentation/PCI/index.rst
 create mode 100644 Documentation/driver-api/pci/index.rst
 create mode 100644 Documentation/driver-api/pci/p2pdma.rst
 rename Documentation/driver-api/{ => pci}/pci.rst (100%)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

--
2.11.0

* [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.

Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pdma_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.
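
For illustration, once a suitable provider has been found, allocation and use
look roughly like this (a sketch only; error handling is trimmed and the
transfer itself is elided):

void *buf;
pci_bus_addr_t bus_addr;

buf = pci_alloc_p2pmem(p2p_dev, SZ_4K);
if (!buf)
	return -ENOMEM;

/* DMA engines are programmed with the PCI bus address of the buffer */
bus_addr = pci_p2pmem_virt_to_bus(p2p_dev, buf);

/* ... carry out the transfer ... */

pci_free_p2pmem(p2p_dev, buf, SZ_4K);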

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCI switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same root port (typically through
a network of PCIe switches). This is because we have no way of knowing
whether peer-to-peer routing between PCIe Root Ports is supported
(PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P transfers that
go through the RC are limited to only reducing DRAM usage and, in some
cases, coding convenience. The PCI-SIG may be exploring adding a new
capability bit to advertise whether this is possible for future
hardware.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig        |  17 ++
 drivers/pci/Makefile       |   1 +
 drivers/pci/p2pdma.c       | 694 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h   |  18 ++
 include/linux/pci-p2pdma.h | 100 +++++++
 include/linux/pci.h        |   4 +
 6 files changed, 834 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..b2396c22b53e 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,23 @@ config PCI_PASID
 
 	  If unsure, say N.
 
+config PCI_P2PDMA
+	bool "PCI peer-to-peer transfer support"
+	depends on PCI && ZONE_DEVICE && EXPERT
+	select GENERIC_ALLOCATOR
+	help
+	  Enables drivers to do PCI peer-to-peer transactions to and from
+	  BARs that are exposed in other devices that are part of
+	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+	  specification to work (ie. anything below a single PCI bridge).
+
+	  Many PCIe root complexes do not support P2P transactions and
+	  it's hard to tell which support it at all, so at this time, DMA
+	  transactions must be between devices behind the same root port.
+	  (Typically behind a network of PCIe switches).
+
+	  If unsure, say N.
+
 config PCI_LABEL
 	def_bool y if (DMI || ACPI)
 	depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 952addc7bacf..050c1e19a1de 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_X86_INTEL_MID)	+= pci-mid.o
 obj-$(CONFIG_PCI_SYSCALL)	+= syscall.o
 obj-$(CONFIG_PCI_STUB)		+= pci-stub.o
 obj-$(CONFIG_PCI_ECAM)		+= ecam.o
+obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 
 obj-y				+= host/
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..e524a12eca1f
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,694 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+#include <linux/random.h>
+
+struct pci_p2pdma {
+	struct percpu_ref devmap_ref;
+	struct completion devmap_ref_done;
+	struct gen_pool *pool;
+	bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+	struct pci_p2pdma *p2p =
+		container_of(ref, struct pci_p2pdma, devmap_ref);
+
+	complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+
+	if (percpu_ref_is_dying(ref))
+		return;
+
+	percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+	struct pci_dev *pdev = data;
+
+	if (!pdev->p2pdma)
+		return;
+
+	wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+	gen_pool_destroy(pdev->p2pdma->pool);
+	pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+	int error = -ENOMEM;
+	struct pci_p2pdma *p2p;
+
+	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+	if (!p2p)
+		return -ENOMEM;
+
+	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2p->pool)
+		goto out;
+
+	init_completion(&p2p->devmap_ref_done);
+	error = percpu_ref_init(&p2p->devmap_ref,
+			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+	if (error)
+		goto out_pool_destroy;
+
+	percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (error)
+		goto out_pool_destroy;
+
+	pdev->p2pdma = p2p;
+
+	return 0;
+
+out_pool_destroy:
+	gen_pool_destroy(p2p->pool);
+out:
+	devm_kfree(&pdev->dev, p2p);
+	return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any DMA request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+			    u64 offset)
+{
+	struct dev_pagemap *pgmap;
+	void *addr;
+	int error;
+
+	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	if (offset >= pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!size)
+		size = pci_resource_len(pdev, bar) - offset;
+
+	if (size + offset > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!pdev->p2pdma) {
+		error = pci_p2pdma_setup(pdev);
+		if (error)
+			return error;
+	}
+
+	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+	if (!pgmap)
+		return -ENOMEM;
+
+	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+	pgmap->res.end = pgmap->res.start + size - 1;
+	pgmap->res.flags = pci_resource_flags(pdev, bar);
+	pgmap->ref = &pdev->p2pdma->devmap_ref;
+	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+	addr = devm_memremap_pages(&pdev->dev, pgmap);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
+		goto pgmap_free;
+	}
+
+	error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
+			pci_bus_address(pdev, bar) + offset,
+			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+	if (error)
+		goto pgmap_free;
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+					  &pdev->p2pdma->devmap_ref);
+	if (error)
+		goto pgmap_free;
+
+	pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
+		 &pgmap->res);
+
+	return 0;
+
+pgmap_free:
+	devres_free(pgmap);
+	return error;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+	struct device *parent;
+
+	dev = get_device(dev);
+
+	while (dev) {
+		if (dev_is_pci(dev))
+			return to_pci_dev(dev);
+
+		parent = get_device(dev->parent);
+		put_device(dev);
+		dev = parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+	struct pci_dev *up1, *up2;
+
+	if (!pdev)
+		return NULL;
+
+	up1 = pci_dev_get(pci_upstream_bridge(pdev));
+	if (!up1)
+		return NULL;
+
+	up2 = pci_dev_get(pci_upstream_bridge(up1));
+	pci_dev_put(up1);
+
+	return up2;
+}
+
+/*
+ * Find the distance through the nearest common upstream bridge between
+ * two PCI devices.
+ *
+ * If the two devices are the same device then 0 will be returned.
+ *
+ * If there are two virtual functions of the same device behind the same
+ * bridge port then 2 will be returned (one step down to the bridge then
+ * one step back to the same device).
+ *
+ * In the case where two devices are connected to the same PCIe switch, the
+ * value 4 will be returned. This corresponds to the following PCI tree:
+ *
+ *     -+  Root Port
+ *      \+ Switch Upstream Port
+ *       +-+ Switch Downstream Port
+ *       + \- Device A
+ *       \-+ Switch Downstream Port
+ *         \- Device B
+ *
+ * The distance is 4 because we traverse from Device A through the downstream
+ * port of the switch, to the common upstream port, back up to the second
+ * downstream port and then to Device B.
+ *
+ * Any two devices that don't have a common upstream bridge will return -1.
+ * In this way devices on separate root ports will be rejected, which
+ * is what we want for peer-to-peer since there's no way to determine
+ * if the root complex supports forwarding between root ports.
+ *
+ * In the case where two devices are connected to different PCIe switches
+ * this function will still return a positive distance as long as both
+ * switches eventually have a common upstream bridge. Note this covers
+ * the case of using multiple PCIe switches to achieve a desired level of
+ * fan-out from a root port. The exact distance will be a function of the
+ * number of switches between Device A and Device B.
+ *
+ */
+static int upstream_bridge_distance(struct pci_dev *a,
+				    struct pci_dev *b)
+{
+	int dist_a = 0;
+	int dist_b = 0;
+	struct pci_dev *aa, *bb = NULL, *tmp;
+
+	aa = pci_dev_get(a);
+
+	while (aa) {
+		dist_b = 0;
+
+		pci_dev_put(bb);
+		bb = pci_dev_get(b);
+
+		while (bb) {
+			if (aa == bb)
+				goto put_and_return;
+
+			tmp = pci_dev_get(pci_upstream_bridge(bb));
+			pci_dev_put(bb);
+			bb = tmp;
+
+			dist_b++;
+		}
+
+		tmp = pci_dev_get(pci_upstream_bridge(aa));
+		pci_dev_put(aa);
+		aa = tmp;
+
+		dist_a++;
+	}
+
+	dist_a = -1;
+	dist_b = 0;
+
+put_and_return:
+	pci_dev_put(bb);
+	pci_dev_put(aa);
+
+	return dist_a + dist_b;
+}
+
+struct pci_p2pdma_client {
+	struct list_head list;
+	struct pci_dev *client;
+	struct pci_dev *provider;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *item, *new_item;
+	struct pci_dev *provider = NULL;
+	struct pci_dev *client;
+	int ret;
+
+	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+		dev_warn(dev, "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+		return -ENODEV;
+	}
+
+
+	client = find_parent_pci_dev(dev);
+	if (!client) {
+		dev_warn(dev, "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+		return -ENODEV;
+	}
+
+	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+	if (item && item->provider) {
+		provider = item->provider;
+
+		if (upstream_bridge_distance(provider, client) < 0) {
+			dev_warn(dev, "cannot be used for peer-to-peer DMA as the client and provider do not share an upstream bridge\n");
+
+			ret = -EXDEV;
+			goto put_client;
+		}
+	}
+
+	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+	if (!new_item) {
+		ret = -ENOMEM;
+		goto put_client;
+	}
+
+	new_item->client = client;
+	new_item->provider = pci_dev_get(provider);
+
+	list_add_tail(&new_item->list, head);
+
+	return 0;
+
+put_client:
+	pci_dev_put(client);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+	list_del(&item->list);
+	pci_dev_put(item->client);
+	pci_dev_put(item->provider);
+	kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+	struct pci_dev *pdev;
+
+	pdev = find_parent_pci_dev(dev);
+	if (!pdev)
+		return;
+
+	list_for_each_entry_safe(pos, tmp, head, list) {
+		if (pos->client != pdev)
+			continue;
+
+		pci_p2pdma_client_free(pos);
+	}
+
+	pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, head, list)
+		pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+/**
+ * pci_p2pdma_distance - Determine the cumulative distance between
+ *	a p2pdma provider and the clients in use.
+ * @provider: p2pdma provider to check against the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns -1 if any of the clients are not compatible (i.e. not behind
+ * the same root port as the provider); otherwise returns a non-negative
+ * number where a lower number is the preferable choice. (If one client is
+ * the same as the provider it will return 0, which is the best choice.)
+ *
+ * For now, "compatible" means the provider and the clients are all behind
+ * the same PCI root port. This cuts out cases that may work but is safest
+ * for the user. Future work can expand this to whitelist root complexes
+ * that can safely forward between their ports.
+ */
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+	int ret;
+	int distance = 0;
+
+	if (list_empty(clients))
+		return -1;
+
+	list_for_each_entry(pos, clients, list) {
+		ret = upstream_bridge_distance(provider, pos->client);
+		if (ret < 0)
+			goto no_match;
+
+		distance += ret;
+	}
+
+	ret = distance;
+
+no_match:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_distance);
+
+/**
+ * pci_p2pdma_assign_provider - Check compatibility (as per pci_p2pdma_distance)
+ *	and assign a provider to a list of clients
+ * @provider: p2pdma provider to assign to the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns false if any of the clients are not compatible, true if the
+ * provider was successfully assigned to the clients.
+ */
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+
+	if (pci_p2pdma_distance(provider, clients) < 0)
+		return false;
+
+	list_for_each_entry(pos, clients, list)
+		pos->provider = provider;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_assign_provider);
+
+/**
+ * pci_has_p2pmem - check if a given PCI device has published any p2pmem
+ * @pdev: PCI device to check
+ */
+bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return pdev->p2pdma && pdev->p2pdma->p2pmem_published;
+}
+EXPORT_SYMBOL_GPL(pci_has_p2pmem);
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ *	the specified list of clients and shortest distance (as determined
+ *	by pci_p2pdma_distance())
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * If multiple devices are behind the same switch, the one "closest" to the
+ * client devices in use will be chosen first. (So if one of the providers is
+ * the same as one of the clients, that provider will be used ahead of any
+ * other providers that are unrelated). If multiple providers are an equal
+ * distance away, one will be chosen at random.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found. The
+ * found provider will also be assigned to the client list.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	struct pci_dev *pdev = NULL;
+	struct pci_p2pdma_client *pos;
+	int distance;
+	int closest_distance = INT_MAX;
+	struct pci_dev **closest_pdevs;
+	int dev_cnt = 0;
+	const int max_devs = PAGE_SIZE / sizeof(*closest_pdevs);
+	int i;
+
+	closest_pdevs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!closest_pdevs)
+		return NULL;
+
+	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+		if (!pci_has_p2pmem(pdev))
+			continue;
+
+		distance = pci_p2pdma_distance(pdev, clients);
+		if (distance < 0 || distance > closest_distance)
+			continue;
+
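+		/* Candidate array at this distance is already full; keep what we have */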
+		if (distance == closest_distance && dev_cnt >= max_devs)
+			continue;
+
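+		/* Found a closer provider: drop all previously collected candidates */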
+		if (distance < closest_distance) {
+			for (i = 0; i < dev_cnt; i++)
+				pci_dev_put(closest_pdevs[i]);
+
+			dev_cnt = 0;
+			closest_distance = distance;
+		}
+
+		closest_pdevs[dev_cnt++] = pci_dev_get(pdev);
+	}
+
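+	/* Choose one provider at random among those tied at the shortest distance */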
+	if (dev_cnt)
+		pdev = pci_dev_get(closest_pdevs[prandom_u32_max(dev_cnt)]);
+
+	for (i = 0; i < dev_cnt; i++)
+		pci_dev_put(closest_pdevs[i]);
+
+	if (pdev)
+		list_for_each_entry(pos, clients, list)
+			pos->provider = pdev;
+
+	kfree(closest_pdevs);
+	return pdev;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	void *ret;
+
+	if (unlikely(!pdev->p2pdma))
+		return NULL;
+
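+	/* Hold a devmap reference for the lifetime of the allocation */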
+	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+		return NULL;
+
+	ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+	if (unlikely(!ret))
+		percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - free peer-to-peer DMA memory
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+	percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
+ *	address obtained with pci_alloc_p2pmem()
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+	if (!addr)
+		return 0;
+	if (!pdev->p2pdma)
+		return 0;
+
+	/*
+	 * Note: when we added the memory to the pool we used the PCI
+	 * bus address as the physical address. So gen_pool_virt_to_phys()
+	 * actually returns the bus address despite the misleading name.
+	 */
+	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev: the device to allocate memory from
+ * @nents: set to the number of SG entries in the allocated list
+ * @length: number of bytes to allocate
+ *
+ * Returns the allocated scatterlist or NULL on error.
+ */
+struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+					 unsigned int *nents, u32 length)
+{
+	struct scatterlist *sg;
+	void *addr;
+
+	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+	if (!sg)
+		return NULL;
+
+	sg_init_table(sg, 1);
+
+	addr = pci_alloc_p2pmem(pdev, length);
+	if (!addr)
+		goto out_free_sg;
+
+	sg_set_buf(sg, addr, length);
+	*nents = 1;
+	return sg;
+
+out_free_sg:
+	kfree(sg);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
+ * @pdev: the device the memory was allocated from
+ * @sgl: the allocated scatterlist
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl)
+{
+	struct scatterlist *sg;
+	int count;
+
+	for_each_sg(sgl, sg, INT_MAX, count) {
+		if (!sg)
+			break;
+
+		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+	}
+	kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ *	other devices with pci_p2pmem_find()
+ * @pdev: the device with peer-to-peer DMA memory to publish
+ * @publish: set to true to publish the memory, false to unpublish it
+ *
+ * Published memory can be used by other PCI device drivers for
+ * peer-to-peer DMA operations. Non-published memory is reserved for the
+ * exclusive use of the device driver that registers the peer-to-peer
+ * memory.
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+	if (!pdev->p2pdma)
+		return;
+
+	pdev->p2pdma->p2pmem_published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_PCI_P2PDMA,
 };
 
 /*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
 static inline bool is_device_private_page(const struct page *page)
 {
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..80e931cb1235
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,100 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+		u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients);
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients);
+bool pci_has_p2pmem(struct pci_dev *pdev);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+					 unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+		size_t size, u64 offset)
+{
+	return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+		struct device *dev)
+{
+	return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+		struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline int pci_p2pdma_distance(struct pci_dev *provider,
+				      struct list_head *clients)
+{
+	return -1;
+}
+static inline bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+					      struct list_head *clients)
+{
+	return false;
+}
+static inline bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return false;
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+		size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+						    void *addr)
+{
+	return 0;
+}
+static inline struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+		unsigned int *nents, u32 length)
+{
+	return NULL;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+		struct scatterlist *sgl)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2PDMA_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 73178a2fcee0..005feaea8dca 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -277,6 +277,7 @@ struct pcie_link_state;
 struct pci_vpd;
 struct pci_sriov;
 struct pci_ats;
+struct pci_p2pdma;
 
 /* The pci_dev structure describes PCI devices */
 struct pci_dev {
@@ -430,6 +431,9 @@ struct pci_dev {
 #ifdef CONFIG_PCI_PASID
 	u16		pasid_features;
 #endif
+#ifdef CONFIG_PCI_P2PDMA
+	struct pci_p2pdma *p2pdma;
+#endif
 	phys_addr_t	rom;		/* Physical address if not from BAR */
 	size_t		romlen;		/* Length if not from BAR */
 	char		*driver_override; /* Driver name to force a match */
-- 
2.11.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 460+ messages in thread


* [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory
@ 2018-04-23 23:30   ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.
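
For example (an illustrative sketch only, not code from this series), a
driver that wants to expose part of one of its BARs as peer-to-peer
memory could register and publish it roughly like this, where the BAR
number is just a placeholder:

static int example_provider_setup(struct pci_dev *pdev)
{
	int err;

	/* Hand all of BAR 4 to the p2pdma allocator (size 0 == whole BAR) */
	err = pci_p2pdma_add_resource(pdev, 4, 0, 0);
	if (err)
		return err;

	/* Allow other drivers to find this memory with pci_p2pmem_find() */
	pci_p2pmem_publish(pdev, true);

	return 0;
}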

Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pmem_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.
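
As a rough illustration (a hypothetical sketch, not code from this
series), an orchestrating driver that holds 'struct device' pointers for
the two peers (named nvme_dev and rdma_dev below purely as placeholders)
might use the interface as follows:

static int example_use_p2pmem(struct device *nvme_dev, struct device *rdma_dev)
{
	LIST_HEAD(p2p_clients);
	struct pci_dev *p2p_dev;
	void *buf;
	int ret = -ENOMEM;

	if (pci_p2pdma_add_client(&p2p_clients, nvme_dev) ||
	    pci_p2pdma_add_client(&p2p_clients, rdma_dev))
		goto out;

	/* Takes a reference on the returned provider */
	p2p_dev = pci_p2pmem_find(&p2p_clients);
	if (!p2p_dev)
		goto out;

	buf = pci_alloc_p2pmem(p2p_dev, SZ_4K);
	if (buf) {
		/* Program DMA with pci_p2pmem_virt_to_bus(p2p_dev, buf)... */
		pci_free_p2pmem(p2p_dev, buf, SZ_4K);
		ret = 0;
	}

	pci_dev_put(p2p_dev);
out:
	pci_p2pdma_client_list_free(&p2p_clients);
	return ret;
}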

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCI switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same root port (typically through
a network of PCIe switches). This is because we have no way of knowing
whether peer-to-peer routing between PCIe Root Ports is supported
(PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P transfers that
go through the RC are limited to only reducing DRAM usage and, in some
cases, coding convenience. The PCI-SIG may be exploring adding a new
capability bit to advertise whether this is possible for future
hardware.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig        |  17 ++
 drivers/pci/Makefile       |   1 +
 drivers/pci/p2pdma.c       | 694 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h   |  18 ++
 include/linux/pci-p2pdma.h | 100 +++++++
 include/linux/pci.h        |   4 +
 6 files changed, 834 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..b2396c22b53e 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,23 @@ config PCI_PASID
 
 	  If unsure, say N.
 
+config PCI_P2PDMA
+	bool "PCI peer-to-peer transfer support"
+	depends on PCI && ZONE_DEVICE && EXPERT
+	select GENERIC_ALLOCATOR
+	help
+	  Enables drivers to do PCI peer-to-peer transactions to and from
+	  BARs that are exposed in other devices that are part of
+	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+	  specification to work (ie. anything below a single PCI bridge).
+
+	  Many PCIe root complexes do not support P2P transactions and
+	  it's hard to tell which ones support it at all, so at this time DMA
+	  transactions must be between devices behind the same root port.
+	  (Typically behind a network of PCIe switches).
+
+	  If unsure, say N.
+
 config PCI_LABEL
 	def_bool y if (DMI || ACPI)
 	depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 952addc7bacf..050c1e19a1de 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_X86_INTEL_MID)	+= pci-mid.o
 obj-$(CONFIG_PCI_SYSCALL)	+= syscall.o
 obj-$(CONFIG_PCI_STUB)		+= pci-stub.o
 obj-$(CONFIG_PCI_ECAM)		+= ecam.o
+obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 
 obj-y				+= host/
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..e524a12eca1f
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,694 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+#include <linux/random.h>
+
+struct pci_p2pdma {
+	struct percpu_ref devmap_ref;
+	struct completion devmap_ref_done;
+	struct gen_pool *pool;
+	bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+	struct pci_p2pdma *p2p =
+		container_of(ref, struct pci_p2pdma, devmap_ref);
+
+	complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+
+	if (percpu_ref_is_dying(ref))
+		return;
+
+	percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+	struct pci_dev *pdev = data;
+
+	if (!pdev->p2pdma)
+		return;
+
+	wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+	gen_pool_destroy(pdev->p2pdma->pool);
+	pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+	int error = -ENOMEM;
+	struct pci_p2pdma *p2p;
+
+	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+	if (!p2p)
+		return -ENOMEM;
+
+	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2p->pool)
+		goto out;
+
+	init_completion(&p2p->devmap_ref_done);
+	error = percpu_ref_init(&p2p->devmap_ref,
+			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+	if (error)
+		goto out_pool_destroy;
+
+	percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (error)
+		goto out_pool_destroy;
+
+	pdev->p2pdma = p2p;
+
+	return 0;
+
+out_pool_destroy:
+	gen_pool_destroy(p2p->pool);
+out:
+	devm_kfree(&pdev->dev, p2p);
+	return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any DMA request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+			    u64 offset)
+{
+	struct dev_pagemap *pgmap;
+	void *addr;
+	int error;
+
+	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	if (offset >= pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!size)
+		size = pci_resource_len(pdev, bar) - offset;
+
+	if (size + offset > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!pdev->p2pdma) {
+		error = pci_p2pdma_setup(pdev);
+		if (error)
+			return error;
+	}
+
+	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+	if (!pgmap)
+		return -ENOMEM;
+
+	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+	pgmap->res.end = pgmap->res.start + size - 1;
+	pgmap->res.flags = pci_resource_flags(pdev, bar);
+	pgmap->ref = &pdev->p2pdma->devmap_ref;
+	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+	addr = devm_memremap_pages(&pdev->dev, pgmap);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
+		goto pgmap_free;
+	}
+
+	error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
+			pci_bus_address(pdev, bar) + offset,
+			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+	if (error)
+		goto pgmap_free;
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+					  &pdev->p2pdma->devmap_ref);
+	if (error)
+		goto pgmap_free;
+
+	pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
+		 &pgmap->res);
+
+	return 0;
+
+pgmap_free:
+	devres_free(pgmap);
+	return error;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+	struct device *parent;
+
+	dev = get_device(dev);
+
+	while (dev) {
+		if (dev_is_pci(dev))
+			return to_pci_dev(dev);
+
+		parent = get_device(dev->parent);
+		put_device(dev);
+		dev = parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+	struct pci_dev *up1, *up2;
+
+	if (!pdev)
+		return NULL;
+
+	up1 = pci_dev_get(pci_upstream_bridge(pdev));
+	if (!up1)
+		return NULL;
+
+	up2 = pci_dev_get(pci_upstream_bridge(up1));
+	pci_dev_put(up1);
+
+	return up2;
+}
+
+/*
+ * Find the distance through the nearest common upstream bridge between
+ * two PCI devices.
+ *
+ * If the two devices are the same device then 0 will be returned.
+ *
+ * If there are two virtual functions of the same device behind the same
+ * bridge port then 2 will be returned (one step down to the bridge then
+ * one step back to the same device).
+ *
+ * In the case where two devices are connected to the same PCIe switch, the
+ * value 4 will be returned. This corresponds to the following PCI tree:
+ *
+ *     -+  Root Port
+ *      \+ Switch Upstream Port
+ *       +-+ Switch Downstream Port
+ *       + \- Device A
+ *       \-+ Switch Downstream Port
+ *         \- Device B
+ *
+ * The distance is 4 because we traverse from Device A through the downstream
+ * port of the switch, to the common upstream port, back up to the second
+ * downstream port and then to Device B.
+ *
+ * Any two devices that don't have a common upstream bridge will return -1.
+ * In this way devices on separate root ports will be rejected, which
+ * is what we want for peer-to-peer seeing there's no way to determine
+ * if the root complex supports forwarding between root ports.
+ *
+ * In the case where two devices are connected to different PCIe switches
+ * this function will still return a positive distance as long as both
+ * switches eventually have a common upstream bridge. Note this covers
+ * the case of using multiple PCIe switches to achieve a desired level of
+ * fan-out from a root port. The exact distance will be a function of the
+ * number of switches between Device A and Device B.
+ *
+ */
+static int upstream_bridge_distance(struct pci_dev *a,
+				    struct pci_dev *b)
+{
+	int dist_a = 0;
+	int dist_b = 0;
+	struct pci_dev *aa, *bb = NULL, *tmp;
+
+	aa = pci_dev_get(a);
+
+	while (aa) {
+		dist_b = 0;
+
+		pci_dev_put(bb);
+		bb = pci_dev_get(b);
+
+		while (bb) {
+			if (aa == bb)
+				goto put_and_return;
+
+			tmp = pci_dev_get(pci_upstream_bridge(bb));
+			pci_dev_put(bb);
+			bb = tmp;
+
+			dist_b++;
+		}
+
+		tmp = pci_dev_get(pci_upstream_bridge(aa));
+		pci_dev_put(aa);
+		aa = tmp;
+
+		dist_a++;
+	}
+
+	dist_a = -1;
+	dist_b = 0;
+
+put_and_return:
+	pci_dev_put(bb);
+	pci_dev_put(aa);
+
+	return dist_a + dist_b;
+}
+
+struct pci_p2pdma_client {
+	struct list_head list;
+	struct pci_dev *client;
+	struct pci_dev *provider;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *item, *new_item;
+	struct pci_dev *provider = NULL;
+	struct pci_dev *client;
+	int ret;
+
+	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+		dev_warn(dev, "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+		return -ENODEV;
+	}
+
+
+	client = find_parent_pci_dev(dev);
+	if (!client) {
+		dev_warn(dev, "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+		return -ENODEV;
+	}
+
+	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+	if (item && item->provider) {
+		provider = item->provider;
+
+		if (upstream_bridge_distance(provider, client) < 0) {
+			dev_warn(dev, "cannot be used for peer-to-peer DMA as the client and provider do not share an upstream bridge\n");
+
+			ret = -EXDEV;
+			goto put_client;
+		}
+	}
+
+	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+	if (!new_item) {
+		ret = -ENOMEM;
+		goto put_client;
+	}
+
+	new_item->client = client;
+	new_item->provider = pci_dev_get(provider);
+
+	list_add_tail(&new_item->list, head);
+
+	return 0;
+
+put_client:
+	pci_dev_put(client);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+	list_del(&item->list);
+	pci_dev_put(item->client);
+	pci_dev_put(item->provider);
+	kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+	struct pci_dev *pdev;
+
+	pdev = find_parent_pci_dev(dev);
+	if (!pdev)
+		return;
+
+	list_for_each_entry_safe(pos, tmp, head, list) {
+		if (pos->client != pdev)
+			continue;
+
+		pci_p2pdma_client_free(pos);
+	}
+
+	pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, head, list)
+		pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+/**
+ * pci_p2pdma_distance - Determine the cumulative distance between
+ *	a p2pdma provider and the clients in use.
+ * @provider: p2pdma provider to check against the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns -1 if any of the clients are not compatible (i.e. not behind
+ * the same root port as the provider), otherwise returns a non-negative
+ * number where a lower number is the preferable choice. (If one of the
+ * clients is the same device as the provider it will return 0, which is
+ * the best choice).
+ *
+ * For now, "compatible" means the provider and the clients are all behind
+ * the same PCI root port. This rules out cases that might work but is the
+ * safest choice for the user. Future work can expand this to a white-list of
+ * root complexes that can safely forward between their ports.
+ */
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+	int ret;
+	int distance = 0;
+
+	if (list_empty(clients))
+		return -1;
+
+	list_for_each_entry(pos, clients, list) {
+		ret = upstream_bridge_distance(provider, pos->client);
+		if (ret < 0)
+			goto no_match;
+
+		distance += ret;
+	}
+
+	ret = distance;
+
+no_match:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_distance);
+
+/**
+ * pci_p2pdma_assign_provider - Check compatibility (as per pci_p2pdma_distance)
+ *	and assign a provider to a list of clients
+ * @provider: p2pdma provider to assign to the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns false if any of the clients are not compatible, true if the
+ * provider was successfully assigned to the clients.
+ */
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+
+	if (pci_p2pdma_distance(provider, clients) < 0)
+		return false;
+
+	list_for_each_entry(pos, clients, list)
+		pos->provider = provider;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_assign_provider);
+
+/**
+ * pci_has_p2pmem - check if a given PCI device has published any p2pmem
+ * @pdev: PCI device to check
+ */
+bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return pdev->p2pdma && pdev->p2pdma->p2pmem_published;
+}
+EXPORT_SYMBOL_GPL(pci_has_p2pmem);
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ *	the specified list of clients and the shortest distance (as determined
+ *	by pci_p2pdma_distance())
+ * @clients: list of client devices to check
+ *
+ * If multiple devices are behind the same switch, the one "closest" to the
+ * client devices in use will be chosen first. (So if one of the providers is
+ * the same as one of the clients, that provider will be used ahead of any
+ * other providers that are unrelated). If multiple providers are an equal
+ * distance away, one will be chosen at random.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found. The
+ * found provider will also be assigned to the client list.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	struct pci_dev *pdev = NULL;
+	struct pci_p2pdma_client *pos;
+	int distance;
+	int closest_distance = INT_MAX;
+	struct pci_dev **closest_pdevs;
+	int dev_cnt = 0;
+	const int max_devs = PAGE_SIZE / sizeof(*closest_pdevs);
+	int i;
+
+	closest_pdevs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+
+	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+		if (!pci_has_p2pmem(pdev))
+			continue;
+
+		distance = pci_p2pdma_distance(pdev, clients);
+		if (distance < 0 || distance > closest_distance)
+			continue;
+
+		if (distance == closest_distance && dev_cnt >= max_devs)
+			continue;
+
+		if (distance < closest_distance) {
+			for (i = 0; i < dev_cnt; i++)
+				pci_dev_put(closest_pdevs[i]);
+
+			dev_cnt = 0;
+			closest_distance = distance;
+		}
+
+		closest_pdevs[dev_cnt++] = pci_dev_get(pdev);
+	}
+
+	if (dev_cnt)
+		pdev = pci_dev_get(closest_pdevs[prandom_u32_max(dev_cnt)]);
+
+	for (i = 0; i < dev_cnt; i++)
+		pci_dev_put(closest_pdevs[i]);
+
+	if (pdev)
+		list_for_each_entry(pos, clients, list)
+			pos->provider = pdev;
+
+	kfree(closest_pdevs);
+	return pdev;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	void *ret;
+
+	if (unlikely(!pdev->p2pdma))
+		return NULL;
+
+	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+		return NULL;
+
+	ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+	if (unlikely(!ret))
+		percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - free peer-to-peer DMA memory
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+	percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
+ *	address obtained with pci_alloc_p2pmem()
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+	if (!addr)
+		return 0;
+	if (!pdev->p2pdma)
+		return 0;
+
+	/*
+	 * Note: when we added the memory to the pool we used the PCI
+	 * bus address as the physical address. So gen_pool_virt_to_phys()
+	 * actually returns the bus address despite the misleading name.
+	 */
+	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev: the device to allocate memory from
+ * @nents: returns the number of SG entries in the allocated list
+ * @length: number of bytes to allocate
+ *
+ * Returns the allocated scatterlist or NULL on error; free it with
+ * pci_p2pmem_free_sgl() when it is no longer needed.
+ */
+struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+					 unsigned int *nents, u32 length)
+{
+	struct scatterlist *sg;
+	void *addr;
+
+	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+	if (!sg)
+		return NULL;
+
+	sg_init_table(sg, 1);
+
+	addr = pci_alloc_p2pmem(pdev, length);
+	if (!addr)
+		goto out_free_sg;
+
+	sg_set_buf(sg, addr, length);
+	*nents = 1;
+	return sg;
+
+out_free_sg:
+	kfree(sg);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
+ * @pdev: the device the memory was allocated from
+ * @sgl: the allocated scatterlist
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl)
+{
+	struct scatterlist *sg;
+	int count;
+
+	for_each_sg(sgl, sg, INT_MAX, count) {
+		if (!sg)
+			break;
+
+		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+	}
+	kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ *	other devices with pci_p2pmem_find()
+ * @pdev: the device with peer-to-peer DMA memory to publish
+ * @publish: set to true to publish the memory, false to unpublish it
+ *
+ * Published memory can be used by other PCI device drivers for
+ * peer-to-peer DMA operations. Non-published memory is reserved for the
+ * exclusive use of the device driver that registers the peer-to-peer
+ * memory.
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+	if (!pdev->p2pdma)
+		return;
+
+	pdev->p2pdma->p2pmem_published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_PCI_P2PDMA,
 };
 
 /*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
 static inline bool is_device_private_page(const struct page *page)
 {
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..80e931cb1235
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,100 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+		u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients);
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients);
+bool pci_has_p2pmem(struct pci_dev *pdev);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+					 unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+		size_t size, u64 offset)
+{
+	return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+		struct device *dev)
+{
+	return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+		struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline int pci_p2pdma_distance(struct pci_dev *provider,
+				      struct list_head *clients)
+{
+	return -1;
+}
+static inline bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+					      struct list_head *clients)
+{
+	return false;
+}
+static inline bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return false;
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+		size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+						    void *addr)
+{
+	return 0;
+}
+static inline struct scatterlist * pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+		unsigned int *nents, u32 length)
+{
+	return NULL;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+		struct scatterlist *sgl)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2P_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 73178a2fcee0..005feaea8dca 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -277,6 +277,7 @@ struct pcie_link_state;
 struct pci_vpd;
 struct pci_sriov;
 struct pci_ats;
+struct pci_p2pdma;
 
 /* The pci_dev structure describes PCI devices */
 struct pci_dev {
@@ -430,6 +431,9 @@ struct pci_dev {
 #ifdef CONFIG_PCI_PASID
 	u16		pasid_features;
 #endif
+#ifdef CONFIG_PCI_P2PDMA
+	struct pci_p2pdma *p2pdma;
+#endif
 	phys_addr_t	rom;		/* Physical address if not from BAR */
 	size_t		romlen;		/* Length if not from BAR */
 	char		*driver_override; /* Driver name to force a match */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory
@ 2018-04-23 23:30   ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)


Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.

Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pdma_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.
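
As a rough sketch of the intended flow (illustrative only, not part of
the patch; nvme_dev and rdma_dev stand in for whatever client devices a
real consumer has, and error handling is omitted):

  LIST_HEAD(p2p_clients);
  struct pci_dev *p2p_dev;
  void *buf;

  /* Collect every device that will touch the buffer. */
  pci_p2pdma_add_client(&p2p_clients, &nvme_dev->dev);
  pci_p2pdma_add_client(&p2p_clients, &rdma_dev->dev);

  /* Find a published p2pmem device that all the clients can reach. */
  p2p_dev = pci_p2pmem_find(&p2p_clients);
  if (p2p_dev) {
          buf = pci_alloc_p2pmem(p2p_dev, SZ_4K);
          /* ... DMA to/from buf ... */
          pci_free_p2pmem(p2p_dev, buf, SZ_4K);
          pci_dev_put(p2p_dev);
  }
  pci_p2pdma_client_list_free(&p2p_clients);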

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCI switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same root port (typically through
a network of PCIe switches). This is because we have no way of knowing
whether peer-to-peer routing between PCIe Root Ports is supported
(PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P transfers that
go through the RC are limited to only reducing DRAM usage and, in some
cases, coding convenience. The PCI-SIG may be exploring adding a new
capability bit to advertise whether this is possible for future
hardware.
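
To make this policy concrete, a minimal sketch (hypothetical caller;
'provider' is a struct pci_dev and 'clients' a struct list_head built
up with pci_p2pdma_add_client()):

  /* A negative distance means at least one client is not behind the
   * same root port as the provider, so the caller must fall back to
   * regular system memory instead of P2P memory. */
  if (pci_p2pdma_distance(provider, &clients) < 0)
          use_p2pmem = false;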

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Logan Gunthorpe <logang at deltatee.com>
---
 drivers/pci/Kconfig        |  17 ++
 drivers/pci/Makefile       |   1 +
 drivers/pci/p2pdma.c       | 694 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h   |  18 ++
 include/linux/pci-p2pdma.h | 100 +++++++
 include/linux/pci.h        |   4 +
 6 files changed, 834 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..b2396c22b53e 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,23 @@ config PCI_PASID
 
 	  If unsure, say N.
 
+config PCI_P2PDMA
+	bool "PCI peer-to-peer transfer support"
+	depends on PCI && ZONE_DEVICE && EXPERT
+	select GENERIC_ALLOCATOR
+	help
+	  Enables drivers to do PCI peer-to-peer transactions to and from
+	  BARs that are exposed in other devices that are part of
+	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+	  specification to work (ie. anything below a single PCI bridge).
+
+	  Many PCIe root complexes do not support P2P transactions and
+	  it's hard to tell which ones support it at all, so at this time,
+	  DMA transactions must be between devices behind the same root port.
+	  (Typically behind a network of PCIe switches).
+
+	  If unsure, say N.
+
 config PCI_LABEL
 	def_bool y if (DMI || ACPI)
 	depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 952addc7bacf..050c1e19a1de 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_X86_INTEL_MID)	+= pci-mid.o
 obj-$(CONFIG_PCI_SYSCALL)	+= syscall.o
 obj-$(CONFIG_PCI_STUB)		+= pci-stub.o
 obj-$(CONFIG_PCI_ECAM)		+= ecam.o
+obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 
 obj-y				+= host/
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..e524a12eca1f
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,694 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+#include <linux/random.h>
+
+struct pci_p2pdma {
+	struct percpu_ref devmap_ref;
+	struct completion devmap_ref_done;
+	struct gen_pool *pool;
+	bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+	struct pci_p2pdma *p2p =
+		container_of(ref, struct pci_p2pdma, devmap_ref);
+
+	complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+
+	if (percpu_ref_is_dying(ref))
+		return;
+
+	percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+	struct pci_dev *pdev = data;
+
+	if (!pdev->p2pdma)
+		return;
+
+	wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+	gen_pool_destroy(pdev->p2pdma->pool);
+	pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+	int error = -ENOMEM;
+	struct pci_p2pdma *p2p;
+
+	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+	if (!p2p)
+		return -ENOMEM;
+
+	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2p->pool)
+		goto out;
+
+	init_completion(&p2p->devmap_ref_done);
+	error = percpu_ref_init(&p2p->devmap_ref,
+			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+	if (error)
+		goto out_pool_destroy;
+
+	percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (error)
+		goto out_pool_destroy;
+
+	pdev->p2pdma = p2p;
+
+	return 0;
+
+out_pool_destroy:
+	gen_pool_destroy(p2p->pool);
+out:
+	devm_kfree(&pdev->dev, p2p);
+	return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any DMA request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+			    u64 offset)
+{
+	struct dev_pagemap *pgmap;
+	void *addr;
+	int error;
+
+	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	if (offset >= pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!size)
+		size = pci_resource_len(pdev, bar) - offset;
+
+	if (size + offset > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!pdev->p2pdma) {
+		error = pci_p2pdma_setup(pdev);
+		if (error)
+			return error;
+	}
+
+	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+	if (!pgmap)
+		return -ENOMEM;
+
+	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+	pgmap->res.end = pgmap->res.start + size - 1;
+	pgmap->res.flags = pci_resource_flags(pdev, bar);
+	pgmap->ref = &pdev->p2pdma->devmap_ref;
+	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+	addr = devm_memremap_pages(&pdev->dev, pgmap);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
+		goto pgmap_free;
+	}
+
+	error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
+			pci_bus_address(pdev, bar) + offset,
+			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+	if (error)
+		goto pgmap_free;
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+					  &pdev->p2pdma->devmap_ref);
+	if (error)
+		goto pgmap_free;
+
+	pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
+		 &pgmap->res);
+
+	return 0;
+
+pgmap_free:
+	devres_free(pgmap);
+	return error;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+	struct device *parent;
+
+	dev = get_device(dev);
+
+	while (dev) {
+		if (dev_is_pci(dev))
+			return to_pci_dev(dev);
+
+		parent = get_device(dev->parent);
+		put_device(dev);
+		dev = parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+	struct pci_dev *up1, *up2;
+
+	if (!pdev)
+		return NULL;
+
+	up1 = pci_dev_get(pci_upstream_bridge(pdev));
+	if (!up1)
+		return NULL;
+
+	up2 = pci_dev_get(pci_upstream_bridge(up1));
+	pci_dev_put(up1);
+
+	return up2;
+}
+
+/*
+ * Find the distance through the nearest common upstream bridge between
+ * two PCI devices.
+ *
+ * If the two devices are the same device then 0 will be returned.
+ *
+ * If there are two virtual functions of the same device behind the same
+ * bridge port then 2 will be returned (one step down to the bridge then
+ * one step back to the same device).
+ *
+ * In the case where two devices are connected to the same PCIe switch, the
+ * value 4 will be returned. This corresponds to the following PCI tree:
+ *
+ *     -+  Root Port
+ *      \+ Switch Upstream Port
+ *       +-+ Switch Downstream Port
+ *       + \- Device A
+ *       \-+ Switch Downstream Port
+ *         \- Device B
+ *
+ * The distance is 4 because we traverse from Device A through the downstream
+ * port of the switch, to the common upstream port, back up to the second
+ * downstream port and then to Device B.
+ *
+ * Any two devices that don't have a common upstream bridge will return -1.
+ * In this way devices on separate root ports will be rejected, which
+ * is what we want for peer-to-peer, seeing as there's no way to determine
+ * if the root complex supports forwarding between root ports.
+ *
+ * In the case where two devices are connected to different PCIe switches
+ * this function will still return a positive distance as long as both
+ * switches eventually have a common upstream bridge. Note this covers
+ * the case of using multiple PCIe switches to achieve a desired level of
+ * fan-out from a root port. The exact distance will be a function of the
+ * number of switches between Device A and Device B.
+ *
+ */
+static int upstream_bridge_distance(struct pci_dev *a,
+				    struct pci_dev *b)
+{
+	int dist_a = 0;
+	int dist_b = 0;
+	struct pci_dev *aa, *bb = NULL, *tmp;
+
+	aa = pci_dev_get(a);
+
+	while (aa) {
+		dist_b = 0;
+
+		pci_dev_put(bb);
+		bb = pci_dev_get(b);
+
+		while (bb) {
+			if (aa == bb)
+				goto put_and_return;
+
+			tmp = pci_dev_get(pci_upstream_bridge(bb));
+			pci_dev_put(bb);
+			bb = tmp;
+
+			dist_b++;
+		}
+
+		tmp = pci_dev_get(pci_upstream_bridge(aa));
+		pci_dev_put(aa);
+		aa = tmp;
+
+		dist_a++;
+	}
+
+	dist_a = -1;
+	dist_b = 0;
+
+put_and_return:
+	pci_dev_put(bb);
+	pci_dev_put(aa);
+
+	return dist_a + dist_b;
+}
+
+struct pci_p2pdma_client {
+	struct list_head list;
+	struct pci_dev *client;
+	struct pci_dev *provider;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *item, *new_item;
+	struct pci_dev *provider = NULL;
+	struct pci_dev *client;
+	int ret;
+
+	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+		dev_warn(dev, "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+		return -ENODEV;
+	}
+
+
+	client = find_parent_pci_dev(dev);
+	if (!client) {
+		dev_warn(dev, "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+		return -ENODEV;
+	}
+
+	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+	if (item && item->provider) {
+		provider = item->provider;
+
+		if (upstream_bridge_distance(provider, client) < 0) {
+			dev_warn(dev, "cannot be used for peer-to-peer DMA as the client and provider do not share an upstream bridge\n");
+
+			ret = -EXDEV;
+			goto put_client;
+		}
+	}
+
+	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+	if (!new_item) {
+		ret = -ENOMEM;
+		goto put_client;
+	}
+
+	new_item->client = client;
+	new_item->provider = pci_dev_get(provider);
+
+	list_add_tail(&new_item->list, head);
+
+	return 0;
+
+put_client:
+	pci_dev_put(client);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+	list_del(&item->list);
+	pci_dev_put(item->client);
+	pci_dev_put(item->provider);
+	kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+	struct pci_dev *pdev;
+
+	pdev = find_parent_pci_dev(dev);
+	if (!pdev)
+		return;
+
+	list_for_each_entry_safe(pos, tmp, head, list) {
+		if (pos->client != pdev)
+			continue;
+
+		pci_p2pdma_client_free(pos);
+	}
+
+	pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, head, list)
+		pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+/**
+ * pci_p2pdma_distance - Determine the cumulative distance between
+ *	a p2pdma provider and the clients in use.
+ * @provider: p2pdma provider to check against the client list
+ * @clients: list of client devices to check
+ *
+ * Returns -1 if any of the clients are not compatible (i.e. not behind
+ * the same root port as the provider), otherwise returns a non-negative
+ * number where a lower number is the preferable choice. (If there's one
+ * client that's the same as the provider it will return 0, which is the
+ * best choice).
+ *
+ * For now, "compatible" means the provider and the clients are all behind
+ * the same PCI root port. This cuts out cases that may work but is the
+ * safest option for the user. Future work can expand this to white-list
+ * root complexes that can safely forward between their ports.
+ */
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+	int ret;
+	int distance = 0;
+
+	if (list_empty(clients))
+		return -1;
+
+	list_for_each_entry(pos, clients, list) {
+		ret = upstream_bridge_distance(provider, pos->client);
+		if (ret < 0)
+			goto no_match;
+
+		distance += ret;
+	}
+
+	ret = distance;
+
+no_match:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_distance);
+
+/**
+ * pci_p2pdma_assign_provider - Check compatibility (as per pci_p2pdma_distance)
+ *	and assign a provider to a list of clients
+ * @provider: p2pdma provider to assign to the client list
+ * @clients: list of client devices to check
+ *
+ * Returns false if any of the clients are not compatible, true if the
+ * provider was successfully assigned to the clients.
+ */
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+
+	if (pci_p2pdma_distance(provider, clients) < 0)
+		return false;
+
+	list_for_each_entry(pos, clients, list)
+		pos->provider = provider;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_assign_provider);
+
+/**
+ * pci_has_p2pmem - check if a given PCI device has published any p2pmem
+ * @pdev: PCI device to check
+ */
+bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return pdev->p2pdma && pdev->p2pdma->p2pmem_published;
+}
+EXPORT_SYMBOL_GPL(pci_has_p2pmem);
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ *	the specified list of clients and the shortest distance (as determined
+ *	by pci_p2pdma_distance())
+ * @clients: list of client devices to check
+ *
+ * If multiple devices are behind the same switch, the one "closest" to the
+ * client devices in use will be chosen first. (So if one of the providers is
+ * the same as one of the clients, that provider will be used ahead of any
+ * other providers that are unrelated). If multiple providers are an equal
+ * distance away, one will be chosen at random.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found. The
+ * found provider will also be assigned to the client list.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	struct pci_dev *pdev = NULL;
+	struct pci_p2pdma_client *pos;
+	int distance;
+	int closest_distance = INT_MAX;
+	struct pci_dev **closest_pdevs;
+	int dev_cnt = 0;
+	const int max_devs = PAGE_SIZE / sizeof(*closest_pdevs);
+	int i;
+
+	closest_pdevs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+
+	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+		if (!pci_has_p2pmem(pdev))
+			continue;
+
+		distance = pci_p2pdma_distance(pdev, clients);
+		if (distance < 0 || distance > closest_distance)
+			continue;
+
+		if (distance == closest_distance && dev_cnt >= max_devs)
+			continue;
+
+		if (distance < closest_distance) {
+			for (i = 0; i < dev_cnt; i++)
+				pci_dev_put(closest_pdevs[i]);
+
+			dev_cnt = 0;
+			closest_distance = distance;
+		}
+
+		closest_pdevs[dev_cnt++] = pci_dev_get(pdev);
+	}
+
+	if (dev_cnt)
+		pdev = pci_dev_get(closest_pdevs[prandom_u32_max(dev_cnt)]);
+
+	for (i = 0; i < dev_cnt; i++)
+		pci_dev_put(closest_pdevs[i]);
+
+	if (pdev)
+		list_for_each_entry(pos, clients, list)
+			pos->provider = pdev;
+
+	kfree(closest_pdevs);
+	return pdev;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	void *ret;
+
+	if (unlikely(!pdev->p2pdma))
+		return NULL;
+
+	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+		return NULL;
+
+	ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+	if (unlikely(!ret))
+		percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - free peer-to-peer DMA memory
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+	percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
+ *	address obtained with pci_alloc_p2pmem()
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+	if (!addr)
+		return 0;
+	if (!pdev->p2pdma)
+		return 0;
+
+	/*
+	 * Note: when we added the memory to the pool we used the PCI
+	 * bus address as the physical address. So gen_pool_virt_to_phys()
+	 * actually returns the bus address despite the misleading name.
+	 */
+	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev: the device to allocate memory from
+ * @nents: returns the number of SG entries in the allocated list
+ * @length: number of bytes to allocate
+ *
+ * Returns the allocated scatterlist or NULL on error; free it with
+ * pci_p2pmem_free_sgl() when it is no longer needed.
+ */
+struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+					 unsigned int *nents, u32 length)
+{
+	struct scatterlist *sg;
+	void *addr;
+
+	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+	if (!sg)
+		return NULL;
+
+	sg_init_table(sg, 1);
+
+	addr = pci_alloc_p2pmem(pdev, length);
+	if (!addr)
+		goto out_free_sg;
+
+	sg_set_buf(sg, addr, length);
+	*nents = 1;
+	return sg;
+
+out_free_sg:
+	kfree(sg);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
+ * @pdev: the device the memory was allocated from
+ * @sgl: the allocated scatterlist
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl)
+{
+	struct scatterlist *sg;
+	int count;
+
+	for_each_sg(sgl, sg, INT_MAX, count) {
+		if (!sg)
+			break;
+
+		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+	}
+	kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ *	other devices with pci_p2pmem_find()
+ * @pdev: the device with peer-to-peer DMA memory to publish
+ * @publish: set to true to publish the memory, false to unpublish it
+ *
+ * Published memory can be used by other PCI device drivers for
+ * peer-2-peer DMA operations. Non-published memory is reserved for
+ * exlusive use of the device driver that registers the peer-to-peer
+ * memory.
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+	if (!pdev->p2pdma)
+		return;
+
+	pdev->p2pdma->p2pmem_published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_PCI_P2PDMA,
 };
 
 /*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
 static inline bool is_device_private_page(const struct page *page)
 {
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..80e931cb1235
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,100 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+		u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients);
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients);
+bool pci_has_p2pmem(struct pci_dev *pdev);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+					 unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+		size_t size, u64 offset)
+{
+	return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+		struct device *dev)
+{
+	return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+		struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline int pci_p2pdma_distance(struct pci_dev *provider,
+				      struct list_head *clients)
+{
+	return -1;
+}
+static inline bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+					      struct list_head *clients)
+{
+	return false;
+}
+static inline bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return false;
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+		size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+						    void *addr)
+{
+	return 0;
+}
+static inline struct scatterlist * pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+		unsigned int *nents, u32 length)
+{
+	return NULL;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+		struct scatterlist *sgl)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2P_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 73178a2fcee0..005feaea8dca 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -277,6 +277,7 @@ struct pcie_link_state;
 struct pci_vpd;
 struct pci_sriov;
 struct pci_ats;
+struct pci_p2pdma;
 
 /* The pci_dev structure describes PCI devices */
 struct pci_dev {
@@ -430,6 +431,9 @@ struct pci_dev {
 #ifdef CONFIG_PCI_PASID
 	u16		pasid_features;
 #endif
+#ifdef CONFIG_PCI_P2PDMA
+	struct pci_p2pdma *p2pdma;
+#endif
 	phys_addr_t	rom;		/* Physical address if not from BAR */
 	size_t		romlen;		/* Length if not from BAR */
 	char		*driver_override; /* Driver name to force a match */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 02/14] PCI/P2PDMA: Add sysfs group to display p2pmem stats
  2018-04-23 23:30 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Add a sysfs group to display statistics about P2P memory that is
registered in each PCI device.

Attributes in the group display the total amount of P2P memory, the
amount available and whether it is published or not.
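
For example, for a hypothetical device at 0000:84:00.0 that registered
and published 64MB of p2pmem (the values below are purely illustrative),
the new group would read back as:

  /sys/bus/pci/devices/0000:84:00.0/p2pmem/size:      67108864
  /sys/bus/pci/devices/0000:84:00.0/p2pmem/available: 67108864
  /sys/bus/pci/devices/0000:84:00.0/p2pmem/published: 1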

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 Documentation/ABI/testing/sysfs-bus-pci | 25 +++++++++++++++
 drivers/pci/p2pdma.c                    | 54 +++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 44d4b2be92fd..044812c816d0 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -323,3 +323,28 @@ Description:
 
 		This is similar to /sys/bus/pci/drivers_autoprobe, but
 		affects only the VFs associated with a specific PF.
+
+What:		/sys/bus/pci/devices/.../p2pmem/available
+Date:		November 2017
+Contact:	Logan Gunthorpe <logang@deltatee.com>
+Description:
+		If the device has any Peer-to-Peer memory registered, this
+	        file contains the amount of memory that has not been
+		allocated (in decimal).
+
+What:		/sys/bus/pci/devices/.../p2pmem/size
+Date:		November 2017
+Contact:	Logan Gunthorpe <logang@deltatee.com>
+Description:
+		If the device has any Peer-to-Peer memory registered, this
+	        file contains the total amount of memory that the device
+		provides (in decimal).
+
+What:		/sys/bus/pci/devices/.../p2pmem/published
+Date:		November 2017
+Contact:	Logan Gunthorpe <logang@deltatee.com>
+Description:
+		If the device has any Peer-to-Peer memory registered, this
+	        file contains a '1' if the memory has been published for
+		use inside the kernel or a '0' if it is only intended
+		for use within the driver that published it.
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index e524a12eca1f..4daad6374869 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -24,6 +24,54 @@ struct pci_p2pdma {
 	bool p2pmem_published;
 };
 
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	size_t size = 0;
+
+	if (pdev->p2pdma->pool)
+		size = gen_pool_size(pdev->p2pdma->pool);
+
+	return snprintf(buf, PAGE_SIZE, "%zd\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t available_show(struct device *dev, struct device_attribute *attr,
+			      char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	size_t avail = 0;
+
+	if (pdev->p2pdma->pool)
+		avail = gen_pool_avail(pdev->p2pdma->pool);
+
+	return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
+}
+static DEVICE_ATTR_RO(available);
+
+static ssize_t published_show(struct device *dev, struct device_attribute *attr,
+			      char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n",
+			pdev->p2pdma->p2pmem_published);
+}
+static DEVICE_ATTR_RO(published);
+
+static struct attribute *p2pmem_attrs[] = {
+	&dev_attr_size.attr,
+	&dev_attr_available.attr,
+	&dev_attr_published.attr,
+	NULL,
+};
+
+static const struct attribute_group p2pmem_group = {
+	.attrs = p2pmem_attrs,
+	.name = "p2pmem",
+};
+
 static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
 {
 	struct pci_p2pdma *p2p =
@@ -53,6 +101,7 @@ static void pci_p2pdma_release(void *data)
 	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
 
 	gen_pool_destroy(pdev->p2pdma->pool);
+	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
 	pdev->p2pdma = NULL;
 }
 
@@ -83,9 +132,14 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 
 	pdev->p2pdma = p2p;
 
+	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+	if (error)
+		goto out_pool_destroy;
+
 	return 0;
 
 out_pool_destroy:
+	pdev->p2pdma = NULL;
 	gen_pool_destroy(p2p->pool);
 out:
 	devm_kfree(&pdev->dev, p2p);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 03/14] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
  2018-04-23 23:30 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

The DMA address used when mapping PCI P2P memory must be the PCI bus
address. Thus, introduce pci_p2pdma_[un]map_sg() to map the correct
addresses when using P2P memory.

For this, we assume that an SGL passed to these functions contains
either all P2P memory or no P2P memory.
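
As a minimal sketch of how a consumer might use the new helper (this
example is not part of the patch; example_map_request() and the use of
DMA_BIDIRECTIONAL are made up for illustration), a driver that knows
whether its SGL was allocated from p2pmem could pick the mapping path
like so:

	#include <linux/dma-mapping.h>
	#include <linux/errno.h>
	#include <linux/pci-p2pdma.h>
	#include <linux/scatterlist.h>

	/*
	 * Hypothetical helper: use the P2P mapping path when the SGL is
	 * backed by p2pmem, otherwise fall back to the regular DMA API.
	 */
	static int example_map_request(struct device *dev,
				       struct scatterlist *sgl, int nents,
				       bool sgl_is_p2p)
	{
		int mapped;

		if (sgl_is_p2p)
			mapped = pci_p2pdma_map_sg(dev, sgl, nents,
						   DMA_BIDIRECTIONAL);
		else
			mapped = dma_map_sg(dev, sgl, nents,
					    DMA_BIDIRECTIONAL);

		return mapped ? mapped : -EIO;
	}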

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c       | 51 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h   |  1 +
 include/linux/pci-p2pdma.h | 13 ++++++++++++
 3 files changed, 65 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4daad6374869..ed9dce8552a2 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -190,6 +190,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->res.flags = pci_resource_flags(pdev, bar);
 	pgmap->ref = &pdev->p2pdma->devmap_ref;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+	pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
+		pci_resource_start(pdev, bar);
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
@@ -746,3 +748,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 	pdev->p2pdma->p2pmem_published = publish;
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
+
+/**
+ * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to map
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ *
+ * Returns the number of SG entries mapped
+ */
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+		      enum dma_data_direction dir)
+{
+	struct dev_pagemap *pgmap;
+	struct scatterlist *s;
+	phys_addr_t paddr;
+	int i;
+
+	/*
+	 * p2pdma mappings are not compatible with devices that use
+	 * dma_virt_ops.
+	 */
+	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops)
+		return 0;
+
+	for_each_sg(sg, s, nents, i) {
+		pgmap = sg_page(s)->pgmap;
+		paddr = sg_phys(s);
+
+		s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
+		sg_dma_len(s) = s->length;
+	}
+
+	return nents;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
+
+/**
+ * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to map
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ */
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+			 enum dma_data_direction dir)
+{
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9e907c338a44..1660f64ce96f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -125,6 +125,7 @@ struct dev_pagemap {
 	struct device *dev;
 	void *data;
 	enum memory_type type;
+	u64 pci_p2pdma_bus_offset;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 80e931cb1235..0cde88341eeb 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -35,6 +35,10 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
 					 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+		      enum dma_data_direction dir);
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+			 enum dma_data_direction dir);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
@@ -96,5 +100,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
+static inline int pci_p2pdma_map_sg(struct device *dev,
+	struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+	return 0;
+}
+static inline void pci_p2pdma_unmap_sg(struct device *dev,
+	struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+}
 #endif /* CONFIG_PCI_P2PDMA */
 #endif /* _LINUX_PCI_P2P_H */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-04-23 23:30 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

For peer-to-peer transactions to work, the downstream ports in each
switch must not have the ACS flags set. At this time there is no way
to dynamically change the flags and update the corresponding IOMMU
groups, so this is done at enumeration time before the groups are
assigned.

This effectively means that if CONFIG_PCI_P2PDMA is selected then
all devices behind any PCIe switch hierarchy will be in the same IOMMU
group. This implies that individual devices behind any switch
hierarchy cannot be assigned to separate VMs because
there is no isolation between them. Additionally, any malicious PCIe
device will be able to DMA to memory exposed by other EPs in the same
domain, as TLPs will not be checked by the IOMMU.

Given that the intended use case of P2P memory is for users with
custom hardware designed for the purpose, we do not expect distributors
to ever need to enable this option. Users that want to use P2P
must have compiled a custom kernel with this configuration option
and understand the implications regarding ACS. They will either
not require ACS or will have designed the system in such a way that
devices requiring isolation are kept separate from those using P2P
transactions.
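
As a hedged illustration (not part of this patch), the effect of the
change can be checked from a driver's point of view with core PCI
helpers; example_bridge_allows_p2p() is a made-up name for the sketch:

	#include <linux/pci.h>

	/*
	 * Hypothetical check: return true if the P2P Request/Completion
	 * Redirect bits are clear on 'bridge', i.e. peer-to-peer TLPs
	 * routed through it will not be redirected up to the Root Complex.
	 */
	static bool example_bridge_allows_p2p(struct pci_dev *bridge)
	{
		u16 ctrl;
		int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);

		if (!pos)
			return true;	/* no ACS capability to redirect */

		pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);

		return !(ctrl & (PCI_ACS_RR | PCI_ACS_CR));
	}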

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig        |  9 +++++++++
 drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
 drivers/pci/pci.c          |  6 ++++++
 include/linux/pci-p2pdma.h |  5 +++++
 4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index b2396c22b53e..b6db41d4b708 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -139,6 +139,15 @@ config PCI_P2PDMA
 	  transations must be between devices behind the same root port.
 	  (Typically behind a network of PCIe switches).
 
+	  Enabling this option will also disable ACS on all ports behind
+	  any PCIe switch. This effectively puts all devices behind any
+	  switch hierarchy into the same IOMMU group, which implies that
+	  individual devices behind any switch will not be able to be
+	  assigned to separate VMs because there is no isolation between
+	  them. Additionally, any malicious PCIe devices will be able to
+	  DMA to memory exposed by other EPs in the same domain as TLPs
+	  will not be checked by the IOMMU.
+
 	  If unsure, say N.
 
 config PCI_LABEL
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ed9dce8552a2..e9f43b43acac 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
 }
 
 /*
- * If a device is behind a switch, we try to find the upstream bridge
- * port of the switch. This requires two calls to pci_upstream_bridge():
- * one for the upstream port on the switch, one on the upstream port
- * for the next level in the hierarchy. Because of this, devices connected
- * to the root port will be rejected.
+ * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any PCI bridge in order for TLPs not to be forwarded
+ * up to the RC, which is not what we want for P2P.
+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any bridge being in the same IOMMU
+ * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
  */
-static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
 {
-	struct pci_dev *up1, *up2;
+	int pos;
+	u16 ctrl;
 
-	if (!pdev)
-		return NULL;
+	if (!pci_is_bridge(pdev))
+		return 0;
 
-	up1 = pci_dev_get(pci_upstream_bridge(pdev));
-	if (!up1)
-		return NULL;
+	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+	if (!pos)
+		return 0;
+
+	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
+
+	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+
+	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
 
-	up2 = pci_dev_get(pci_upstream_bridge(up1));
-	pci_dev_put(up1);
+	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
 
-	return up2;
+	return 1;
 }
 
 /*
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e597655a5643..7e2f5724ba22 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -16,6 +16,7 @@
 #include <linux/of.h>
 #include <linux/of_pci.h>
 #include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/pm.h>
 #include <linux/slab.h>
 #include <linux/module.h>
@@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
  */
 void pci_enable_acs(struct pci_dev *dev)
 {
+#ifdef CONFIG_PCI_P2PDMA
+	if (pci_p2pdma_disable_acs(dev))
+		return;
+#endif
+
 	if (!pci_acs_enable)
 		return;
 
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 0cde88341eeb..fcb3437a2f3c 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -18,6 +18,7 @@ struct block_device;
 struct scatterlist;
 
 #ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_disable_acs(struct pci_dev *pdev);
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
 int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
@@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
 			 enum dma_data_direction dir);
 #else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+	return 0;
+}
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
 {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 05/14] docs-rst: Add a new directory for PCI documentation
  2018-04-23 23:30 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Benjamin Herrenschmidt, Linus Walleij, Keith Busch, Max Gurtovoy,
	Christoph Hellwig, Jonathan Corbet, Vinod Koul, Jason Gunthorpe,
	Thierry Reding, Alex Williamson, Bjorn Helgaas,
	Mauro Carvalho Chehab, Sagar Dharia, Jens Axboe,
	Greg Kroah-Hartman, Jérôme Glisse, Sanyog Kale,
	Christian König

Add a new directory in the driver API guide for PCI-specific
documentation.

This is in preparation for adding a new PCI P2P DMA driver writer's
guide, which will go in this directory.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Sanyog Kale <sanyog.r.kale@intel.com>
Cc: Sagar Dharia <sdharia@codeaurora.org>
---
 Documentation/driver-api/index.rst         |  2 +-
 Documentation/driver-api/pci/index.rst     | 19 +++++++++++++++++++
 Documentation/driver-api/{ => pci}/pci.rst |  0
 3 files changed, 20 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/driver-api/pci/index.rst
 rename Documentation/driver-api/{ => pci}/pci.rst (100%)

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 6d8352c0f354..9e4cd4e91a49 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -27,7 +27,7 @@ available subsections can be seen below.
    iio/index
    input
    usb/index
-   pci
+   pci/index
    spi
    i2c
    hsi
diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
new file mode 100644
index 000000000000..03b57cbf8cc2
--- /dev/null
+++ b/Documentation/driver-api/pci/index.rst
@@ -0,0 +1,19 @@
+============================================
+The Linux PCI driver implementer's API guide
+============================================
+
+.. class:: toc-title
+
+	   Table of contents
+
+.. toctree::
+   :maxdepth: 2
+
+   pci
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/driver-api/pci.rst b/Documentation/driver-api/pci/pci.rst
similarity index 100%
rename from Documentation/driver-api/pci.rst
rename to Documentation/driver-api/pci/pci.rst
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation
  2018-04-23 23:30 ` Logan Gunthorpe
                     ` (2 preceding siblings ...)
  (?)
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Jonathan Corbet, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Add a reStructuredText file describing how to write drivers
with support for P2P DMA transactions. The document describes
how to use the APIs that were added in the previous few
commits.

Also add an index for the PCI documentation tree, even though this
is the only PCI document that has been converted to reStructuredText
at this time.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jonathan Corbet <corbet@lwn.net>
---
 Documentation/PCI/index.rst             |  14 +++
 Documentation/driver-api/pci/index.rst  |   1 +
 Documentation/driver-api/pci/p2pdma.rst | 166 ++++++++++++++++++++++++++++++++
 Documentation/index.rst                 |   3 +-
 4 files changed, 183 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/PCI/index.rst
 create mode 100644 Documentation/driver-api/pci/p2pdma.rst

diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
new file mode 100644
index 000000000000..2fdc4b3c291d
--- /dev/null
+++ b/Documentation/PCI/index.rst
@@ -0,0 +1,14 @@
+==================================
+Linux PCI Driver Developer's Guide
+==================================
+
+.. toctree::
+
+   p2pdma
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
index 03b57cbf8cc2..d12eeafbfc90 100644
--- a/Documentation/driver-api/pci/index.rst
+++ b/Documentation/driver-api/pci/index.rst
@@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
    :maxdepth: 2
 
    pci
+   p2pdma
 
 .. only::  subproject and html
 
diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
new file mode 100644
index 000000000000..49a512c405b2
--- /dev/null
+++ b/Documentation/driver-api/pci/p2pdma.rst
@@ -0,0 +1,166 @@
+============================
+PCI Peer-to-Peer DMA Support
+============================
+
+The PCI bus has pretty decent support for performing DMA transfers
+between two endpoints on the bus. This type of transaction is
+henceforth called Peer-to-Peer (or P2P). However, there are a number of
+issues that make P2P transactions tricky to do in a perfectly safe way.
+
+One of the biggest issues is that PCI Root Complexes are not required
+to support forwarding packets between Root Ports. To make things worse,
+there is no simple way to determine if a given Root Complex supports
+this or not (see PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
+the kernel only supports P2P when all the endpoints involved are
+behind the same PCIe Root Port, as the spec guarantees that packets
+are always routable in that case but does not require routing between
+Root Ports.
+
+The second issue is that to make use of existing interfaces in Linux,
+memory that is used for P2P transactions needs to be backed by struct
+pages. However, PCI BARs are not typically cache coherent, so there
+are a few corner-case gotchas with these pages and developers need to
+be careful about what they do with them.
+
+
+Driver Writer's Guide
+=====================
+
+In a given P2P implementation there may be three or more different
+types of kernel drivers in play:
+
+* Providers - A driver which provides or publishes P2P resources like
+  memory or doorbell registers to other drivers.
+* Clients - A driver which makes use of a resource by setting up a
+  DMA transaction to or from it.
+* Orchestrators - A driver which orchestrates the flow of data between
+  clients and providers.
+
+In many cases there could be overlap between these three types (i.e.,
+it may be typical for a driver to be both a provider and a client).
+
+For example, in the NVMe Target Copy Offload implementation:
+
+* The NVMe PCI driver is a client, provider and orchestrator
+  in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
+  resource (provider), it accepts P2P memory pages as buffers in requests
+  to be used directly (client) and it can also make use of the CMB as
+  submission queue entries (orchestrator).
+* The RDMA driver is a client in this arrangement so that an RNIC
+  can DMA directly to the memory exposed by the NVMe device.
+* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
+  to the P2P memory (CMB) and then to the NVMe device (and vice versa).
+
+This is currently the only arrangement supported by the kernel but
+one could imagine slight tweaks to this that would allow for the same
+functionality. For example, if a specific RNIC added a BAR with some
+memory behind it, its driver could add support as a P2P provider and
+then the NVMe Target could use the RNIC's memory instead of the CMB
+in cases where the NVMe cards in use do not have CMB support.
+
+
+Provider Drivers
+----------------
+
+A provider simply needs to register a BAR (or a portion of a BAR)
+as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
+This will register struct pages for all the specified memory.
+
+After that it may optionally publish all of its resources as
+P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
+any orchestrator drivers to find and use the memory. When marked in
+this way, the resource must be regular memory with no side effects.
+
+For the time being this is fairly rudimentary in that all resources
+are typically going to be P2P memory. Future work will likely expand
+this to include other types of resources like doorbells.
+
+
+Client Drivers
+--------------
+
+A client driver typically only has to conditionally change its DMA map
+routine to use the mapping functions :c:func:`pci_p2pdma_map_sg()` and
+:c:func:`pci_p2pdma_unmap_sg()` instead of the usual :c:func:`dma_map_sg()`
+functions.
+
+The client may also, optionally, make use of
+:c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping
+functions and when to use the regular mapping functions. In some
+situations, it may be more appropriate to use a flag to indicate a
+given request is P2P memory and map appropriately (for example the
+block layer uses a flag to keep P2P memory out of queues that do not
+have P2P client support). It is important to ensure that struct pages that
+back P2P memory stay out of code that does not have support for them.
+
+
+Orchestrator Drivers
+--------------------
+
+The first task of an orchestrator driver is to compile a list of
+all client drivers that will be involved in a given transaction. For
+example, the NVMe Target driver creates a list including all NVMe drives
+and the RNIC in use. The list is stored as an anonymous struct
+list_head which must be initialized with the usual INIT_LIST_HEAD.
+Clients may then be added to or removed from the list with
+:c:func:`pci_p2pdma_add_client()` and
+:c:func:`pci_p2pdma_remove_client()`, and the whole list may be freed
+with :c:func:`pci_p2pdma_client_list_free()`.
+
+With the client list in hand, the orchestrator may then call
+:c:func:`pci_p2pmem_find()` to obtain a published P2P memory provider
+that is supported by (behind the same root port as) all the clients.
+If more than one provider is supported, the one nearest to all the
+clients will be chosen first. If more than one provider is an equal
+distance away, the one returned will be chosen at random. This function
+returns the PCI device of the provider with a reference taken, so when
+it is no longer needed it should be released with pci_dev_put().
+
+Alternatively, if the orchestrator knows (via some other means)
+which provider it wants to use, it may use :c:func:`pci_has_p2pmem()`
+to determine whether the provider has P2P memory and
+:c:func:`pci_p2pdma_distance()` to determine the cumulative distance
+between it and a potential list of clients.
+
+With a supported provider in hand, the driver can then call
+:c:func:`pci_p2pdma_assign_provider()` to assign the provider
+to the client list. This function returns false if any of the
+clients are unsupported by the provider.
+
+Once a provider is assigned to a client list via either
+:c:func:`pci_p2pmem_find()` or :c:func:`pci_p2pdma_assign_provider()`,
+the list is permanently bound to the provider such that any new clients
+added to the list must be supported by the already selected provider.
+If they are not supported, :c:func:`pci_p2pdma_add_client()` will return
+an error. In this way, orchestrators are free to add and remove devices
+without having to recheck support or tear down existing transfers to
+change P2P providers.
+
+Once a provider is selected, the orchestrator can then use
+:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
+allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
+and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
+allocating scatter-gather lists with P2P memory.
+
+Struct Page Caveats
+-------------------
+
+Driver writers should be very careful about not passing these special
+struct pages to code that isn't prepared for them. At this time, the kernel
+interfaces do not have any checks for ensuring this. This obviously
+precludes passing these pages to userspace.
+
+P2P memory is also technically IO memory but should never have any side
+effects behind it. Thus, the order of loads and stores should not be important
+and ioreadX(), iowriteX() and friends should not be necessary.
+However, as the memory is not cache coherent, if access ever needs to
+be protected by a spinlock then :c:func:`mmiowb()` must be used before
+unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
+Documentation/memory-barriers.txt)
+
+
+P2P DMA Support Library
+=======================
+
+.. kernel-doc:: drivers/pci/p2pdma.c
+   :export:
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 3b99ab931d41..e7938b507df3 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -45,7 +45,7 @@ the kernel interface as seen by application developers.
 .. toctree::
    :maxdepth: 2
 
-   userspace-api/index	      
+   userspace-api/index
 
 
 Introduction to kernel development
@@ -89,6 +89,7 @@ needed).
    sound/index
    crypto/index
    filesystems/index
+   PCI/index
 
 Architecture-specific documentation
 -----------------------------------
-- 
2.11.0
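
To make the provider flow in the documentation above concrete, here is
a minimal sketch of a provider driver's probe routine. The driver name,
BAR number and use of the whole BAR are hypothetical, and the argument
lists are assumptions inferred from the descriptions above rather than
the exact prototypes from this series:

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/* Hypothetical provider: expose all of BAR 4 as P2P memory. */
static int foop_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int rc;

	/* Register the BAR as a P2P DMA resource; this creates struct
	 * pages backing the BAR memory so it can be used with existing
	 * kernel interfaces.
	 */
	rc = pci_p2pdma_add_resource(pdev, 4, pci_resource_len(pdev, 4), 0);
	if (rc)
		return rc;

	/* Optionally publish the memory so orchestrators can find it with
	 * pci_p2pmem_find(). Only publish plain memory with no side effects.
	 */
	pci_p2pmem_publish(pdev, true);

	return 0;
}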

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
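
A client's mapping path, as described above, typically only needs a
branch on is_pci_p2pdma_page(). A rough sketch, again with an assumed
prototype for the p2pdma mapping helper and a hypothetical driver name:

#include <linux/dma-mapping.h>
#include <linux/pci-p2pdma.h>
#include <linux/scatterlist.h>

static int fooc_map_data(struct device *dev, struct scatterlist *sg,
			 int nents, enum dma_data_direction dir)
{
	/* P2P pages must not be handed to the regular DMA mapping path;
	 * use the p2pdma mapping helper for them instead.
	 */
	if (is_pci_p2pdma_page(sg_page(sg)))
		return pci_p2pdma_map_sg(dev, sg, nents, dir);

	return dma_map_sg(dev, sg, nents, dir);
}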

^ permalink raw reply related	[flat|nested] 460+ messages in thread
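
An orchestrator ties the two together: it builds a client list, finds
(or is told) a provider, and allocates P2P memory from that provider.
A condensed sketch of that flow, with hypothetical device pointers and
assumed prototypes for the p2pdma calls:

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>
#include <linux/sizes.h>

static int fooo_setup_p2p(struct device *rnic_dev, struct device *nvme_dev)
{
	LIST_HEAD(clients);
	struct pci_dev *provider;
	void *buf;
	int rc;

	/* Collect every device that will touch the P2P memory. */
	rc = pci_p2pdma_add_client(&clients, rnic_dev);
	if (rc)
		goto out_free;
	rc = pci_p2pdma_add_client(&clients, nvme_dev);
	if (rc)
		goto out_free;

	/* Pick the published provider closest to all the clients. The
	 * returned pci_dev has a reference taken, so drop it with
	 * pci_dev_put() when it is no longer needed.
	 */
	provider = pci_p2pmem_find(&clients);
	if (!provider) {
		rc = -ENODEV;
		goto out_free;
	}

	/* Allocate a buffer of P2P memory from the provider and hand it
	 * to the clients; it is freed here only to keep the sketch short.
	 */
	buf = pci_alloc_p2pmem(provider, SZ_4K);
	if (!buf) {
		rc = -ENOMEM;
		goto out_put;
	}

	pci_free_p2pmem(provider, buf, SZ_4K);
	rc = 0;

out_put:
	pci_dev_put(provider);
out_free:
	pci_p2pdma_client_list_free(&clients);
	return rc;
}

The mmiowb() rule from the struct page caveats looks, in practice, like
the following fragment; the lock and the P2P buffer are again
hypothetical:

#include <linux/io.h>
#include <linux/spinlock.h>
#include <linux/string.h>

static DEFINE_SPINLOCK(foo_p2p_lock);

static void foo_write_record(void *p2p_buf, const void *rec, size_t len)
{
	spin_lock(&foo_p2p_lock);

	/* P2P memory has no side effects, so plain stores are fine... */
	memcpy(p2p_buf, rec, len);

	/* ...but the memory is not cache coherent, so order the writes
	 * against the unlock (see ACQUIRES VS I/O ACCESSES in
	 * Documentation/memory-barriers.txt).
	 */
	mmiowb();
	spin_unlock(&foo_p2p_lock);
}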

* [PATCH v4 07/14] block: Introduce PCI P2P flags for request and request queue
  2018-04-23 23:30 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

QUEUE_FLAG_PCI_P2PDMA is introduced, meaning a driver's request queue
supports targeting P2P memory.

REQ_PCI_P2PDMA is introduced to indicate a particular bio request is
directed to/from PCI P2P memory. A request with this flag is not
accepted unless the corresponding queue has the QUEUE_FLAG_PCI_P2PDMA
flag set.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c          |  3 +++
 include/linux/blk_types.h | 18 +++++++++++++++++-
 include/linux/blkdev.h    |  3 +++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 806ce2442819..35680cbebaf4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2270,6 +2270,9 @@ generic_make_request_checks(struct bio *bio)
 	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
 		goto not_supported;
 
+	if ((bio->bi_opf & REQ_PCI_P2PDMA) && !blk_queue_pci_p2pdma(q))
+		goto not_supported;
+
 	if (should_fail_bio(bio))
 		goto end_io;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..41194d54c45a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -279,6 +279,10 @@ enum req_flag_bits {
 	__REQ_BACKGROUND,	/* background IO */
 	__REQ_NOWAIT,           /* Don't wait if request will block */
 
+#ifdef CONFIG_PCI_P2PDMA
+	__REQ_PCI_P2PDMA,	/* request is to/from P2P memory */
+#endif
+
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
@@ -303,6 +307,18 @@ enum req_flag_bits {
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 
+#ifdef CONFIG_PCI_P2PDMA
+/*
+ * Currently SGLs do not support mixed P2P and regular memory so
+ * requests with P2P memory must not be merged.
+ */
+#define REQ_PCI_P2PDMA		(1ULL << __REQ_PCI_P2PDMA)
+#define REQ_IS_PCI_P2PDMA(req)	((req)->cmd_flags & REQ_PCI_P2PDMA)
+#else
+#define REQ_PCI_P2PDMA		0
+#define REQ_IS_PCI_P2PDMA(req)	0
+#endif /* CONFIG_PCI_P2PDMA */
+
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
@@ -311,7 +327,7 @@ enum req_flag_bits {
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
 #define REQ_NOMERGE_FLAGS \
-	(REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)
+	(REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA | REQ_PCI_P2PDMA)
 
 #define bio_op(bio) \
 	((bio)->bi_opf & REQ_OP_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9af3e0f430bc..116367babb39 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -698,6 +698,7 @@ struct request_queue {
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED    28	/* queue has been quiesced */
 #define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_PCI_P2PDMA  30	/* device supports pci p2p requests */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
@@ -730,6 +731,8 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
 #define blk_queue_scsi_passthrough(q)	\
 	test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_pci_p2pdma(q)	\
+	test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 08/14] IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()
  2018-04-23 23:30 ` Logan Gunthorpe
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

In order to use PCI P2P memory, the pci_p2pdma_[un]map_sg() functions
must be called to map to the correct PCI bus addresses.

To do this, check the first page in the scatter list to see whether it
is P2P memory. At the moment, scatter lists that contain P2P memory
must be homogeneous, so if the first page is P2P memory the entire SGL
is treated as P2P memory.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/infiniband/core/rw.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..f495e8a7f8ac 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
  */
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
 #include <rdma/mr_pool.h>
 #include <rdma/rw.h>
 
@@ -280,7 +281,11 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 	struct ib_device *dev = qp->pd->device;
 	int ret;
 
-	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+	if (is_pci_p2pdma_page(sg_page(sg)))
+		ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
+	else
+		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
 	if (!ret)
 		return -ENOMEM;
 	sg_cnt = ret;
@@ -602,7 +607,11 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		break;
 	}
 
-	ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+	if (is_pci_p2pdma_page(sg_page(sg)))
+		pci_p2pdma_unmap_sg(qp->pd->device->dma_device, sg,
+				    sg_cnt, dir);
+	else
+		ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 09/14] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-04-23 23:30 ` Logan Gunthorpe
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Register the CMB buffer as p2pmem and use the appropriate allocation
functions to create and destroy the IO submission queues.

If the CMB supports WDS and RDS, publish it for use as P2P memory
by other devices.

We can now drop the __iomem annotation on the buffer because, by
convention, devm_memremap_pages() allocates regular memory without
side effects that is accessible without the iomem accessors.
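
The same pattern applies to any PCI driver that wants to expose part
of a BAR for peer-to-peer use. A rough sketch of the provider side
(illustrative only; "bar", "size", "offset" and "len" stand in for the
device-specific values, and error handling is trimmed):

	void *buf;
	pci_bus_addr_t bus_addr;

	if (pci_p2pdma_add_resource(pdev, bar, size, offset))
		return;

	/* Announce the memory for use by peer devices. */
	pci_p2pmem_publish(pdev, true);

	/* Allocations come back as ordinary kernel pointers... */
	buf = pci_alloc_p2pmem(pdev, len);

	/* ...and this is the bus address peers should be given. */
	bus_addr = pci_p2pmem_virt_to_bus(pdev, buf);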

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fac6f1e..514da4de3c85 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/sed-opal.h>
+#include <linux/pci-p2pdma.h>
 
 #include "nvme.h"
 
@@ -92,9 +93,8 @@ struct nvme_dev {
 	struct work_struct remove_work;
 	struct mutex shutdown_lock;
 	bool subsystem;
-	void __iomem *cmb;
-	pci_bus_addr_t cmb_bus_addr;
 	u64 cmb_size;
+	bool cmb_use_sqes;
 	u32 cmbsz;
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
@@ -149,7 +149,7 @@ struct nvme_queue {
 	struct nvme_dev *dev;
 	spinlock_t q_lock;
 	struct nvme_command *sq_cmds;
-	struct nvme_command __iomem *sq_cmds_io;
+	bool sq_cmds_is_io;
 	volatile struct nvme_completion *cqes;
 	struct blk_mq_tags **tags;
 	dma_addr_t sq_dma_addr;
@@ -431,10 +431,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 {
 	u16 tail = nvmeq->sq_tail;
 
-	if (nvmeq->sq_cmds_io)
-		memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
-	else
-		memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
+	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
 
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
@@ -1289,9 +1286,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
 				(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
-	if (nvmeq->sq_cmds)
-		dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
-					nvmeq->sq_cmds, nvmeq->sq_dma_addr);
+
+	if (nvmeq->sq_cmds) {
+		if (nvmeq->sq_cmds_is_io)
+			pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
+					nvmeq->sq_cmds,
+					SQ_SIZE(nvmeq->q_depth));
+		else
+			dma_free_coherent(nvmeq->q_dmadev,
+					  SQ_SIZE(nvmeq->q_depth),
+					  nvmeq->sq_cmds,
+					  nvmeq->sq_dma_addr);
+	}
 }
 
 static void nvme_free_queues(struct nvme_dev *dev, int lowest)
@@ -1371,12 +1377,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
 static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
 				int qid, int depth)
 {
-	/* CMB SQEs will be mapped before creation */
-	if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
-		return 0;
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+
+	if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+		nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+		nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
+						nvmeq->sq_cmds);
+		nvmeq->sq_cmds_is_io = true;
+	}
+
+	if (!nvmeq->sq_cmds) {
+		nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+					&nvmeq->sq_dma_addr, GFP_KERNEL);
+		nvmeq->sq_cmds_is_io = false;
+	}
 
-	nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
-					    &nvmeq->sq_dma_addr, GFP_KERNEL);
 	if (!nvmeq->sq_cmds)
 		return -ENOMEM;
 	return 0;
@@ -1451,13 +1466,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
 
-	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
-		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
-						      dev->ctrl.page_size);
-		nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
-		nvmeq->sq_cmds_io = dev->cmb + offset;
-	}
-
 	/*
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
@@ -1691,9 +1699,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
 		return;
 	dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
 
-	if (!use_cmb_sqes)
-		return;
-
 	size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
 	offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
 	bar = NVME_CMB_BIR(dev->cmbloc);
@@ -1710,11 +1715,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
 	if (size > bar_size - offset)
 		size = bar_size - offset;
 
-	dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
-	if (!dev->cmb)
+	if (pci_p2pdma_add_resource(pdev, bar, size, offset))
 		return;
-	dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
+
 	dev->cmb_size = size;
+	dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
+
+	if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
+			(NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
+		pci_p2pmem_publish(pdev, true);
 
 	if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
 				    &dev_attr_cmb.attr, NULL))
@@ -1724,12 +1733,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
 
 static inline void nvme_release_cmb(struct nvme_dev *dev)
 {
-	if (dev->cmb) {
-		iounmap(dev->cmb);
-		dev->cmb = NULL;
+	if (dev->cmb_size) {
 		sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
 					     &dev_attr_cmb.attr, NULL);
-		dev->cmbsz = 0;
+		dev->cmb_size = 0;
 	}
 }
 
@@ -1928,13 +1935,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (nr_io_queues == 0)
 		return 0;
 
-	if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+	if (dev->cmb_use_sqes) {
 		result = nvme_cmb_qdepth(dev, nr_io_queues,
 				sizeof(struct nvme_command));
 		if (result > 0)
 			dev->q_depth = result;
 		else
-			nvme_release_cmb(dev);
+			dev->cmb_use_sqes = false;
 	}
 
 	do {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 10/14] nvme-pci: Add support for P2P memory in requests
  2018-04-23 23:30 ` Logan Gunthorpe
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

For P2P requests, we must use the pci_p2pdma_[un]map_sg() functions
instead of the dma_map_sg() functions.

With that, we can then indicate PCI P2PDMA support in the request
queue. For this, we create an NVME_F_PCI_P2PDMA flag which tells the
core to set QUEUE_FLAG_PCI_P2PDMA on the request queue.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c |  4 ++++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 19 +++++++++++++++----
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9df4f71e58ca..2ca9debbcf2b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2977,7 +2977,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	ns->queue = blk_mq_init_queue(ctrl->tagset);
 	if (IS_ERR(ns->queue))
 		goto out_free_ns;
+
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
+	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
+
 	ns->queue->queuedata = ns;
 	ns->ctrl = ctrl;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 061fecfd44f5..9a689c13998f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -306,6 +306,7 @@ struct nvme_ctrl_ops {
 	unsigned int flags;
 #define NVME_F_FABRICS			(1 << 0)
 #define NVME_F_METADATA_SUPPORTED	(1 << 1)
+#define NVME_F_PCI_P2PDMA		(1 << 2)
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 514da4de3c85..09b6aba6ed28 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -798,8 +798,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		goto out;
 
 	ret = BLK_STS_RESOURCE;
-	nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
-			DMA_ATTR_NO_WARN);
+
+	if (REQ_IS_PCI_P2PDMA(req))
+		nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
+					  dma_dir);
+	else
+		nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+					     dma_dir,  DMA_ATTR_NO_WARN);
 	if (!nr_mapped)
 		goto out;
 
@@ -844,7 +849,12 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 			DMA_TO_DEVICE : DMA_FROM_DEVICE;
 
 	if (iod->nents) {
-		dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+		if (REQ_IS_PCI_P2PDMA(req))
+			pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
+					    dma_dir);
+		else
+			dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+
 		if (blk_integrity_rq(req)) {
 			if (req_op(req) == REQ_OP_READ)
 				nvme_dif_remap(req, nvme_dif_complete);
@@ -2439,7 +2449,8 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
-	.flags			= NVME_F_METADATA_SUPPORTED,
+	.flags			= NVME_F_METADATA_SUPPORTED |
+				  NVME_F_PCI_P2PDMA,
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_write32		= nvme_pci_reg_write32,
 	.reg_read64		= nvme_pci_reg_read64,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* [PATCH v4 11/14] nvme-pci: Add a quirk for a pseudo CMB
  2018-04-23 23:30 ` Logan Gunthorpe
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Introduce a quirk to use CMB-like memory on older devices that have
an exposed BAR but do not advertise support for it via the CMBLOC and
CMBSZ registers.

We'd like to use some of these older cards to test P2P memory.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/nvme.h |  7 +++++++
 drivers/nvme/host/pci.c  | 24 ++++++++++++++++++++----
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9a689c13998f..885e9ec9b889 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -84,6 +84,13 @@ enum nvme_quirks {
 	 * Supports the LighNVM command set if indicated in vs[1].
 	 */
 	NVME_QUIRK_LIGHTNVM			= (1 << 6),
+
+	/*
+	 * Pseudo CMB Support on BAR 4. For adapters like the Microsemi
+	 * NVRAM that have CMB-like memory on a BAR but does not set
+	 * CMBLOC or CMBSZ.
+	 */
+	NVME_QUIRK_PSEUDO_CMB_BAR4		= (1 << 7),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 09b6aba6ed28..e526e969680a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1685,6 +1685,13 @@ static ssize_t nvme_cmb_show(struct device *dev,
 }
 static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);
 
+static u32 nvme_pseudo_cmbsz(struct pci_dev *pdev, int bar)
+{
+	return NVME_CMBSZ_WDS | NVME_CMBSZ_RDS |
+		(((ilog2(SZ_16M) - 12) / 4) << NVME_CMBSZ_SZU_SHIFT) |
+		((pci_resource_len(pdev, bar) / SZ_16M) << NVME_CMBSZ_SZ_SHIFT);
+}
+
 static u64 nvme_cmb_size_unit(struct nvme_dev *dev)
 {
 	u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK;
@@ -1704,10 +1711,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int bar;
 
-	dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
-	if (!dev->cmbsz)
-		return;
-	dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+	if (dev->ctrl.quirks & NVME_QUIRK_PSEUDO_CMB_BAR4) {
+		dev->cmbsz = nvme_pseudo_cmbsz(pdev, 4);
+		dev->cmbloc = 4;
+	} else {
+		dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
+		if (!dev->cmbsz)
+			return;
+		dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+	}
 
 	size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
 	offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
@@ -2736,6 +2748,10 @@ static const struct pci_device_id nvme_id_table[] = {
 		.driver_data = NVME_QUIRK_LIGHTNVM, },
 	{ PCI_DEVICE(0x1d1d, 0x2807),	/* CNEX WL */
 		.driver_data = NVME_QUIRK_LIGHTNVM, },
+	{ PCI_DEVICE(0x11f8, 0xf117),	/* Microsemi NVRAM adaptor */
+		.driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
+	{ PCI_DEVICE(0x1db1, 0x0002),	/* Everspin nvNitro adaptor */
+		.driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4,  },
 	{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread
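
As a quick sanity check of the synthetic CMBSZ value built by
nvme_pseudo_cmbsz() above, the encode/decode round trip for a hypothetical
128 MiB BAR looks as follows (a standalone sketch; the SZU/SZ field offsets
follow the CMBSZ register layout, bits 11:8 and 31:12, and the RDS/WDS bits
are omitted since they do not affect the size):

#include <stdio.h>
#include <stdint.h>

#define SZ_16M			(16ULL << 20)
#define CMBSZ_SZU_SHIFT		8	/* size units, bits 11:8 */
#define CMBSZ_SZU_MASK		0xf
#define CMBSZ_SZ_SHIFT		12	/* size, bits 31:12 */
#define CMBSZ_SZ_MASK		0xfffff

int main(void)
{
	uint64_t bar_len = 128ULL << 20;	/* hypothetical BAR 4 length */

	/* Encode as nvme_pseudo_cmbsz() does: SZU = (ilog2(16M) - 12) / 4 = 3,
	 * i.e. 16 MiB units; SZ = BAR length in 16 MiB units. */
	uint32_t cmbsz = (3u << CMBSZ_SZU_SHIFT) |
			 ((uint32_t)(bar_len / SZ_16M) << CMBSZ_SZ_SHIFT);

	/* Decode as nvme_cmb_size_unit() * nvme_cmb_size() do. */
	uint64_t unit = 1ULL << (12 + 4 * ((cmbsz >> CMBSZ_SZU_SHIFT) & CMBSZ_SZU_MASK));
	uint64_t size = (cmbsz >> CMBSZ_SZ_SHIFT) & CMBSZ_SZ_MASK;

	printf("decoded CMB size: %llu MiB\n",
	       (unsigned long long)((unit * size) >> 20));	/* prints 128 */
	return 0;
}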

* [PATCH v4 12/14] nvmet: Introduce helper functions to allocate and free request SGLs
  2018-04-23 23:30 ` Logan Gunthorpe
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Add helpers to allocate and free the SGL in a struct nvmet_req:

int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq)
void nvmet_req_free_sgl(struct nvmet_req *req)

This will be expanded in a future patch to implement peer-to-peer
memory DMAs and should be common to all target drivers. The presently
unused 'sq' argument in the alloc function will be needed to decide
whether to use peer-to-peer memory and to obtain the correct provider
to allocate the memory from.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/target/core.c  | 18 ++++++++++++++++++
 drivers/nvme/target/nvmet.h |  2 ++
 2 files changed, 20 insertions(+)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index e95424f172fd..75d44bc3e8d3 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -575,6 +575,24 @@ void nvmet_req_execute(struct nvmet_req *req)
 }
 EXPORT_SYMBOL_GPL(nvmet_req_execute);
 
+int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq)
+{
+	req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
+	if (!req->sg)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvmet_req_alloc_sgl);
+
+void nvmet_req_free_sgl(struct nvmet_req *req)
+{
+	sgl_free(req->sg);
+	req->sg = NULL;
+	req->sg_cnt = 0;
+}
+EXPORT_SYMBOL_GPL(nvmet_req_free_sgl);
+
 static inline bool nvmet_cc_en(u32 cc)
 {
 	return (cc >> NVME_CC_EN_SHIFT) & 0x1;
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 15fd84ab21f8..10b162615a5e 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -273,6 +273,8 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
 void nvmet_req_uninit(struct nvmet_req *req);
 void nvmet_req_execute(struct nvmet_req *req);
 void nvmet_req_complete(struct nvmet_req *req, u16 status);
+int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq);
+void nvmet_req_free_sgl(struct nvmet_req *req);
 
 void nvmet_cq_setup(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq, u16 qid,
 		u16 size);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread
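
For a fabrics transport, the intended calling pattern for these helpers
looks roughly like the sketch below (error handling abbreviated; the rdma
conversion in the next patch follows this shape):

/* Sketch only: allocate the data SGL for an incoming request with the
 * new helper and release it again when the request completes. */
static u16 example_map_data(struct nvmet_req *req, u32 len)
{
	req->transfer_len = len;
	if (!req->transfer_len)
		return 0;				/* no data to transfer */

	if (nvmet_req_alloc_sgl(req, req->sq))		/* fills req->sg / req->sg_cnt */
		return NVME_SC_INTERNAL;

	/* ... hand req->sg / req->sg_cnt to the transport's DMA engine ... */
	return 0;
}

static void example_release(struct nvmet_req *req)
{
	nvmet_req_free_sgl(req);		/* pairs with nvmet_req_alloc_sgl() */
}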

* [PATCH v4 13/14] nvmet-rdma: Use new SGL alloc/free helper for requests
  2018-04-23 23:30 ` Logan Gunthorpe
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Use the new helpers introduced in the previous patch to allocate
the SGLs for the request.

Since req.transfer_len is now used as the length of the SGL, it is
set earlier and cleared on any error. Accumulating the length also
appears to be unnecessary, as the map_sgl functions should only ever
be called once.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/target/rdma.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 52e0c5d579a7..f7a3459d618f 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -430,7 +430,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 	}
 
 	if (rsp->req.sg != &rsp->cmd->inline_sg)
-		sgl_free(rsp->req.sg);
+		nvmet_req_free_sgl(&rsp->req);
 
 	if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
 		nvmet_rdma_process_wr_wait_list(queue);
@@ -564,24 +564,24 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 {
 	struct rdma_cm_id *cm_id = rsp->queue->cm_id;
 	u64 addr = le64_to_cpu(sgl->addr);
-	u32 len = get_unaligned_le24(sgl->length);
 	u32 key = get_unaligned_le32(sgl->key);
 	int ret;
 
+	rsp->req.transfer_len = get_unaligned_le24(sgl->length);
+
 	/* no data command? */
-	if (!len)
+	if (!rsp->req.transfer_len)
 		return 0;
 
-	rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
-	if (!rsp->req.sg)
-		return NVME_SC_INTERNAL;
+	ret = nvmet_req_alloc_sgl(&rsp->req, &rsp->queue->nvme_sq);
+	if (ret < 0)
+		goto error_out;
 
 	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
 			rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
 			nvmet_data_dir(&rsp->req));
 	if (ret < 0)
-		return NVME_SC_INTERNAL;
-	rsp->req.transfer_len += len;
+		goto error_out;
 	rsp->n_rdma += ret;
 
 	if (invalidate) {
@@ -590,6 +590,10 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	}
 
 	return 0;
+
+error_out:
+	rsp->req.transfer_len = 0;
+	return NVME_SC_INTERNAL;
 }
 
 static u16 nvmet_rdma_map_sgl(struct nvmet_rdma_rsp *rsp)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread
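
The 24-bit length field that now feeds req.transfer_len is stored
little-endian in the keyed SGL descriptor; what get_unaligned_le24()
computes on those three bytes is simply (standalone sketch):

#include <stdint.h>

/* Sketch only: little-endian 24-bit load, as used for the keyed SGL
 * descriptor's length field. */
static inline uint32_t sgl_le24_to_cpu(const uint8_t b[3])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) | ((uint32_t)b[2] << 16);
}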

* [PATCH v4 14/14] nvmet: Optionally use PCI P2P memory
  2018-04-23 23:30 ` Logan Gunthorpe
@ 2018-04-23 23:30   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Steve Wise, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only use p2p
memory if a p2p memory device can be found that is behind the same
switch hierarchy as the RDMA port and all the block devices in use.
If the user enables it and no such devices are found, the system will
silently fall back to using regular memory.

If appropriate, the port will allocate memory for the RDMA buffers
for queues from the p2pmem device, falling back to system memory
should anything fail.

Ideally, we'd want to use an NVMe CMB buffer as p2p memory. This would
save an extra PCI transfer, as the NVMe card could take the data
directly out of its own memory. However, at this time, only a limited
number of cards with CMB buffers seem to be available.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
[hch: partial rewrite of the initial code]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/target/configfs.c |  67 ++++++++++++++++++++++
 drivers/nvme/target/core.c     | 127 ++++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/target/io-cmd.c   |   3 +
 drivers/nvme/target/nvmet.h    |  13 +++++
 drivers/nvme/target/rdma.c     |   2 +
 5 files changed, 210 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index ad9ff27234b5..5efe0dae0ee7 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -17,6 +17,8 @@
 #include <linux/slab.h>
 #include <linux/stat.h>
 #include <linux/ctype.h>
+#include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 
 #include "nvmet.h"
 
@@ -864,12 +866,77 @@ static void nvmet_port_release(struct config_item *item)
 	kfree(port);
 }
 
+#ifdef CONFIG_PCI_P2PDMA
+static ssize_t nvmet_p2pmem_show(struct config_item *item, char *page)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+
+	if (!port->use_p2pmem)
+		return sprintf(page, "none\n");
+
+	if (!port->p2p_dev)
+		return sprintf(page, "auto\n");
+
+	return sprintf(page, "%s\n", pci_name(port->p2p_dev));
+}
+
+static ssize_t nvmet_p2pmem_store(struct config_item *item,
+				  const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	struct device *dev;
+	struct pci_dev *p2p_dev = NULL;
+	bool use_p2pmem;
+
+	dev = bus_find_device_by_name(&pci_bus_type, NULL, page);
+	if (dev) {
+		use_p2pmem = true;
+		p2p_dev = to_pci_dev(dev);
+
+		if (!pci_has_p2pmem(p2p_dev)) {
+			pr_err("PCI device has no peer-to-peer memory: %s\n",
+			       page);
+			pci_dev_put(p2p_dev);
+			return -ENODEV;
+		}
+	} else if (sysfs_streq(page, "auto")) {
+		use_p2pmem = 1;
+	} else if ((page[0] == '0' || page[0] == '1') && !iscntrl(page[1])) {
+		/*
+		 * If the user enters a PCI device that  doesn't exist
+		 * like "0000:01:00.1", we don't want strtobool to think
+		 * it's a '0' when it's clearly not what the user wanted.
+		 * So we require 0's and 1's to be exactly one character.
+		 */
+		goto no_such_pci_device;
+	} else if (strtobool(page, &use_p2pmem)) {
+		goto no_such_pci_device;
+	}
+
+	down_write(&nvmet_config_sem);
+	port->use_p2pmem = use_p2pmem;
+	pci_dev_put(port->p2p_dev);
+	port->p2p_dev = p2p_dev;
+	up_write(&nvmet_config_sem);
+
+	return count;
+
+no_such_pci_device:
+	pr_err("No such PCI device: %s\n", page);
+	return -ENODEV;
+}
+CONFIGFS_ATTR(nvmet_, p2pmem);
+#endif /* CONFIG_PCI_P2PDMA */
+
 static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_adrfam,
 	&nvmet_attr_addr_treq,
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
 	&nvmet_attr_addr_trtype,
+#ifdef CONFIG_PCI_P2PDMA
+	&nvmet_attr_p2pmem,
+#endif
 	NULL,
 };
 
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 75d44bc3e8d3..b2b62cd36f6c 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/random.h>
 #include <linux/rculist.h>
+#include <linux/pci-p2pdma.h>
 
 #include "nvmet.h"
 
@@ -271,6 +272,25 @@ void nvmet_put_namespace(struct nvmet_ns *ns)
 	percpu_ref_put(&ns->ref);
 }
 
+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+				   struct nvmet_ns *ns)
+{
+	int ret;
+
+	if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+		pr_err("peer-to-peer DMA is not supported by %s\n",
+		       ns->device_path);
+		return -EINVAL;
+	}
+
+	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+	if (ret)
+		pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+		       ns->device_path, ret);
+
+	return ret;
+}
+
 int nvmet_ns_enable(struct nvmet_ns *ns)
 {
 	struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
 	if (ret)
 		goto out_blkdev_put;
 
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		if (ctrl->p2p_dev) {
+			ret = nvmet_p2pdma_add_client(ctrl, ns);
+			if (ret)
+				goto out_remove_clients;
+		}
+	}
+
 	if (ns->nsid > subsys->max_nsid)
 		subsys->max_nsid = ns->nsid;
 
@@ -328,6 +356,9 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
 out_unlock:
 	mutex_unlock(&subsys->lock);
 	return ret;
+out_remove_clients:
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+		pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
 out_blkdev_put:
 	blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
 	ns->bdev = NULL;
@@ -363,8 +394,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
 	percpu_ref_exit(&ns->ref);
 
 	mutex_lock(&subsys->lock);
-	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
 		nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0);
+	}
 
 	if (ns->bdev)
 		blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
@@ -577,6 +610,21 @@ EXPORT_SYMBOL_GPL(nvmet_req_execute);
 
 int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq)
 {
+	struct pci_dev *p2p_dev = NULL;
+
+	if (sq->ctrl)
+		p2p_dev = sq->ctrl->p2p_dev;
+
+	req->p2p_dev = NULL;
+	if (sq->qid && p2p_dev) {
+		req->sg = pci_p2pmem_alloc_sgl(p2p_dev, &req->sg_cnt,
+					       req->transfer_len);
+		if (req->sg) {
+			req->p2p_dev = p2p_dev;
+			return 0;
+		}
+	}
+
 	req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
 	if (!req->sg)
 		return -ENOMEM;
@@ -587,7 +635,11 @@ EXPORT_SYMBOL_GPL(nvmet_req_alloc_sgl);
 
 void nvmet_req_free_sgl(struct nvmet_req *req)
 {
-	sgl_free(req->sg);
+	if (req->p2p_dev)
+		pci_p2pmem_free_sgl(req->p2p_dev, req->sg);
+	else
+		sgl_free(req->sg);
+
 	req->sg = NULL;
 	req->sg_cnt = 0;
 }
@@ -782,6 +834,74 @@ bool nvmet_host_allowed(struct nvmet_req *req, struct nvmet_subsys *subsys,
 		return __nvmet_host_allowed(subsys, hostnqn);
 }
 
+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
+ * Ι/O commands. This requires the PCI p2p device to be compatible with the
+ * backing device for every namespace on this controller.
+ */
+static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
+{
+	struct nvmet_ns *ns;
+	int ret;
+
+	if (!req->port->use_p2pmem || !req->p2p_client)
+		return;
+
+	mutex_lock(&ctrl->subsys->lock);
+
+	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
+	if (ret) {
+		pr_err("failed adding peer-to-peer DMA client %s: %d\n",
+		       dev_name(req->p2p_client), ret);
+		goto free_devices;
+	}
+
+	list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
+		ret = nvmet_p2pdma_add_client(ctrl, ns);
+		if (ret)
+			goto free_devices;
+	}
+
+	if (req->port->p2p_dev) {
+		if (!pci_p2pdma_assign_provider(req->port->p2p_dev,
+						&ctrl->p2p_clients)) {
+			pr_info("peer-to-peer memory on %s is not supported\n",
+				pci_name(req->port->p2p_dev));
+			goto free_devices;
+		}
+		ctrl->p2p_dev = pci_dev_get(req->port->p2p_dev);
+	} else {
+		ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
+		if (!ctrl->p2p_dev) {
+			pr_info("no supported peer-to-peer memory devices found\n");
+			goto free_devices;
+		}
+	}
+
+	mutex_unlock(&ctrl->subsys->lock);
+
+	pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
+	return;
+
+free_devices:
+	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+	mutex_unlock(&ctrl->subsys->lock);
+}
+
+static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
+{
+	if (!ctrl->p2p_dev)
+		return;
+
+	mutex_lock(&ctrl->subsys->lock);
+
+	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+	pci_dev_put(ctrl->p2p_dev);
+	ctrl->p2p_dev = NULL;
+
+	mutex_unlock(&ctrl->subsys->lock);
+}
+
 u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 		struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
 {
@@ -821,6 +941,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 
 	INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
 	INIT_LIST_HEAD(&ctrl->async_events);
+	INIT_LIST_HEAD(&ctrl->p2p_clients);
 
 	memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
 	memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@@ -876,6 +997,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 		ctrl->kato = DIV_ROUND_UP(kato, 1000);
 	}
 	nvmet_start_keep_alive_timer(ctrl);
+	nvmet_setup_p2pmem(ctrl, req);
 
 	mutex_lock(&subsys->lock);
 	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
@@ -912,6 +1034,7 @@ static void nvmet_ctrl_free(struct kref *ref)
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fatal_err_work);
 
+	nvmet_release_p2pmem(ctrl);
 	ida_simple_remove(&cntlid_ida, ctrl->cntlid);
 
 	kfree(ctrl->sqs);
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index cd2344179673..39bd37f1f312 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
 		op = REQ_OP_READ;
 	}
 
+	if (is_pci_p2pdma_page(sg_page(req->sg)))
+		op_flags |= REQ_PCI_P2PDMA;
+
 	sector = le64_to_cpu(req->cmd->rw.slba);
 	sector <<= (req->ns->blksize_shift - 9);
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 10b162615a5e..f192fefe61d9 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
 	return container_of(to_config_group(item), struct nvmet_ns, group);
 }
 
+static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
+{
+	return disk_to_dev(ns->bdev->bd_disk);
+}
+
 struct nvmet_cq {
 	u16			qid;
 	u16			size;
@@ -98,6 +103,8 @@ struct nvmet_port {
 	struct list_head		referrals;
 	void				*priv;
 	bool				enabled;
+	bool				use_p2pmem;
+	struct pci_dev			*p2p_dev;
 };
 
 static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
@@ -132,6 +139,9 @@ struct nvmet_ctrl {
 
 	const struct nvmet_fabrics_ops *ops;
 
+	struct pci_dev		*p2p_dev;
+	struct list_head	p2p_clients;
+
 	char			subsysnqn[NVMF_NQN_FIELD_LEN];
 	char			hostnqn[NVMF_NQN_FIELD_LEN];
 };
@@ -234,6 +244,9 @@ struct nvmet_req {
 
 	void (*execute)(struct nvmet_req *req);
 	const struct nvmet_fabrics_ops *ops;
+
+	struct pci_dev *p2p_dev;
+	struct device *p2p_client;
 };
 
 static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index f7a3459d618f..27a6d8ea1b56 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -661,6 +661,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
 		cmd->send_sge.addr, cmd->send_sge.length,
 		DMA_TO_DEVICE);
 
+	cmd->req.p2p_client = &queue->dev->device->dev;
+
 	if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
 			&queue->nvme_sq, &nvmet_rdma_ops))
 		return;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread
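
Stripped of locking and error reporting, the provider-selection policy in
nvmet_setup_p2pmem() above boils down to the following sketch (not part of
the patch):

/* Sketch only: prefer an explicitly configured provider on the port,
 * otherwise let the p2pdma core pick the closest provider that is
 * compatible with every registered client. */
static struct pci_dev *choose_p2p_provider(struct nvmet_port *port,
					   struct list_head *p2p_clients)
{
	if (port->p2p_dev) {
		if (!pci_p2pdma_assign_provider(port->p2p_dev, p2p_clients))
			return NULL;	/* configured device cannot serve these clients */
		return pci_dev_get(port->p2p_dev);
	}

	return pci_p2pmem_find(p2p_clients);	/* NULL if nothing suitable is found */
}

If this yields NULL, ctrl->p2p_dev stays unset and every request allocation
falls back to regular system memory, matching the "silently fall back"
behaviour described in the commit message.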

* [PATCH v4 14/14] nvmet: Optionally use PCI P2P memory
@ 2018-04-23 23:30   ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-04-23 23:30 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson, Christian König, Logan Gunthorpe,
	Steve Wise

We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only use the
p2p memory if a p2p memory device can be found which is behind the
same switch hierarchy as the RDMA port and all the block devices in
use. If the user enables it and no such devices are found, the system
will silently fall back on using regular memory.

If appropriate, that port will allocate the RDMA buffers for its
queues from the p2pmem device, falling back to system memory should
anything fail.

Ideally, we'd want to use an NVMe CMB buffer as p2p memory. This would
save an extra PCI transfer, as the NVMe card could just take the data
out of its own memory. However, at this time, only a limited number
of cards with CMB buffers seem to be available.
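
A minimal sketch of the only transport-side hook this requires: set
req->p2p_client to the device that will actually perform the DMA
before calling nvmet_req_init(), exactly as the rdma.c hunk below
does. The foo_* names here are hypothetical and not part of this
series:

	static void foo_handle_command(struct foo_queue *queue,
				       struct foo_cmd *cmd)
	{
		/* device whose reachability gates P2P memory use */
		cmd->req.p2p_client = &queue->foo_pci_dev->dev;

		if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
				    &queue->nvme_sq, &foo_nvmet_ops))
			return;

		/* command execution continues as usual */
	}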

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
[hch: partial rewrite of the initial code]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/target/configfs.c |  67 ++++++++++++++++++++++
 drivers/nvme/target/core.c     | 127 ++++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/target/io-cmd.c   |   3 +
 drivers/nvme/target/nvmet.h    |  13 +++++
 drivers/nvme/target/rdma.c     |   2 +
 5 files changed, 210 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index ad9ff27234b5..5efe0dae0ee7 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -17,6 +17,8 @@
 #include <linux/slab.h>
 #include <linux/stat.h>
 #include <linux/ctype.h>
+#include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 
 #include "nvmet.h"
 
@@ -864,12 +866,77 @@ static void nvmet_port_release(struct config_item *item)
 	kfree(port);
 }
 
+#ifdef CONFIG_PCI_P2PDMA
+static ssize_t nvmet_p2pmem_show(struct config_item *item, char *page)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+
+	if (!port->use_p2pmem)
+		return sprintf(page, "none\n");
+
+	if (!port->p2p_dev)
+		return sprintf(page, "auto\n");
+
+	return sprintf(page, "%s\n", pci_name(port->p2p_dev));
+}
+
+static ssize_t nvmet_p2pmem_store(struct config_item *item,
+				  const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	struct device *dev;
+	struct pci_dev *p2p_dev = NULL;
+	bool use_p2pmem;
+
+	dev = bus_find_device_by_name(&pci_bus_type, NULL, page);
+	if (dev) {
+		use_p2pmem = true;
+		p2p_dev = to_pci_dev(dev);
+
+		if (!pci_has_p2pmem(p2p_dev)) {
+			pr_err("PCI device has no peer-to-peer memory: %s\n",
+			       page);
+			pci_dev_put(p2p_dev);
+			return -ENODEV;
+		}
+	} else if (sysfs_streq(page, "auto")) {
+		use_p2pmem = 1;
+	} else if ((page[0] == '0' || page[0] == '1') && !iscntrl(page[1])) {
+		/*
+		 * If the user enters a PCI device that  doesn't exist
+		 * like "0000:01:00.1", we don't want strtobool to think
+		 * it's a '0' when it's clearly not what the user wanted.
+		 * So we require 0's and 1's to be exactly one character.
+		 */
+		goto no_such_pci_device;
+	} else if (strtobool(page, &use_p2pmem)) {
+		goto no_such_pci_device;
+	}
+
+	down_write(&nvmet_config_sem);
+	port->use_p2pmem = use_p2pmem;
+	pci_dev_put(port->p2p_dev);
+	port->p2p_dev = p2p_dev;
+	up_write(&nvmet_config_sem);
+
+	return count;
+
+no_such_pci_device:
+	pr_err("No such PCI device: %s\n", page);
+	return -ENODEV;
+}
+CONFIGFS_ATTR(nvmet_, p2pmem);
+#endif /* CONFIG_PCI_P2PDMA */
+
 static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_adrfam,
 	&nvmet_attr_addr_treq,
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
 	&nvmet_attr_addr_trtype,
+#ifdef CONFIG_PCI_P2PDMA
+	&nvmet_attr_p2pmem,
+#endif
 	NULL,
 };
 
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 75d44bc3e8d3..b2b62cd36f6c 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/random.h>
 #include <linux/rculist.h>
+#include <linux/pci-p2pdma.h>
 
 #include "nvmet.h"
 
@@ -271,6 +272,25 @@ void nvmet_put_namespace(struct nvmet_ns *ns)
 	percpu_ref_put(&ns->ref);
 }
 
+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+				   struct nvmet_ns *ns)
+{
+	int ret;
+
+	if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+		pr_err("peer-to-peer DMA is not supported by %s\n",
+		       ns->device_path);
+		return -EINVAL;
+	}
+
+	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+	if (ret)
+		pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+		       ns->device_path, ret);
+
+	return ret;
+}
+
 int nvmet_ns_enable(struct nvmet_ns *ns)
 {
 	struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
 	if (ret)
 		goto out_blkdev_put;
 
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		if (ctrl->p2p_dev) {
+			ret = nvmet_p2pdma_add_client(ctrl, ns);
+			if (ret)
+				goto out_remove_clients;
+		}
+	}
+
 	if (ns->nsid > subsys->max_nsid)
 		subsys->max_nsid = ns->nsid;
 
@@ -328,6 +356,9 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
 out_unlock:
 	mutex_unlock(&subsys->lock);
 	return ret;
+out_remove_clients:
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+		pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
 out_blkdev_put:
 	blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
 	ns->bdev = NULL;
@@ -363,8 +394,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
 	percpu_ref_exit(&ns->ref);
 
 	mutex_lock(&subsys->lock);
-	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
 		nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0);
+	}
 
 	if (ns->bdev)
 		blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
@@ -577,6 +610,21 @@ EXPORT_SYMBOL_GPL(nvmet_req_execute);
 
 int nvmet_req_alloc_sgl(struct nvmet_req *req, struct nvmet_sq *sq)
 {
+	struct pci_dev *p2p_dev = NULL;
+
+	if (sq->ctrl)
+		p2p_dev = sq->ctrl->p2p_dev;
+
+	req->p2p_dev = NULL;
+	if (sq->qid && p2p_dev) {
+		req->sg = pci_p2pmem_alloc_sgl(p2p_dev, &req->sg_cnt,
+					       req->transfer_len);
+		if (req->sg) {
+			req->p2p_dev = p2p_dev;
+			return 0;
+		}
+	}
+
 	req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
 	if (!req->sg)
 		return -ENOMEM;
@@ -587,7 +635,11 @@ EXPORT_SYMBOL_GPL(nvmet_req_alloc_sgl);
 
 void nvmet_req_free_sgl(struct nvmet_req *req)
 {
-	sgl_free(req->sg);
+	if (req->p2p_dev)
+		pci_p2pmem_free_sgl(req->p2p_dev, req->sg);
+	else
+		sgl_free(req->sg);
+
 	req->sg = NULL;
 	req->sg_cnt = 0;
 }
@@ -782,6 +834,74 @@ bool nvmet_host_allowed(struct nvmet_req *req, struct nvmet_subsys *subsys,
 		return __nvmet_host_allowed(subsys, hostnqn);
 }
 
+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
+ * I/O commands. This requires the PCI p2p device to be compatible with the
+ * backing device for every namespace on this controller.
+ */
+static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
+{
+	struct nvmet_ns *ns;
+	int ret;
+
+	if (!req->port->use_p2pmem || !req->p2p_client)
+		return;
+
+	mutex_lock(&ctrl->subsys->lock);
+
+	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
+	if (ret) {
+		pr_err("failed adding peer-to-peer DMA client %s: %d\n",
+		       dev_name(req->p2p_client), ret);
+		goto free_devices;
+	}
+
+	list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
+		ret = nvmet_p2pdma_add_client(ctrl, ns);
+		if (ret)
+			goto free_devices;
+	}
+
+	if (req->port->p2p_dev) {
+		if (!pci_p2pdma_assign_provider(req->port->p2p_dev,
+						&ctrl->p2p_clients)) {
+			pr_info("peer-to-peer memory on %s is not supported\n",
+				pci_name(req->port->p2p_dev));
+			goto free_devices;
+		}
+		ctrl->p2p_dev = pci_dev_get(req->port->p2p_dev);
+	} else {
+		ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
+		if (!ctrl->p2p_dev) {
+			pr_info("no supported peer-to-peer memory devices found\n");
+			goto free_devices;
+		}
+	}
+
+	mutex_unlock(&ctrl->subsys->lock);
+
+	pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
+	return;
+
+free_devices:
+	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+	mutex_unlock(&ctrl->subsys->lock);
+}
+
+static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
+{
+	if (!ctrl->p2p_dev)
+		return;
+
+	mutex_lock(&ctrl->subsys->lock);
+
+	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+	pci_dev_put(ctrl->p2p_dev);
+	ctrl->p2p_dev = NULL;
+
+	mutex_unlock(&ctrl->subsys->lock);
+}
+
 u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 		struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
 {
@@ -821,6 +941,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 
 	INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
 	INIT_LIST_HEAD(&ctrl->async_events);
+	INIT_LIST_HEAD(&ctrl->p2p_clients);
 
 	memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
 	memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@@ -876,6 +997,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 		ctrl->kato = DIV_ROUND_UP(kato, 1000);
 	}
 	nvmet_start_keep_alive_timer(ctrl);
+	nvmet_setup_p2pmem(ctrl, req);
 
 	mutex_lock(&subsys->lock);
 	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
@@ -912,6 +1034,7 @@ static void nvmet_ctrl_free(struct kref *ref)
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fatal_err_work);
 
+	nvmet_release_p2pmem(ctrl);
 	ida_simple_remove(&cntlid_ida, ctrl->cntlid);
 
 	kfree(ctrl->sqs);
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index cd2344179673..39bd37f1f312 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
 		op = REQ_OP_READ;
 	}
 
+	if (is_pci_p2pdma_page(sg_page(req->sg)))
+		op_flags |= REQ_PCI_P2PDMA;
+
 	sector = le64_to_cpu(req->cmd->rw.slba);
 	sector <<= (req->ns->blksize_shift - 9);
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 10b162615a5e..f192fefe61d9 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
 	return container_of(to_config_group(item), struct nvmet_ns, group);
 }
 
+static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
+{
+	return disk_to_dev(ns->bdev->bd_disk);
+}
+
 struct nvmet_cq {
 	u16			qid;
 	u16			size;
@@ -98,6 +103,8 @@ struct nvmet_port {
 	struct list_head		referrals;
 	void				*priv;
 	bool				enabled;
+	bool				use_p2pmem;
+	struct pci_dev			*p2p_dev;
 };
 
 static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
@@ -132,6 +139,9 @@ struct nvmet_ctrl {
 
 	const struct nvmet_fabrics_ops *ops;
 
+	struct pci_dev		*p2p_dev;
+	struct list_head	p2p_clients;
+
 	char			subsysnqn[NVMF_NQN_FIELD_LEN];
 	char			hostnqn[NVMF_NQN_FIELD_LEN];
 };
@@ -234,6 +244,9 @@ struct nvmet_req {
 
 	void (*execute)(struct nvmet_req *req);
 	const struct nvmet_fabrics_ops *ops;
+
+	struct pci_dev *p2p_dev;
+	struct device *p2p_client;
 };
 
 static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index f7a3459d618f..27a6d8ea1b56 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -661,6 +661,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
 		cmd->send_sge.addr, cmd->send_sge.length,
 		DMA_TO_DEVICE);
 
+	cmd->req.p2p_client = &queue->dev->device->dev;
+
 	if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
 			&queue->nvme_sq, &nvmet_rdma_ops))
 		return;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-04-23 23:30   ` Logan Gunthorpe
@ 2018-04-24  3:33     ` Randy Dunlap
  -1 siblings, 0 replies; 460+ messages in thread
From: Randy Dunlap @ 2018-04-24  3:33 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Alex Williamson, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On 04/23/2018 04:30 PM, Logan Gunthorpe wrote:
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>  	  transations must be between devices behind the same root port.
>  	  (Typically behind a network of PCIe switches).
>  
> +	  Enabling this option will also disable ACS on all ports behind
> +	  any PCIe switch. This effectively puts all devices behind any
> +	  switch heirarchy into the same IOMMU group. Which implies that

	         hierarchy                     group, which

and the same fixes in the commit description...

> +	  individual devices behind any switch will not be able to be
> +	  assigned to separate VMs because there is no isolation between
> +	  them. Additionally, any malicious PCIe devices will be able to
> +	  DMA to memory exposed by other EPs in the same domain as TLPs
> +	  will not be checked by the IOMMU.
> +
>  	  If unsure, say N.
>  
>  config PCI_LABEL


-- 
~Randy

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-04-23 23:30 ` Logan Gunthorpe
@ 2018-05-02 11:51   ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-02 11:51 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Hi Logan,

it would be rather nice if you could separate out the functions that
detect whether peer2peer is possible between two devices.

That would allow me to reuse the same logic for GPU peer2peer where I 
don't really have ZONE_DEVICE.
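
For reference, here is roughly how a client driver drives that
detection today with the helpers this series exports
(pci_p2pdma_add_client(), pci_p2pmem_find() and
pci_p2pdma_client_list_free()). The compatibility check is currently
tied to finding a p2pmem provider, which is the part that would need
splitting out; the gpu_* names below are hypothetical:

#include <linux/pci-p2pdma.h>

/*
 * Sketch only: find a p2pmem provider that is usable by two arbitrary
 * client devices (e.g. a GPU and its peer).  The caller owns the
 * returned reference and drops it with pci_dev_put() when done.
 */
static struct pci_dev *gpu_find_p2p_provider(struct device *gpu,
					     struct device *peer)
{
	LIST_HEAD(clients);
	struct pci_dev *provider = NULL;

	if (pci_p2pdma_add_client(&clients, gpu))
		goto out;
	if (pci_p2pdma_add_client(&clients, peer))
		goto out;

	/* NULL if no provider sits behind the same switch as both clients */
	provider = pci_p2pmem_find(&clients);
out:
	pci_p2pdma_client_list_free(&clients);
	return provider;
}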

Regards,
Christian.

On 24.04.2018 at 01:30, Logan Gunthorpe wrote:
> Hi Everyone,
>
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
>
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
>
> Thanks,
>
> Logan
>
> Changes in v4:
>
> * Change the original upstream_bridges_match() function to
>    upstream_bridge_distance() which calculates the distance between two
>    devices as long as they are behind the same root port. This should
>    address Bjorn's concerns that the code was to focused on
>    being behind a single switch.
>
> * The disable ACS function now disables ACS for all bridge ports instead
>    of switch ports (ie. those that had two upstream_bridge ports).
>
> * Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
>    API to be more like sgl_alloc() in that the alloc function returns
>    the allocated scatterlist and nents is not required bythe free
>    function.
>
> * Moved the new documentation into the driver-api tree as requested
>    by Jonathan
>
> * Add SGL alloc and free helpers in the nvmet code so that the
>    individual drivers can share the code that allocates P2P memory.
>    As requested by Christoph.
>
> * Cleanup the nvmet_p2pmem_store() function as Christoph
>    thought my first attempt was ugly.
>
> * Numerous commit message and comment fix-ups
>
> Changes in v3:
>
> * Many more fixes and minor cleanups that were spotted by Bjorn
>
> * Additional explanation of the ACS change in both the commit message
>    and Kconfig doc. Also, the code that disables the ACS bits is surrounded
>    explicitly by an #ifdef
>
> * Removed the flag we added to rdma_rw_ctx() in favour of using
>    is_pci_p2pdma_page(), as suggested by Sagi.
>
> * Adjust pci_p2pmem_find() so that it prefers P2P providers that
>    are closest to (or the same as) the clients using them. In cases
>    of ties, the provider is randomly chosen.
>
> * Modify the NVMe Target code so that the PCI device name of the provider
>    may be explicitly specified, bypassing the logic in pci_p2pmem_find().
>    (Note: it's still enforced that the provider must be behind the
>     same switch as the clients).
>
> * As requested by Bjorn, added documentation for driver writers.
>
>
> Changes in v2:
>
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
>    as a bunch of cleanup and spelling fixes he pointed out in the last
>    series.
>
> * To address Alex's ACS concerns, we change to a simpler method of
>    just disabling ACS behind switches for any kernel that has
>    CONFIG_PCI_P2PDMA.
>
> * We also reject using devices that employ 'dma_virt_ops' which should
>    fairly simply handle Jason's concerns that this work might break with
>    the HFI, QIB and rxe drivers that use the virtual ops to implement
>    their own special DMA operations.
>
> --
>
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in the kernel with initial support for the NVMe fabrics target
> subsystem. Many thanks go to Christoph Hellwig who provided valuable
> feedback to get these patches to where they are today.
>
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVMe target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU).
>
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch hierarchy. This will mean many setups that
> could likely work well will not be supported so that we can be more
> confident it will work and not place any responsibility on the user to
> understand their topology. (We chose to go this route based on feedback
> we received at the last LSF). Future work may enable these transfers
> using a white list of known good root complexes. However, at this time,
> there is no reliable way to ensure that Peer-to-Peer transactions are
> permitted between PCI Root Ports.
>
> In order to enable this functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
>
> When the PCI P2PDMA config option is selected the ACS bits in every
> bridge port in the system are turned off to allow traffic to
> pass freely behind the root port. At this time, the bit must be disabled
> at boot so the IOMMU subsystem can correctly create the groups, though
> this could be addressed in the future. There is no way to dynamically
> disable the bit and alter the groups.
>
> Another set of functions allow a client driver to create a list of
> client devices that will be used in a given P2P transactions and then
> use that list to find any P2P memory that is supported by all the
> client devices.
>
> In the block layer, we also introduce a P2P request flag to indicate a
> given request targets P2P memory as well as a flag for a request queue
> to indicate a given queue supports targeting P2P memory. P2P requests
> will only be accepted by queues that support it. Also, P2P requests
> are marked to not be merged seeing a non-homogenous request would
> complicate the DMA mapping requirements.
>
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the corrent pci_bus_addr_t for the DMA transactions.
>
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not. To avoid odd RDMA devices
> that don't use the proper DMA infrastructure this code rejects using
> any device that employs the virt_dma_ops implementation.
>
> Finally, in the NVMe fabrics target port we introduce a new
> configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> to find P2P memory supported by the RDMA NIC and all namespaces. If
> supported memory is found, it will be used in all IO transfers. And if
> a port is using P2P memory, adding new namespaces that are not supported
> by that memory will fail.
>
> These patches have been tested on a number of Intel based systems and
> for a variety of RDMA NICs (Mellanox, Broadcomm, Chelsio) and NVMe
> SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
> Microsemi, Chelsio and Everspin) using switches from both Microsemi
> and Broadcomm.
>
> Logan Gunthorpe (14):
>    PCI/P2PDMA: Support peer-to-peer memory
>    PCI/P2PDMA: Add sysfs group to display p2pmem stats
>    PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>    PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>    docs-rst: Add a new directory for PCI documentation
>    PCI/P2PDMA: Add P2P DMA driver writer's documentation
>    block: Introduce PCI P2P flags for request and request queue
>    IB/core: Ensure we map P2P memory correctly in
>      rdma_rw_ctx_[init|destroy]()
>    nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>    nvme-pci: Add support for P2P memory in requests
>    nvme-pci: Add a quirk for a pseudo CMB
>    nvmet: Introduce helper functions to allocate and free request SGLs
>    nvmet-rdma: Use new SGL alloc/free helper for requests
>    nvmet: Optionally use PCI P2P memory
>
>   Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>   Documentation/PCI/index.rst                |  14 +
>   Documentation/driver-api/index.rst         |   2 +-
>   Documentation/driver-api/pci/index.rst     |  20 +
>   Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
>   Documentation/driver-api/{ => pci}/pci.rst |   0
>   Documentation/index.rst                    |   3 +-
>   block/blk-core.c                           |   3 +
>   drivers/infiniband/core/rw.c               |  13 +-
>   drivers/nvme/host/core.c                   |   4 +
>   drivers/nvme/host/nvme.h                   |   8 +
>   drivers/nvme/host/pci.c                    | 118 +++--
>   drivers/nvme/target/configfs.c             |  67 +++
>   drivers/nvme/target/core.c                 | 143 ++++-
>   drivers/nvme/target/io-cmd.c               |   3 +
>   drivers/nvme/target/nvmet.h                |  15 +
>   drivers/nvme/target/rdma.c                 |  22 +-
>   drivers/pci/Kconfig                        |  26 +
>   drivers/pci/Makefile                       |   1 +
>   drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
>   drivers/pci/pci.c                          |   6 +
>   include/linux/blk_types.h                  |  18 +-
>   include/linux/blkdev.h                     |   3 +
>   include/linux/memremap.h                   |  19 +
>   include/linux/pci-p2pdma.h                 | 118 +++++
>   include/linux/pci.h                        |   4 +
>   26 files changed, 1579 insertions(+), 56 deletions(-)
>   create mode 100644 Documentation/PCI/index.rst
>   create mode 100644 Documentation/driver-api/pci/index.rst
>   create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>   rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>   create mode 100644 drivers/pci/p2pdma.c
>   create mode 100644 include/linux/pci-p2pdma.h
>
> --
> 2.11.0


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-02 11:51   ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-02 11:51 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson

Hi Logan,

it would be rather nice to have if you could separate out the functions 
to detect if peer2peer is possible between two devices.

That would allow me to reuse the same logic for GPU peer2peer where I 
don't really have ZONE_DEVICE.

Regards,
Christian.

Am 24.04.2018 um 01:30 schrieb Logan Gunthorpe:
> Hi Everyone,
>
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
>
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
>
> Thanks,
>
> Logan
>
> Changes in v4:
>
> * Change the original upstream_bridges_match() function to
>    upstream_bridge_distance() which calculates the distance between two
>    devices as long as they are behind the same root port. This should
>    address Bjorn's concerns that the code was to focused on
>    being behind a single switch.
>
> * The disable ACS function now disables ACS for all bridge ports instead
>    of switch ports (ie. those that had two upstream_bridge ports).
>
> * Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
>    API to be more like sgl_alloc() in that the alloc function returns
>    the allocated scatterlist and nents is not required bythe free
>    function.
>
> * Moved the new documentation into the driver-api tree as requested
>    by Jonathan
>
> * Add SGL alloc and free helpers in the nvmet code so that the
>    individual drivers can share the code that allocates P2P memory.
>    As requested by Christoph.
>
> * Cleanup the nvmet_p2pmem_store() function as Christoph
>    thought my first attempt was ugly.
>
> * Numerous commit message and comment fix-ups
>
> Changes in v3:
>
> * Many more fixes and minor cleanups that were spotted by Bjorn
>
> * Additional explanation of the ACS change in both the commit message
>    and Kconfig doc. Also, the code that disables the ACS bits is surrounded
>    explicitly by an #ifdef
>
> * Removed the flag we added to rdma_rw_ctx() in favour of using
>    is_pci_p2pdma_page(), as suggested by Sagi.
>
> * Adjust pci_p2pmem_find() so that it prefers P2P providers that
>    are closest to (or the same as) the clients using them. In cases
>    of ties, the provider is randomly chosen.
>
> * Modify the NVMe Target code so that the PCI device name of the provider
>    may be explicitly specified, bypassing the logic in pci_p2pmem_find().
>    (Note: it's still enforced that the provider must be behind the
>     same switch as the clients).
>
> * As requested by Bjorn, added documentation for driver writers.
>
>
> Changes in v2:
>
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
>    as a bunch of cleanup and spelling fixes he pointed out in the last
>    series.
>
> * To address Alex's ACS concerns, we change to a simpler method of
>    just disabling ACS behind switches for any kernel that has
>    CONFIG_PCI_P2PDMA.
>
> * We also reject using devices that employ 'dma_virt_ops' which should
>    fairly simply handle Jason's concerns that this work might break with
>    the HFI, QIB and rxe drivers that use the virtual ops to implement
>    their own special DMA operations.
>
> --
>
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in the kernel with initial support for the NVMe fabrics target
> subsystem. Many thanks go to Christoph Hellwig who provided valuable
> feedback to get these patches to where they are today.
>
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVMe target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU).
>
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch hierarchy. This will mean many setups that
> could likely work well will not be supported so that we can be more
> confident it will work and not place any responsibility on the user to
> understand their topology. (We chose to go this route based on feedback
> we received at the last LSF). Future work may enable these transfers
> using a white list of known good root complexes. However, at this time,
> there is no reliable way to ensure that Peer-to-Peer transactions are
> permitted between PCI Root Ports.
>
> In order to enable this functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
>
> When the PCI P2PDMA config option is selected the ACS bits in every
> bridge port in the system are turned off to allow traffic to
> pass freely behind the root port. At this time, the bit must be disabled
> at boot so the IOMMU subsystem can correctly create the groups, though
> this could be addressed in the future. There is no way to dynamically
> disable the bit and alter the groups.
>
> Another set of functions allows a client driver to create a list of
> client devices that will be used in a given P2P transaction and then
> use that list to find any P2P memory that is supported by all the
> client devices.
>
> In the block layer, we also introduce a P2P request flag to indicate a
> given request targets P2P memory as well as a flag for a request queue
> to indicate a given queue supports targeting P2P memory. P2P requests
> will only be accepted by queues that support it. Also, P2P requests
> are marked to not be merged, since a non-homogeneous request would
> complicate the DMA mapping requirements.
>
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the correct pci_bus_addr_t for the DMA transactions.
>
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not. To avoid odd RDMA devices
> that don't use the proper DMA infrastructure this code rejects using
> any device that employs the dma_virt_ops implementation.
>
> Finally, in the NVMe fabrics target port we introduce a new
> configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> to find P2P memory supported by the RDMA NIC and all namespaces. If
> supported memory is found, it will be used in all IO transfers. And if
> a port is using P2P memory, adding new namespaces that are not supported
> by that memory will fail.
>
> These patches have been tested on a number of Intel-based systems and
> for a variety of RDMA NICs (Mellanox, Broadcom, Chelsio) and NVMe
> SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
> Microsemi, Chelsio and Everspin) using switches from both Microsemi
> and Broadcom.
>
> Logan Gunthorpe (14):
>    PCI/P2PDMA: Support peer-to-peer memory
>    PCI/P2PDMA: Add sysfs group to display p2pmem stats
>    PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>    PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>    docs-rst: Add a new directory for PCI documentation
>    PCI/P2PDMA: Add P2P DMA driver writer's documentation
>    block: Introduce PCI P2P flags for request and request queue
>    IB/core: Ensure we map P2P memory correctly in
>      rdma_rw_ctx_[init|destroy]()
>    nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>    nvme-pci: Add support for P2P memory in requests
>    nvme-pci: Add a quirk for a pseudo CMB
>    nvmet: Introduce helper functions to allocate and free request SGLs
>    nvmet-rdma: Use new SGL alloc/free helper for requests
>    nvmet: Optionally use PCI P2P memory
>
>   Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>   Documentation/PCI/index.rst                |  14 +
>   Documentation/driver-api/index.rst         |   2 +-
>   Documentation/driver-api/pci/index.rst     |  20 +
>   Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
>   Documentation/driver-api/{ => pci}/pci.rst |   0
>   Documentation/index.rst                    |   3 +-
>   block/blk-core.c                           |   3 +
>   drivers/infiniband/core/rw.c               |  13 +-
>   drivers/nvme/host/core.c                   |   4 +
>   drivers/nvme/host/nvme.h                   |   8 +
>   drivers/nvme/host/pci.c                    | 118 +++--
>   drivers/nvme/target/configfs.c             |  67 +++
>   drivers/nvme/target/core.c                 | 143 ++++-
>   drivers/nvme/target/io-cmd.c               |   3 +
>   drivers/nvme/target/nvmet.h                |  15 +
>   drivers/nvme/target/rdma.c                 |  22 +-
>   drivers/pci/Kconfig                        |  26 +
>   drivers/pci/Makefile                       |   1 +
>   drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
>   drivers/pci/pci.c                          |   6 +
>   include/linux/blk_types.h                  |  18 +-
>   include/linux/blkdev.h                     |   3 +
>   include/linux/memremap.h                   |  19 +
>   include/linux/pci-p2pdma.h                 | 118 +++++
>   include/linux/pci.h                        |   4 +
>   26 files changed, 1579 insertions(+), 56 deletions(-)
>   create mode 100644 Documentation/PCI/index.rst
>   create mode 100644 Documentation/driver-api/pci/index.rst
>   create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>   rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>   create mode 100644 drivers/pci/p2pdma.c
>   create mode 100644 include/linux/pci-p2pdma.h
>
> --
> 2.11.0

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-02 11:51   ` Christian König
@ 2018-05-02 15:56     ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-02 15:56 UTC (permalink / raw)
  To: Christian König, linux-kernel, linux-pci, linux-nvme,
	linux-rdma, linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Hi Christian,

On 5/2/2018 5:51 AM, Christian König wrote:
> it would be rather nice to have if you could separate out the functions 
> to detect if peer2peer is possible between two devices.

This would essentially be pci_p2pdma_distance() in the existing 
patchset. It returns the sum of the distances between a list of clients 
and a P2PDMA provider. It returns -1 if peer2peer is not possible 
between the devices (presently this means they are not behind the same 
root port).
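
As a rough sketch of how a caller might use it (the signature below is
assumed from this description, not copied verbatim from the patches):

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/*
 * Sketch only: decide whether a P2PDMA provider is usable for a set of
 * clients.  A negative distance means at least one client cannot reach
 * the provider; a smaller sum means a topologically closer provider.
 */
static bool example_p2p_usable(struct pci_dev *provider,
                               struct list_head *clients)
{
        int dist = pci_p2pdma_distance(provider, clients);

        if (dist < 0)
                return false;   /* fall back to system memory */

        pr_debug("using P2P memory from %s, distance %d\n",
                 pci_name(provider), dist);
        return true;
}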

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-02 15:56     ` Logan Gunthorpe
@ 2018-05-03  9:05       ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-03  9:05 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On 02.05.2018 at 17:56, Logan Gunthorpe wrote:
> Hi Christian,
>
> On 5/2/2018 5:51 AM, Christian König wrote:
>> it would be rather nice to have if you could separate out the 
>> functions to detect if peer2peer is possible between two devices.
>
> This would essentially be pci_p2pdma_distance() in the existing 
> patchset. It returns the sum of the distance between a list of clients 
> and a P2PDMA provider. It returns -1 if peer2peer is not possible 
> between the devices (presently this means they are not behind the same 
> root port).

Ok, I'm still missing the big picture here. First question is what is 
the P2PDMA provider?

Second question is how do you want to handle things when devices are not 
behind the same root port (which is perfectly possible in the cases I 
deal with)?

Third question: why multiple clients? That feels a bit like you are 
pushing something specific to your use case into the common PCI 
subsystem, which usually isn't a good idea.



As far as I can see we need a function which returns the distance between 
an initiator and a target device. This function then returns -1 if the 
transaction can't be made and a positive value otherwise.

We also need to give the direction of the transaction and have a 
whitelist of root complex PCI IDs which can handle P2P transactions from 
different ports for a certain DMA direction.
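
Something along these lines, with purely hypothetical names and
signature, just to make the request concrete:

#include <linux/pci.h>
#include <linux/dma-direction.h>

/*
 * Hypothetical helper (not in the posted patchset): distance between a
 * single initiator/target pair for one DMA direction.  Returns -1 when
 * P2P is not possible (e.g. the root ports are not whitelisted) and a
 * positive hop count otherwise.
 */
int pcie_p2p_distance(struct pci_dev *initiator, struct pci_dev *target,
                      enum dma_data_direction dir);

/* A GPU driver could then check a single pairing: */
static bool example_gpu_can_write_to(struct pci_dev *gpu, struct pci_dev *peer)
{
        return pcie_p2p_distance(gpu, peer, DMA_TO_DEVICE) >= 0;
}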


Christian.

>
> Logan


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-03  9:05       ` Christian König
@ 2018-05-03 15:59         ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-03 15:59 UTC (permalink / raw)
  To: Christian König, linux-kernel, linux-pci, linux-nvme,
	linux-rdma, linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 03/05/18 03:05 AM, Christian König wrote:
> Ok, I'm still missing the big picture here. First question is what is 
> the P2PDMA provider?

Well there's some pretty good documentation in the patchset for this,
but in short, a provider is a device that provides some kind of P2P
resource (ie. BAR memory, or perhaps a doorbell register -- only memory
is supported at this time).

> Second question is how to you want to handle things when device are not 
> behind the same root port (which is perfectly possible in the cases I 
> deal with)?

I think we need to implement a whitelist. If both root ports are in the
whitelist and are on the same bus then we return a larger distance
instead of -1.
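
A minimal sketch of how that could fold into the distance calculation
(the whitelist table and helper below are illustrative, not taken from
the patchset):

#include <linux/pci.h>

/* Known-good root ports / host bridges, keyed by PCI vendor/device ID. */
static const struct pci_device_id p2p_root_whitelist[] = {
        /* { PCI_DEVICE(vendor, device) }, ... */
        { }
};

static int example_cross_port_distance(struct pci_dev *root_a, int dist_a,
                                       struct pci_dev *root_b, int dist_b)
{
        if (root_a->bus != root_b->bus)
                return -1;
        if (!pci_match_id(p2p_root_whitelist, root_a) ||
            !pci_match_id(p2p_root_whitelist, root_b))
                return -1;

        /* Penalize transfers that have to cross the host bridge. */
        return dist_a + dist_b + 2;
}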

> Third question why multiple clients? That feels a bit like you are 
> pushing something special to your use case into the common PCI 
> subsystem. Something which usually isn't a good idea.

No, I think this will be pretty standard. In the simple general case you
are going to have one provider and at least two clients (one which
writes the memory and one which reads it). However, one client is
likely, but not necessarily, the same as the provider.

In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block
devices. The code doesn't care which device provides the memory as it
could be the RDMA device or one/all of the block devices (or, in theory,
a completely separate device with P2P-able memory). However, it does
require that all devices involved are accessible per
pci_p2pdma_distance() or it won't use P2P transactions.
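
For reference, the flow looks roughly like this (signatures assumed from
the changelog descriptions above, not copied from the patches):

#include <linux/pci-p2pdma.h>
#include <linux/scatterlist.h>

/*
 * Sketch of the nvmet-style usage: pick a provider that every client can
 * reach, carve a scatterlist out of its P2P memory, and free it when the
 * transfer completes.
 */
static struct scatterlist *example_alloc_p2p_sgl(struct list_head *clients,
                                                 size_t len,
                                                 unsigned int *nents,
                                                 struct pci_dev **provider)
{
        *provider = pci_p2pmem_find(clients);
        if (!*provider)
                return NULL;    /* no usable provider: use system memory */

        return pci_p2pmem_alloc_sgl(*provider, nents, len);
}

static void example_free_p2p_sgl(struct pci_dev *provider,
                                 struct scatterlist *sgl)
{
        pci_p2pmem_free_sgl(provider, sgl);
}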

I could also imagine other use cases: ie. an RDMA NIC sends data to a
GPU for processing and then sends the data to an NVMe device for storage
(or vice-versa). In this case we have 3 clients and one provider.

> As far as I can see we need a function which return the distance between 
> a initiator and target device. This function then returns -1 if the 
> transaction can't be made and a positive value otherwise.

If you need to make a simpler convenience function for your use case I'm
not against it.

> We also need to give the direction of the transaction and have a 
> whitelist root complex PCI-IDs which can handle P2P transactions from 
> different ports for a certain DMA direction.

Yes. In the NVMeof case we need all devices to be able to DMA in both
directions so we did not need the DMA direction. But I can see this
being useful once we add the whitelist.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-03 15:59         ` Logan Gunthorpe
@ 2018-05-03 17:29           ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-03 17:29 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On 03.05.2018 at 17:59, Logan Gunthorpe wrote:
> On 03/05/18 03:05 AM, Christian König wrote:
>> Second question is how to you want to handle things when device are not
>> behind the same root port (which is perfectly possible in the cases I
>> deal with)?
> I think we need to implement a whitelist. If both root ports are in the
> white list and are on the same bus then we return a larger distance
> instead of -1.

Sounds good.

>> Third question why multiple clients? That feels a bit like you are
>> pushing something special to your use case into the common PCI
>> subsystem. Something which usually isn't a good idea.
> No, I think this will be pretty standard. In the simple general case you
> are going to have one provider and at least two clients (one which
> writes the memory and one which reads it). However, one client is
> likely, but not necessarily, the same as the provider.

Ok, that is the point where I'm stuck. Why do we need that in one 
function call in the PCIe subsystem?

The problem at least with GPUs is that we seriously don't have that 
information here, because the PCI subsystem might not be aware of all the 
interconnections.

For example it isn't uncommon to put multiple GPUs on one board. To the 
PCI subsystem that looks like separate devices, but in reality all GPUs 
are interconnected and can access each other's memory directly without 
going over the PCIe bus.

I seriously don't want to model that in the PCI subsystem, but rather 
in the driver. That's why it feels like a mistake to me to push all that 
into the PCI function.

> In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block
> devices. The code doesn't care which device provides the memory as it
> could be the RDMA device or one/all of the block devices (or, in theory,
> a completely separate device with P2P-able memory). However, it does
> require that all devices involved are accessible per
> pci_p2pdma_distance() or it won't use P2P transactions.
>
> I could also imagine other use cases: ie. an RDMA NIC sends data to a
> GPU for processing and then sends the data to an NVMe device for storage
> (or vice-versa). In this case we have 3 clients and one provider.

Why can't we model that as two separate transactions?

E.g. one from the RDMA NIC to the GPU memory. And another one from the 
GPU memory to the NVMe device.

That would also match how I get this information from userspace.

>> As far as I can see we need a function which returns the distance between
>> an initiator and a target device. This function then returns -1 if the
>> transaction can't be made and a positive value otherwise.
> If you need to make a simpler convenience function for your use case I'm
> not against it.

Yeah, same for me. If Bjorn is ok with that specialized NVM functions 
that I'm fine with that as well.

I think it would just be more convenient when we can come up with 
functions which can handle all use cases, cause there still seems to be 
a lot of similarities.

>
>> We also need to give the direction of the transaction and have a
>> whitelist of root complex PCI-IDs which can handle P2P transactions from
>> different ports for a certain DMA direction.
> Yes. In the NVMeof case we need all devices to be able to DMA in both
> directions so we did not need the DMA direction. But I can see this
> being useful once we add the whitelist.

Ok, I agree that can be added later on. For simplicity let's assume for 
now we always do bidirectional transfers.

Thanks for the explanation,
Christian.

>
> Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-03 17:29           ` Christian König
  (?)
  (?)
@ 2018-05-03 18:43             ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-03 18:43 UTC (permalink / raw)
  To: Christian König, linux-kernel, linux-pci, linux-nvme,
	linux-rdma, linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 03/05/18 11:29 AM, Christian König wrote:
> Ok, that is the point where I'm stuck. Why do we need that in one 
> function call in the PCIe subsystem?
> 
> The problem at least with GPUs is that we seriously don't have that 
> information here, cause the PCI subsystem might not be aware of all the 
> interconnections.
> 
> For example it isn't uncommon to put multiple GPUs on one board. To the 
> PCI subsystem that looks like separate devices, but in reality all GPUs 
> are interconnected and can access each others memory directly without 
> going over the PCIe bus.
> 
> I seriously don't want to model that in the PCI subsystem, but rather 
> the driver. That's why it feels like a mistake to me to push all that 
> into the PCI function.

Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
list to this API, if you want. If the driver is _sure_ they are all the
same, you only have to send one. In your terminology, you'd just have to
call the interface with:

pci_p2pdma_distance(target, [initiator, target])
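 
Spelled out in C (the exact prototype of pci_p2pdma_distance() is an
assumption here, for illustration only), that shorthand is roughly:

    /* Sketch only -- error handling and client-list cleanup elided. */
    static bool p2p_path_usable(struct pci_dev *initiator, struct pci_dev *target)
    {
            LIST_HEAD(clients);

            pci_p2pdma_add_client(&clients, &initiator->dev);
            pci_p2pdma_add_client(&clients, &target->dev);

            /* A negative distance means there is no usable P2P path. */
            return pci_p2pdma_distance(target, &clients) >= 0;
    }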

> Why can't we model that as two separate transactions?

You could, but this is more convenient for users of the API that need to
deal with multiple devices (and manage devices that may be added or
removed at any time).

> Yeah, same for me. If Bjorn is ok with that specialized NVM functions 
> that I'm fine with that as well.
> 
> I think it would just be more convenient when we can come up with 
> functions which can handle all use cases, cause there still seems to be 
> a lot of similarities.

The way it's implemented is more general and can handle all use cases.
You are arguing for a function that can handle your case (albeit with a
bit more fuss) but can't handle mine and is therefore less general.
Calling my interface specialized is wrong.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-03 18:43             ` Logan Gunthorpe
  (?)
  (?)
@ 2018-05-04 14:27               ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-04 14:27 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On 03.05.2018 at 20:43, Logan Gunthorpe wrote:
>
> On 03/05/18 11:29 AM, Christian König wrote:
>> Ok, that is the point where I'm stuck. Why do we need that in one
>> function call in the PCIe subsystem?
>>
>> The problem at least with GPUs is that we seriously don't have that
>> information here, cause the PCI subsystem might not be aware of all the
>> interconnections.
>>
>> For example it isn't uncommon to put multiple GPUs on one board. To the
>> PCI subsystem that looks like separate devices, but in reality all GPUs
>> are interconnected and can access each others memory directly without
>> going over the PCIe bus.
>>
>> I seriously don't want to model that in the PCI subsystem, but rather
>> the driver. That's why it feels like a mistake to me to push all that
>> into the PCI function.
> Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
> list to this API, if you want. If the driver is _sure_ they are all the
> same, you only have to send one. In your terminology, you'd just have to
> call the interface with:
>
> pci_p2pdma_distance(target, [initiator, target])

Ok, I expected that something like that would do it.

So just to confirm: When I have a bunch of GPUs which could be the 
initiator I only need to do "pci_p2pdma_distance(target, [first GPU, 
target]);" and not "pci_p2pdma_distance(target, [first GPU, second GPU, 
third GPU, fourth...., target])" ?

>> Why can't we model that as two separate transactions?
> You could, but this is more convenient for users of the API that need to
> deal with multiple devices (and manage devices that may be added or
> removed at any time).

Are you sure that this is more convenient? At least on first glance it 
feels overly complicated.

I mean what's the difference between the two approaches?

     sum = pci_p2pdma_distance(target, [A, B, C, target]);

and

     sum = pci_p2pdma_distance(target, A);
     sum += pci_p2pdma_distance(target, B);
     sum += pci_p2pdma_distance(target, C);

>> Yeah, same for me. If Bjorn is ok with that specialized NVM functions
>> that I'm fine with that as well.
>>
>> I think it would just be more convenient when we can come up with
>> functions which can handle all use cases, cause there still seems to be
>> a lot of similarities.
> The way it's implemented is more general and can handle all use cases.
> You are arguing for a function that can handle your case (albeit with a
> bit more fuss) but can't handle mine and is therefore less general.
> Calling my interface specialized is wrong.

Well at the end of the day you only need to convince Bjorn of the 
interface, so I'm perfectly fine with it as long as it serves my use 
case as well :)

But I still would like to understand your intention, cause that really 
helps not to accidentally break something in the long term.

Now when I take a look at the pure PCI hardware level, what I have is a 
transaction between an initiator and a target, and not multiple devices 
in one operation.

I mean you must have a very good reason for wanting to deal with 
multiple devices in the software layer, but that reason doesn't become 
obvious to me from either the code or your explanation.

Thanks,
Christian.

>
> Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-04 14:27               ` Christian König
  (?)
  (?)
@ 2018-05-04 15:52                 ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-04 15:52 UTC (permalink / raw)
  To: Christian König, linux-kernel, linux-pci, linux-nvme,
	linux-rdma, linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 04/05/18 08:27 AM, Christian König wrote:
> Are you sure that this is more convenient? At least on first glance it 
> feels overly complicated.
> 
> I mean what's the difference between the two approaches?
> 
>      sum = pci_p2pdma_distance(target, [A, B, C, target]);
> 
> and
> 
>      sum = pci_p2pdma_distance(target, A);
>      sum += pci_p2pdma_distance(target, B);
>      sum += pci_p2pdma_distance(target, C);

Well, it's more for consistency with pci_p2pdma_find(), which has to
take a list of devices to find a resource that matches all of them.
(You can't use multiple calls in that case because all the devices in
the list might not have the same set of compatible providers.) That way
we can use the same list to check the distance (when the user specifies
a device) as we do to find a compatible device (when the user wants to
automatically find one).
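 
In other words (sketching with assumed prototypes, since the exact
signatures aren't spelled out in this thread), one client list can serve
both paths:

    /* Sketch only: one list for both the explicit and the automatic case. */
    static struct pci_dev *resolve_provider(struct list_head *clients,
                                            struct pci_dev *requested)
    {
            if (requested) {
                    /* User named a provider: verify every client can reach it. */
                    if (pci_p2pdma_distance(requested, clients) < 0)
                            return NULL;
                    return requested;
            }

            /* Otherwise search for a provider compatible with all clients. */
            return pci_p2pmem_find(clients);
    }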

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory
  2018-04-23 23:30   ` Logan Gunthorpe
                       ` (2 preceding siblings ...)
  (?)
@ 2018-05-07 23:00     ` Bjorn Helgaas
  -1 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:00 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Christian König,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Mon, Apr 23, 2018 at 05:30:33PM -0600, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in peer-to-peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
> 
> Add an interface for other subsystems to find and allocate chunks of P2P
> memory as necessary to facilitate transfers between two PCI peers:
> 
> int pci_p2pdma_add_client();
> struct pci_dev *pci_p2pmem_find();
> void *pci_alloc_p2pmem();
> 
> The new interface requires a driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary (adding incompatible clients
> will fail). With a suitable p2pmem device, memory can then be
> allocated with pci_alloc_p2pmem() for use in DMA transactions.
> 
> Depending on hardware, using peer-to-peer memory may reduce the bandwidth
> of the transfer but can significantly reduce pressure on system memory.
> This may be desirable in many cases: for example a system could be designed
> with a small CPU connected to a PCI switch by a small number of lanes

s/PCI/PCIe/

> which would maximize the number of lanes available to connect to NVMe
> devices.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same root port (typically through

s/root port/PCI bridge/

> a network of PCIe switches). This is because we have no way of knowing
> whether peer-to-peer routing between PCIe Root Ports is supported
> (PCIe r4.0, sec 1.3.1).  Additionally, the benefits of P2P transfers that
> go through the RC is limited to only reducing DRAM usage and, in some
> cases, coding convenience. The PCI-SIG may be exploring adding a new
> capability bit to advertise whether this is possible for future
> hardware.
> 
> This commit includes significant rework and feedback from Christoph
> Hellwig.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  17 ++
>  drivers/pci/Makefile       |   1 +
>  drivers/pci/p2pdma.c       | 694 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/memremap.h   |  18 ++
>  include/linux/pci-p2pdma.h | 100 +++++++
>  include/linux/pci.h        |   4 +
>  6 files changed, 834 insertions(+)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..b2396c22b53e 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,23 @@ config PCI_PASID
>  
>  	  If unsure, say N.
>  
> +config PCI_P2PDMA
> +	bool "PCI peer-to-peer transfer support"
> +	depends on PCI && ZONE_DEVICE && EXPERT
> +	select GENERIC_ALLOCATOR
> +	help
> +	  Enableѕ drivers to do PCI peer-to-peer transactions to and from
> +	  BARs that are exposed in other devices that are the part of
> +	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +	  specification to work (ie. anything below a single PCI bridge).
> +
> +	  Many PCIe root complexes do not support P2P transactions and
> +	  it's hard to tell which support it at all, so at this time, DMA
> +	  transations must be between devices behind the same root port.

s/DMA transactions/PCIe DMA transactions/

(Theoretically P2P should work on conventional PCI, and this sentence only
applies to PCIe.)

> +	  (Typically behind a network of PCIe switches).

Not sure this last sentence adds useful information.

> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,694 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI Peer 2 Peer DMA support.
> + *
> + * Copyright (c) 2016-2018, Logan Gunthorpe
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig
> + * Copyright (c) 2018, Eideticom Inc.
> + *

Nit: unnecessary blank line.

> +/*
> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge():
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.
> + */
> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)

This function doesn't seem to be used anymore.  Thanks for all your hard
work to get rid of it!

> +{
> +	struct pci_dev *up1, *up2;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> +	if (!up1)
> +		return NULL;
> +
> +	up2 = pci_dev_get(pci_upstream_bridge(up1));
> +	pci_dev_put(up1);
> +
> +	return up2;
> +}
> +
> +/*
> + * Find the distance through the nearest common upstream bridge between
> + * two PCI devices.
> + *
> + * If the two devices are the same device then 0 will be returned.
> + *
> + * If there are two virtual functions of the same device behind the same
> + * bridge port then 2 will be returned (one step down to the bridge then

s/bridge/PCIe switch/

> + * one step back to the same device).
> + *
> + * In the case where two devices are connected to the same PCIe switch, the
> + * value 4 will be returned. This corresponds to the following PCI tree:
> + *
> + *     -+  Root Port
> + *      \+ Switch Upstream Port
> + *       +-+ Switch Downstream Port
> + *       + \- Device A
> + *       \-+ Switch Downstream Port
> + *         \- Device B
> + *
> + * The distance is 4 because we traverse from Device A through the downstream
> + * port of the switch, to the common upstream port, back up to the second
> + * downstream port and then to Device B.
> + *
> + * Any two devices that don't have a common upstream bridge will return -1.
> + * In this way devices on seperate root ports will be rejected, which

s/seperate/separate/
s/root port/PCIe root ports/
(Again, since P2P should work on conventional PCI)

> + * is what we want for peer-to-peer seeing there's no way to determine
> + * if the root complex supports forwarding between root ports.

s/seeing there's no way.../
  seeing each PCIe root port defines a separate hierarchy domain and
  there's no way to determine whether the root complex supports forwarding
  between them./

> + *
> + * In the case where two devices are connected to different PCIe switches
> + * this function will still return a positive distance as long as both
> + * switches evenutally have a common upstream bridge. Note this covers
> + * the case of using multiple PCIe switches to achieve a desired level of
> + * fan-out from a root port. The exact distance will be a function of the
> + * number of switches between Device A and Device B.
> + *

Nit: unnecessary blank line.

> + */
> +static int upstream_bridge_distance(struct pci_dev *a,
> +				    struct pci_dev *b)
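 
For reference, a simplified approximation of the walk that comment describes
(written from the description above, not taken from the patch itself):

    static int upstream_bridge_distance_sketch(struct pci_dev *a,
                                               struct pci_dev *b)
    {
            struct pci_dev *bridge;
            int dist_a = 0;

            /* Walk up from A; for each ancestor, see how far below it B sits. */
            for (bridge = a; bridge; bridge = pci_upstream_bridge(bridge), dist_a++) {
                    struct pci_dev *p;
                    int dist_b = 0;

                    for (p = b; p; p = pci_upstream_bridge(p), dist_b++) {
                            if (p == bridge)
                                    return dist_a + dist_b;
                    }
            }

            return -1;      /* no common upstream bridge */
    }

This yields 0 for the same device, 2 for two functions below the same bridge
port and 4 for two devices below the same switch, matching the examples in
the comment.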

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory
@ 2018-05-07 23:00     ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:00 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt, Alex Williamson, Christian König

On Mon, Apr 23, 2018 at 05:30:33PM -0600, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in peer-to-peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
> 
> Add an interface for other subsystems to find and allocate chunks of P2P
> memory as necessary to facilitate transfers between two PCI peers:
> 
> int pci_p2pdma_add_client();
> struct pci_dev *pci_p2pmem_find();
> void *pci_alloc_p2pmem();
> 
> The new interface requires a driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary (adding incompatible clients
> will fail). With a suitable p2pmem device, memory can then be
> allocated with pci_alloc_p2pmem() for use in DMA transactions.
> 
> Depending on hardware, using peer-to-peer memory may reduce the bandwidth
> of the transfer but can significantly reduce pressure on system memory.
> This may be desirable in many cases: for example a system could be designed
> with a small CPU connected to a PCI switch by a small number of lanes

s/PCI/PCIe/

> which would maximize the number of lanes available to connect to NVMe
> devices.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same root port (typically through

s/root port/PCI bridge/

> a network of PCIe switches). This is because we have no way of knowing
> whether peer-to-peer routing between PCIe Root Ports is supported
> (PCIe r4.0, sec 1.3.1).  Additionally, the benefits of P2P transfers that
> go through the RC is limited to only reducing DRAM usage and, in some
> cases, coding convenience. The PCI-SIG may be exploring adding a new
> capability bit to advertise whether this is possible for future
> hardware.
> 
> This commit includes significant rework and feedback from Christoph
> Hellwig.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  17 ++
>  drivers/pci/Makefile       |   1 +
>  drivers/pci/p2pdma.c       | 694 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/memremap.h   |  18 ++
>  include/linux/pci-p2pdma.h | 100 +++++++
>  include/linux/pci.h        |   4 +
>  6 files changed, 834 insertions(+)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..b2396c22b53e 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,23 @@ config PCI_PASID
>  
>  	  If unsure, say N.
>  
> +config PCI_P2PDMA
> +	bool "PCI peer-to-peer transfer support"
> +	depends on PCI && ZONE_DEVICE && EXPERT
> +	select GENERIC_ALLOCATOR
> +	help
> +	  Enableѕ drivers to do PCI peer-to-peer transactions to and from
> +	  BARs that are exposed in other devices that are the part of
> +	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +	  specification to work (ie. anything below a single PCI bridge).
> +
> +	  Many PCIe root complexes do not support P2P transactions and
> +	  it's hard to tell which support it at all, so at this time, DMA
> +	  transations must be between devices behind the same root port.

s/DMA transactions/PCIe DMA transactions/

(Theoretically P2P should work on conventional PCI, and this sentence only
applies to PCIe.)

> +	  (Typically behind a network of PCIe switches).

Not sure this last sentence adds useful information.

> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,694 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI Peer 2 Peer DMA support.
> + *
> + * Copyright (c) 2016-2018, Logan Gunthorpe
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig
> + * Copyright (c) 2018, Eideticom Inc.
> + *

Nit: unnecessary blank line.

> +/*
> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge():
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.
> + */
> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)

This function doesn't seem to be used anymore.  Thanks for all your hard
work to get rid of it!

> +{
> +	struct pci_dev *up1, *up2;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> +	if (!up1)
> +		return NULL;
> +
> +	up2 = pci_dev_get(pci_upstream_bridge(up1));
> +	pci_dev_put(up1);
> +
> +	return up2;
> +}
> +
> +/*
> + * Find the distance through the nearest common upstream bridge between
> + * two PCI devices.
> + *
> + * If the two devices are the same device then 0 will be returned.
> + *
> + * If there are two virtual functions of the same device behind the same
> + * bridge port then 2 will be returned (one step down to the bridge then

s/bridge/PCIe switch/

> + * one step back to the same device).
> + *
> + * In the case where two devices are connected to the same PCIe switch, the
> + * value 4 will be returned. This corresponds to the following PCI tree:
> + *
> + *     -+  Root Port
> + *      \+ Switch Upstream Port
> + *       +-+ Switch Downstream Port
> + *       + \- Device A
> + *       \-+ Switch Downstream Port
> + *         \- Device B
> + *
> + * The distance is 4 because we traverse from Device A through the downstream
> + * port of the switch, to the common upstream port, back up to the second
> + * downstream port and then to Device B.
> + *
> + * Any two devices that don't have a common upstream bridge will return -1.
> + * In this way devices on seperate root ports will be rejected, which

s/seperate/separate/
s/root port/PCIe root ports/
(Again, since P2P should work on conventional PCI)

> + * is what we want for peer-to-peer seeing there's no way to determine
> + * if the root complex supports forwarding between root ports.

s/seeing there's no way.../
  seeing each PCIe root port defines a separate hierarchy domain and
  there's no way to determine whether the root complex supports forwarding
  between them./

> + *
> + * In the case where two devices are connected to different PCIe switches
> + * this function will still return a positive distance as long as both
> + * switches evenutally have a common upstream bridge. Note this covers
> + * the case of using multiple PCIe switches to achieve a desired level of
> + * fan-out from a root port. The exact distance will be a function of the
> + * number of switches between Device A and Device B.
> + *

Nit: unnecessary blank line.

> + */
> +static int upstream_bridge_distance(struct pci_dev *a,
> +				    struct pci_dev *b)
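(Aside, purely for illustration; this is not the patch's implementation. The
distance described in the comment above could be computed by walking up from
one device a level at a time and, at each level, scanning the other device's
chain of upstream bridges for a match. Reference counting and any provider
bookkeeping are ignored here.)

	static int example_bridge_distance(struct pci_dev *a, struct pci_dev *b)
	{
		struct pci_dev *bridge;
		int dist_a, dist_b;

		/* Walk up from b; for each ancestor, scan a's upstream chain. */
		for (dist_b = 0; b; b = pci_upstream_bridge(b), dist_b++) {
			dist_a = 0;
			for (bridge = a; bridge;
			     bridge = pci_upstream_bridge(bridge), dist_a++) {
				if (bridge == b)
					return dist_a + dist_b;
			}
		}

		return -1;	/* no common upstream bridge */
	}
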

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 03/14] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
  2018-04-23 23:30   ` Logan Gunthorpe
                       ` (2 preceding siblings ...)
  (?)
@ 2018-05-07 23:02     ` Bjorn Helgaas
  -1 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:02 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Christian König,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

s/dma/DMA/ (in subject)

On Mon, Apr 23, 2018 at 05:30:35PM -0600, Logan Gunthorpe wrote:
> The DMA address used when mapping PCI P2P memory must be the PCI bus
> address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
> addresses when using P2P memory.
> 
> For this, we assume that an SGL passed to these functions contain all
> P2P memory or no P2P memory.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
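
For illustration only (not this patch's code), the core of the bus-offset idea
could look roughly like this. The provider structure and its bus_offset field
are hypothetical stand-ins for however the offset between CPU physical and PCI
bus addresses gets recorded when the BAR memory is registered, and, as the
commit message says, the whole SGL is assumed to be P2P memory:

	#include <linux/pci.h>
	#include <linux/scatterlist.h>

	/* Hypothetical bookkeeping, not the patch's actual structures. */
	struct example_p2p_provider {
		s64 bus_offset;	/* bus address minus CPU physical address */
	};

	static int example_p2p_map_sg(struct example_p2p_provider *p,
				      struct scatterlist *sgl, int nents)
	{
		struct scatterlist *sg;
		int i;

		for_each_sg(sgl, sg, nents, i) {
			/* P2P pages map to PCI bus addresses, not to an IOVA. */
			sg->dma_address = sg_phys(sg) + p->bus_offset;
			sg_dma_len(sg) = sg->length;
		}

		return nents;
	}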

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory
  2018-05-07 23:00     ` Bjorn Helgaas
                         ` (2 preceding siblings ...)
  (?)
@ 2018-05-07 23:09       ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-07 23:09 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Christian König,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

Thanks for the review. I'll apply all of these changes for the next
version of the set.
>> +/*
>> + * If a device is behind a switch, we try to find the upstream bridge
>> + * port of the switch. This requires two calls to pci_upstream_bridge():
>> + * one for the upstream port on the switch, one on the upstream port
>> + * for the next level in the hierarchy. Because of this, devices connected
>> + * to the root port will be rejected.
>> + */
>> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> 
> This function doesn't seem to be used anymore.  Thanks for all your hard
> work to get rid of it!

Oops, I thought I had gotten rid of it entirely, but I guess I messed it
up a bit and it gets removed in patch 4. I'll fix it for v5.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-04-23 23:30   ` Logan Gunthorpe
                       ` (2 preceding siblings ...)
  (?)
@ 2018-05-07 23:13     ` Bjorn Helgaas
  -1 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:13 UTC (permalink / raw)
  To: Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Christian König, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and 
I don't know if there are other places we would care?

On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
> 
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
> 
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
> 
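(Aside, not part of the patch: since the redirect bits are only cleared at
enumeration time, a driver or debug path could double-check a device's
upstream path with the existing pci_acs_path_enabled() helper from the PCI
core; the wrapper name below is invented for illustration.)

	/* Returns true unless every bridge from @pdev up to the root still
	 * has P2P Request/Completion Redirect enabled;
	 * pci_acs_path_enabled() checks the whole path. */
	static bool example_p2p_redirect_cleared(struct pci_dev *pdev)
	{
		return !pci_acs_path_enabled(pdev, NULL,
					     PCI_ACS_RR | PCI_ACS_CR);
	}
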
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>  	  transations must be between devices behind the same root port.
>  	  (Typically behind a network of PCIe switches).
>  
> +	  Enabling this option will also disable ACS on all ports behind
> +	  any PCIe switch. This effectively puts all devices behind any
> +	  switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)

> +	  individual devices behind any switch will not be able to be
> +	  assigned to separate VMs because there is no isolation between
> +	  them. Additionally, any malicious PCIe devices will be able to
> +	  DMA to memory exposed by other EPs in the same domain as TLPs
> +	  will not be checked by the IOMMU.
> +
>  	  If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ed9dce8552a2..e9f43b43acac 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>  }
>  
>  /*
> - * If a device is behind a switch, we try to find the upstream bridge
> - * port of the switch. This requires two calls to pci_upstream_bridge():
> - * one for the upstream port on the switch, one on the upstream port
> - * for the next level in the hierarchy. Because of this, devices connected
> - * to the root port will be rejected.
> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
> + * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)

> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any bridge to be in the same IOMMU
> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>   */
> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>  {
> -	struct pci_dev *up1, *up2;
> +	int pos;
> +	u16 ctrl;
>  
> -	if (!pdev)
> -		return NULL;
> +	if (!pci_is_bridge(pdev))
> +		return 0;
>  
> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> -	if (!up1)
> -		return NULL;
> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> +	if (!pos)
> +		return 0;
> +
> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>  
> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
> -	pci_dev_put(up1);
> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>  
> -	return up2;
> +	return 1;
>  }
>  
>  /*
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e597655a5643..7e2f5724ba22 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -16,6 +16,7 @@
>  #include <linux/of.h>
>  #include <linux/of_pci.h>
>  #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
>  #include <linux/pm.h>
>  #include <linux/slab.h>
>  #include <linux/module.h>
> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>   */
>  void pci_enable_acs(struct pci_dev *dev)
>  {
> +#ifdef CONFIG_PCI_P2PDMA
> +	if (pci_p2pdma_disable_acs(dev))
> +		return;
> +#endif
> +
>  	if (!pci_acs_enable)
>  		return;
>  
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 0cde88341eeb..fcb3437a2f3c 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -18,6 +18,7 @@ struct block_device;
>  struct scatterlist;
>  
>  #ifdef CONFIG_PCI_P2PDMA
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  		u64 offset);
>  int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>  void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>  			 enum dma_data_direction dir);
>  #else /* CONFIG_PCI_P2PDMA */
> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> +	return 0;
> +}
>  static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>  		size_t size, u64 offset)
>  {
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-07 23:13     ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:13 UTC (permalink / raw)
  To: Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Christian König, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration time, and
I don't know whether there are other places where we would care.
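
For illustration only (this snippet is not part of Logan's series): a
minimal sketch of how one could verify, at any point after enumeration,
that no bridge upstream of a given device would still redirect P2P TLPs.
It uses only existing PCI core helpers and the ACS register definitions
this patch touches:

static bool p2p_redirect_clear_upstream(struct pci_dev *pdev)
{
	struct pci_dev *bridge;
	u16 ctrl;
	int pos;

	/* Walk every bridge between the device and the root complex. */
	for (bridge = pci_upstream_bridge(pdev); bridge;
	     bridge = pci_upstream_bridge(bridge)) {
		pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
		if (!pos)
			continue;	/* No ACS capability, nothing redirects. */

		pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
		if (ctrl & (PCI_ACS_RR | PCI_ACS_CR))
			return false;	/* This bridge still redirects P2P TLPs upstream. */
	}

	return true;
}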

On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
> 
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
> 
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
> 
> Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>  	  transations must be between devices behind the same root port.
>  	  (Typically behind a network of PCIe switches).
>  
> +	  Enabling this option will also disable ACS on all ports behind
> +	  any PCIe switch. This effectively puts all devices behind any
> +	  switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)

> +	  individual devices behind any switch will not be able to be
> +	  assigned to separate VMs because there is no isolation between
> +	  them. Additionally, any malicious PCIe devices will be able to
> +	  DMA to memory exposed by other EPs in the same domain as TLPs
> +	  will not be checked by the IOMMU.
> +
>  	  If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ed9dce8552a2..e9f43b43acac 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>  }
>  
>  /*
> - * If a device is behind a switch, we try to find the upstream bridge
> - * port of the switch. This requires two calls to pci_upstream_bridge():
> - * one for the upstream port on the switch, one on the upstream port
> - * for the next level in the hierarchy. Because of this, devices connected
> - * to the root port will be rejected.
> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
> + * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)

> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any bridge to be in the same IOMMU
> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>   */
> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>  {
> -	struct pci_dev *up1, *up2;
> +	int pos;
> +	u16 ctrl;
>  
> -	if (!pdev)
> -		return NULL;
> +	if (!pci_is_bridge(pdev))
> +		return 0;
>  
> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> -	if (!up1)
> -		return NULL;
> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> +	if (!pos)
> +		return 0;
> +
> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>  
> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
> -	pci_dev_put(up1);
> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>  
> -	return up2;
> +	return 1;
>  }
>  
>  /*
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e597655a5643..7e2f5724ba22 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -16,6 +16,7 @@
>  #include <linux/of.h>
>  #include <linux/of_pci.h>
>  #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
>  #include <linux/pm.h>
>  #include <linux/slab.h>
>  #include <linux/module.h>
> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>   */
>  void pci_enable_acs(struct pci_dev *dev)
>  {
> +#ifdef CONFIG_PCI_P2PDMA
> +	if (pci_p2pdma_disable_acs(dev))
> +		return;
> +#endif
> +
>  	if (!pci_acs_enable)
>  		return;
>  
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 0cde88341eeb..fcb3437a2f3c 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -18,6 +18,7 @@ struct block_device;
>  struct scatterlist;
>  
>  #ifdef CONFIG_PCI_P2PDMA
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  		u64 offset);
>  int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>  void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>  			 enum dma_data_direction dir);
>  #else /* CONFIG_PCI_P2PDMA */
> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> +	return 0;
> +}
>  static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>  		size_t size, u64 offset)
>  {
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-07 23:13     ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:13 UTC (permalink / raw)
  To: Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, Stephen Bates, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Christian König,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Dan Williams, Christoph Hellwig

[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration time, and
I don't know whether there are other places where we would care.

On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
> 
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
> 
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>  	  transations must be between devices behind the same root port.
>  	  (Typically behind a network of PCIe switches).
>  
> +	  Enabling this option will also disable ACS on all ports behind
> +	  any PCIe switch. This effectively puts all devices behind any
> +	  switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)

> +	  individual devices behind any switch will not be able to be
> +	  assigned to separate VMs because there is no isolation between
> +	  them. Additionally, any malicious PCIe devices will be able to
> +	  DMA to memory exposed by other EPs in the same domain as TLPs
> +	  will not be checked by the IOMMU.
> +
>  	  If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ed9dce8552a2..e9f43b43acac 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>  }
>  
>  /*
> - * If a device is behind a switch, we try to find the upstream bridge
> - * port of the switch. This requires two calls to pci_upstream_bridge():
> - * one for the upstream port on the switch, one on the upstream port
> - * for the next level in the hierarchy. Because of this, devices connected
> - * to the root port will be rejected.
> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
> + * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)

> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any bridge to be in the same IOMMU
> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>   */
> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>  {
> -	struct pci_dev *up1, *up2;
> +	int pos;
> +	u16 ctrl;
>  
> -	if (!pdev)
> -		return NULL;
> +	if (!pci_is_bridge(pdev))
> +		return 0;
>  
> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> -	if (!up1)
> -		return NULL;
> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> +	if (!pos)
> +		return 0;
> +
> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>  
> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
> -	pci_dev_put(up1);
> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>  
> -	return up2;
> +	return 1;
>  }
>  
>  /*
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e597655a5643..7e2f5724ba22 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -16,6 +16,7 @@
>  #include <linux/of.h>
>  #include <linux/of_pci.h>
>  #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
>  #include <linux/pm.h>
>  #include <linux/slab.h>
>  #include <linux/module.h>
> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>   */
>  void pci_enable_acs(struct pci_dev *dev)
>  {
> +#ifdef CONFIG_PCI_P2PDMA
> +	if (pci_p2pdma_disable_acs(dev))
> +		return;
> +#endif
> +
>  	if (!pci_acs_enable)
>  		return;
>  
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 0cde88341eeb..fcb3437a2f3c 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -18,6 +18,7 @@ struct block_device;
>  struct scatterlist;
>  
>  #ifdef CONFIG_PCI_P2PDMA
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  		u64 offset);
>  int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>  void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>  			 enum dma_data_direction dir);
>  #else /* CONFIG_PCI_P2PDMA */
> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> +	return 0;
> +}
>  static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>  		size_t size, u64 offset)
>  {
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-07 23:13     ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:13 UTC (permalink / raw)


[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration time, and
I don't know whether there are other places where we would care.

On Mon, Apr 23, 2018@05:30:36PM -0600, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
> 
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
> 
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
> 
> Signed-off-by: Logan Gunthorpe <logang at deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>  	  transations must be between devices behind the same root port.
>  	  (Typically behind a network of PCIe switches).
>  
> +	  Enabling this option will also disable ACS on all ports behind
> +	  any PCIe switch. This effectively puts all devices behind any
> +	  switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)

> +	  individual devices behind any switch will not be able to be
> +	  assigned to separate VMs because there is no isolation between
> +	  them. Additionally, any malicious PCIe devices will be able to
> +	  DMA to memory exposed by other EPs in the same domain as TLPs
> +	  will not be checked by the IOMMU.
> +
>  	  If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ed9dce8552a2..e9f43b43acac 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>  }
>  
>  /*
> - * If a device is behind a switch, we try to find the upstream bridge
> - * port of the switch. This requires two calls to pci_upstream_bridge():
> - * one for the upstream port on the switch, one on the upstream port
> - * for the next level in the hierarchy. Because of this, devices connected
> - * to the root port will be rejected.
> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
> + * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)

> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any bridge to be in the same IOMMU
> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>   */
> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>  {
> -	struct pci_dev *up1, *up2;
> +	int pos;
> +	u16 ctrl;
>  
> -	if (!pdev)
> -		return NULL;
> +	if (!pci_is_bridge(pdev))
> +		return 0;
>  
> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> -	if (!up1)
> -		return NULL;
> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> +	if (!pos)
> +		return 0;
> +
> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>  
> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
> -	pci_dev_put(up1);
> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>  
> -	return up2;
> +	return 1;
>  }
>  
>  /*
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e597655a5643..7e2f5724ba22 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -16,6 +16,7 @@
>  #include <linux/of.h>
>  #include <linux/of_pci.h>
>  #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
>  #include <linux/pm.h>
>  #include <linux/slab.h>
>  #include <linux/module.h>
> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>   */
>  void pci_enable_acs(struct pci_dev *dev)
>  {
> +#ifdef CONFIG_PCI_P2PDMA
> +	if (pci_p2pdma_disable_acs(dev))
> +		return;
> +#endif
> +
>  	if (!pci_acs_enable)
>  		return;
>  
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 0cde88341eeb..fcb3437a2f3c 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -18,6 +18,7 @@ struct block_device;
>  struct scatterlist;
>  
>  #ifdef CONFIG_PCI_P2PDMA
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  		u64 offset);
>  int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>  void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>  			 enum dma_data_direction dir);
>  #else /* CONFIG_PCI_P2PDMA */
> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> +	return 0;
> +}
>  static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>  		size_t size, u64 offset)
>  {
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation
  2018-04-23 23:30   ` Logan Gunthorpe
  (?)
  (?)
@ 2018-05-07 23:20     ` Bjorn Helgaas
  -1 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:20 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Jonathan Corbet, linux-kernel, linux-nvme,
	linux-block, Jérôme Glisse, Jason Gunthorpe,
	Christian König, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Mon, Apr 23, 2018 at 05:30:38PM -0600, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> ---
>  Documentation/PCI/index.rst             |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 ++++++++++++++++++++++++++++++++
>  Documentation/index.rst                 |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> 
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> new file mode 100644
> index 000000000000..2fdc4b3c291d
> --- /dev/null
> +++ b/Documentation/PCI/index.rst
> @@ -0,0 +1,14 @@
> +==================================
> +Linux PCI Driver Developer's Guide
> +==================================
> +
> +.. toctree::
> +
> +   p2pdma
> +
> +.. only::  subproject and html
> +
> +   Indices
> +   =======
> +
> +   * :ref:`genindex`
> diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
> index 03b57cbf8cc2..d12eeafbfc90 100644
> --- a/Documentation/driver-api/pci/index.rst
> +++ b/Documentation/driver-api/pci/index.rst
> @@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
>     :maxdepth: 2
>  
>     pci
> +   p2pdma
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index 000000000000..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +============================
> +PCI Peer-to-Peer DMA Support
> +============================
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is

s/endpoints/devices/

> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required

s/PCI Root Complexes .../
  PCI doesn't require forwarding transactions between hierarchy domains,
and in PCIe, each Root Port defines a separate hierarchy domain./

> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.

s/endpoints involved .../
  devices involved are all behind the same PCI bridge, as such devices are
  all in the same PCI hierarchy domain, and the spec guarantees that all
  transactions within the hierarchy will be routable, but it does not
  require routing between hierarchies./

> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.
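
As a usage illustration only (not part of the documentation under
review): a sketch of the provider side, publishing a device BAR as P2P
memory via the pci_p2pdma_add_resource() prototype quoted earlier in
this thread.  The choice of BAR 4 and of exposing the whole BAR are
assumptions made up for the example, not anything this series mandates:

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

static int example_publish_p2pmem(struct pci_dev *pdev)
{
	int rc;

	/* Expose all of BAR 4 (an arbitrary choice here) as p2pmem. */
	rc = pci_p2pdma_add_resource(pdev, 4, pci_resource_len(pdev, 4), 0);
	if (rc)
		dev_err(&pdev->dev, "failed to register P2P memory: %d\n", rc);

	return rc;
}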

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation
@ 2018-05-07 23:20     ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:20 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt, Alex Williamson, Christian König,
	Jonathan Corbet

On Mon, Apr 23, 2018 at 05:30:38PM -0600, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> ---
>  Documentation/PCI/index.rst             |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 ++++++++++++++++++++++++++++++++
>  Documentation/index.rst                 |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> 
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> new file mode 100644
> index 000000000000..2fdc4b3c291d
> --- /dev/null
> +++ b/Documentation/PCI/index.rst
> @@ -0,0 +1,14 @@
> +==================================
> +Linux PCI Driver Developer's Guide
> +==================================
> +
> +.. toctree::
> +
> +   p2pdma
> +
> +.. only::  subproject and html
> +
> +   Indices
> +   =======
> +
> +   * :ref:`genindex`
> diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
> index 03b57cbf8cc2..d12eeafbfc90 100644
> --- a/Documentation/driver-api/pci/index.rst
> +++ b/Documentation/driver-api/pci/index.rst
> @@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
>     :maxdepth: 2
>  
>     pci
> +   p2pdma
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index 000000000000..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +============================
> +PCI Peer-to-Peer DMA Support
> +============================
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is

s/endpoints/devices/

> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required

s/PCI Root Complexes .../
  PCI doesn't require forwarding transactions between hierarchy domains,
and in PCIe, each Root Port defines a separate hierarchy domain./

> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.

s/endpoints involved .../
  devices involved are all behind the same PCI bridge, as such devices are
  all in the same PCI hierarchy domain, and the spec guarantees that all
  transactions within the hierarchy will be routable, but it does not
  require routing between hierarchies./

> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.
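
Purely for illustration (not from the patch set): the matching client
side, built only from the prototypes quoted from
include/linux/pci-p2pdma.h in this thread.  Finding a provider and
allocating the P2P memory itself (pci_p2pmem_find() /
pci_p2pmem_alloc_sgl()) is left out because those signatures are not
quoted in this message, and error-path cleanup of the client list is
omitted for brevity:

#include <linux/list.h>
#include <linux/dma-direction.h>
#include <linux/scatterlist.h>
#include <linux/pci-p2pdma.h>

static int example_p2p_map(struct device *initiator, struct device *target,
			   struct scatterlist *sgl, int nents)
{
	LIST_HEAD(clients);
	int ret;

	/* Every device that will touch the P2P memory must be a client. */
	ret = pci_p2pdma_add_client(&clients, initiator);
	if (ret)
		return ret;

	ret = pci_p2pdma_add_client(&clients, target);
	if (ret)
		return ret;

	/* Map the (P2P-backed) scatterlist for DMA by the initiator. */
	return pci_p2pdma_map_sg(initiator, sgl, nents, DMA_BIDIRECTIONAL);
}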

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation
@ 2018-05-07 23:20     ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:20 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA, Jonathan Corbet,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Christian König, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Mon, Apr 23, 2018 at 05:30:38PM -0600, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
> Cc: Jonathan Corbet <corbet-T1hC0tSOHrs@public.gmane.org>
> ---
>  Documentation/PCI/index.rst             |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 ++++++++++++++++++++++++++++++++
>  Documentation/index.rst                 |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> 
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> new file mode 100644
> index 000000000000..2fdc4b3c291d
> --- /dev/null
> +++ b/Documentation/PCI/index.rst
> @@ -0,0 +1,14 @@
> +==================================
> +Linux PCI Driver Developer's Guide
> +==================================
> +
> +.. toctree::
> +
> +   p2pdma
> +
> +.. only::  subproject and html
> +
> +   Indices
> +   =======
> +
> +   * :ref:`genindex`
> diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
> index 03b57cbf8cc2..d12eeafbfc90 100644
> --- a/Documentation/driver-api/pci/index.rst
> +++ b/Documentation/driver-api/pci/index.rst
> @@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
>     :maxdepth: 2
>  
>     pci
> +   p2pdma
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index 000000000000..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +============================
> +PCI Peer-to-Peer DMA Support
> +============================
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is

s/endpoints/devices/

> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required

s/PCI Root Complexes .../
  PCI doesn't require forwarding transactions between hierarchy domains,
and in PCIe, each Root Port defines a separate hierarchy domain./

> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.

s/endpoints involved .../
  devices involved are all behind the same PCI bridge, as such devices are
  all in the same PCI hierarchy domain, and the spec guarantees that all
  transactions within the hierarchy will be routable, but it does not
  require routing between hierarchies./

> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation
@ 2018-05-07 23:20     ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:20 UTC (permalink / raw)


On Mon, Apr 23, 2018@05:30:38PM -0600, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe <logang at deltatee.com>
> Cc: Jonathan Corbet <corbet at lwn.net>
> ---
>  Documentation/PCI/index.rst             |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 ++++++++++++++++++++++++++++++++
>  Documentation/index.rst                 |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> 
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> new file mode 100644
> index 000000000000..2fdc4b3c291d
> --- /dev/null
> +++ b/Documentation/PCI/index.rst
> @@ -0,0 +1,14 @@
> +==================================
> +Linux PCI Driver Developer's Guide
> +==================================
> +
> +.. toctree::
> +
> +   p2pdma
> +
> +.. only::  subproject and html
> +
> +   Indices
> +   =======
> +
> +   * :ref:`genindex`
> diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
> index 03b57cbf8cc2..d12eeafbfc90 100644
> --- a/Documentation/driver-api/pci/index.rst
> +++ b/Documentation/driver-api/pci/index.rst
> @@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
>     :maxdepth: 2
>  
>     pci
> +   p2pdma
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index 000000000000..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +============================
> +PCI Peer-to-Peer DMA Support
> +============================
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is

s/endpoints/devices/

> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required

s/PCI Root Complexes .../
  PCI doesn't require forwarding transactions between hierarchy domains,
and in PCIe, each Root Port defines a separate hierarchy domain./

> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.

s/endpoints involved .../
  devices involved are all behind the same PCI bridge, as such devices are
  all in the same PCI hierarchy domain, and the spec guarantees that all
  transactions within the hierarchy will be routable, but it does not
  require routing between hierarchies./

> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-04-23 23:30 ` Logan Gunthorpe
                     ` (2 preceding siblings ...)
  (?)
@ 2018-05-07 23:23   ` Bjorn Helgaas
  -1 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:23 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Christian König,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...

> Logan Gunthorpe (14):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Ensure we map P2P memory correctly in
>     rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet-rdma: Use new SGL alloc/free helper for requests
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>  Documentation/PCI/index.rst                |  14 +
>  Documentation/driver-api/index.rst         |   2 +-
>  Documentation/driver-api/pci/index.rst     |  20 +
>  Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  Documentation/index.rst                    |   3 +-
>  block/blk-core.c                           |   3 +
>  drivers/infiniband/core/rw.c               |  13 +-
>  drivers/nvme/host/core.c                   |   4 +
>  drivers/nvme/host/nvme.h                   |   8 +
>  drivers/nvme/host/pci.c                    | 118 +++--
>  drivers/nvme/target/configfs.c             |  67 +++
>  drivers/nvme/target/core.c                 | 143 ++++-
>  drivers/nvme/target/io-cmd.c               |   3 +
>  drivers/nvme/target/nvmet.h                |  15 +
>  drivers/nvme/target/rdma.c                 |  22 +-
>  drivers/pci/Kconfig                        |  26 +
>  drivers/pci/Makefile                       |   1 +
>  drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
>  drivers/pci/pci.c                          |   6 +
>  include/linux/blk_types.h                  |  18 +-
>  include/linux/blkdev.h                     |   3 +
>  include/linux/memremap.h                   |  19 +
>  include/linux/pci-p2pdma.h                 | 118 +++++
>  include/linux/pci.h                        |   4 +
>  26 files changed, 1579 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

How do you envision merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-07 23:23   ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:23 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt, Alex Williamson, Christian König

On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...

> Logan Gunthorpe (14):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Ensure we map P2P memory correctly in
>     rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet-rdma: Use new SGL alloc/free helper for requests
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>  Documentation/PCI/index.rst                |  14 +
>  Documentation/driver-api/index.rst         |   2 +-
>  Documentation/driver-api/pci/index.rst     |  20 +
>  Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  Documentation/index.rst                    |   3 +-
>  block/blk-core.c                           |   3 +
>  drivers/infiniband/core/rw.c               |  13 +-
>  drivers/nvme/host/core.c                   |   4 +
>  drivers/nvme/host/nvme.h                   |   8 +
>  drivers/nvme/host/pci.c                    | 118 +++--
>  drivers/nvme/target/configfs.c             |  67 +++
>  drivers/nvme/target/core.c                 | 143 ++++-
>  drivers/nvme/target/io-cmd.c               |   3 +
>  drivers/nvme/target/nvmet.h                |  15 +
>  drivers/nvme/target/rdma.c                 |  22 +-
>  drivers/pci/Kconfig                        |  26 +
>  drivers/pci/Makefile                       |   1 +
>  drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
>  drivers/pci/pci.c                          |   6 +
>  include/linux/blk_types.h                  |  18 +-
>  include/linux/blkdev.h                     |   3 +
>  include/linux/memremap.h                   |  19 +
>  include/linux/pci-p2pdma.h                 | 118 +++++
>  include/linux/pci.h                        |   4 +
>  26 files changed, 1579 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

How do you envision merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-07 23:23   ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:23 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Christian König, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...

> Logan Gunthorpe (14):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Ensure we map P2P memory correctly in
>     rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet-rdma: Use new SGL alloc/free helper for requests
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>  Documentation/PCI/index.rst                |  14 +
>  Documentation/driver-api/index.rst         |   2 +-
>  Documentation/driver-api/pci/index.rst     |  20 +
>  Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  Documentation/index.rst                    |   3 +-
>  block/blk-core.c                           |   3 +
>  drivers/infiniband/core/rw.c               |  13 +-
>  drivers/nvme/host/core.c                   |   4 +
>  drivers/nvme/host/nvme.h                   |   8 +
>  drivers/nvme/host/pci.c                    | 118 +++--
>  drivers/nvme/target/configfs.c             |  67 +++
>  drivers/nvme/target/core.c                 | 143 ++++-
>  drivers/nvme/target/io-cmd.c               |   3 +
>  drivers/nvme/target/nvmet.h                |  15 +
>  drivers/nvme/target/rdma.c                 |  22 +-
>  drivers/pci/Kconfig                        |  26 +
>  drivers/pci/Makefile                       |   1 +
>  drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
>  drivers/pci/pci.c                          |   6 +
>  include/linux/blk_types.h                  |  18 +-
>  include/linux/blkdev.h                     |   3 +
>  include/linux/memremap.h                   |  19 +
>  include/linux/pci-p2pdma.h                 | 118 +++++
>  include/linux/pci.h                        |   4 +
>  26 files changed, 1579 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

How do you envision merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-07 23:23   ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:23 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, Sagi Grimberg,
	linux-nvdimm, linux-rdma, linux-pci, linux-kernel, linux-nvme,
	Stephen Bates, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Christian König, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Dan Williams, Christoph Hellwig

On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...

> Logan Gunthorpe (14):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Ensure we map P2P memory correctly in
>     rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet-rdma: Use new SGL alloc/free helper for requests
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>  Documentation/PCI/index.rst                |  14 +
>  Documentation/driver-api/index.rst         |   2 +-
>  Documentation/driver-api/pci/index.rst     |  20 +
>  Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  Documentation/index.rst                    |   3 +-
>  block/blk-core.c                           |   3 +
>  drivers/infiniband/core/rw.c               |  13 +-
>  drivers/nvme/host/core.c                   |   4 +
>  drivers/nvme/host/nvme.h                   |   8 +
>  drivers/nvme/host/pci.c                    | 118 +++--
>  drivers/nvme/target/configfs.c             |  67 +++
>  drivers/nvme/target/core.c                 | 143 ++++-
>  drivers/nvme/target/io-cmd.c               |   3 +
>  drivers/nvme/target/nvmet.h                |  15 +
>  drivers/nvme/target/rdma.c                 |  22 +-
>  drivers/pci/Kconfig                        |  26 +
>  drivers/pci/Makefile                       |   1 +
>  drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
>  drivers/pci/pci.c                          |   6 +
>  include/linux/blk_types.h                  |  18 +-
>  include/linux/blkdev.h                     |   3 +
>  include/linux/memremap.h                   |  19 +
>  include/linux/pci-p2pdma.h                 | 118 +++++
>  include/linux/pci.h                        |   4 +
>  26 files changed, 1579 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

How do you envision merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-07 23:23   ` Bjorn Helgaas
  0 siblings, 0 replies; 460+ messages in thread
From: Bjorn Helgaas @ 2018-05-07 23:23 UTC (permalink / raw)


On Mon, Apr 23, 2018@05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...

> Logan Gunthorpe (14):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Ensure we map P2P memory correctly in
>     rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet-rdma: Use new SGL alloc/free helper for requests
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>  Documentation/PCI/index.rst                |  14 +
>  Documentation/driver-api/index.rst         |   2 +-
>  Documentation/driver-api/pci/index.rst     |  20 +
>  Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  Documentation/index.rst                    |   3 +-
>  block/blk-core.c                           |   3 +
>  drivers/infiniband/core/rw.c               |  13 +-
>  drivers/nvme/host/core.c                   |   4 +
>  drivers/nvme/host/nvme.h                   |   8 +
>  drivers/nvme/host/pci.c                    | 118 +++--
>  drivers/nvme/target/configfs.c             |  67 +++
>  drivers/nvme/target/core.c                 | 143 ++++-
>  drivers/nvme/target/io-cmd.c               |   3 +
>  drivers/nvme/target/nvmet.h                |  15 +
>  drivers/nvme/target/rdma.c                 |  22 +-
>  drivers/pci/Kconfig                        |  26 +
>  drivers/pci/Makefile                       |   1 +
>  drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
>  drivers/pci/pci.c                          |   6 +
>  include/linux/blk_types.h                  |  18 +-
>  include/linux/blkdev.h                     |   3 +
>  include/linux/memremap.h                   |  19 +
>  include/linux/pci-p2pdma.h                 | 118 +++++
>  include/linux/pci.h                        |   4 +
>  26 files changed, 1579 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

How do you envision merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-07 23:23   ` Bjorn Helgaas
  (?)
  (?)
@ 2018-05-07 23:34     ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-07 23:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Christian König,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig


> How do you envision merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

Honestly, I don't know. I guess with your ACK on the PCI parts, the vast
majority of what remains is NVMe stuff, so we could look at merging it
through that tree. The block patch and IB patch are pretty small.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-07 23:34     ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-07 23:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt, Alex Williamson, Christian König


> How do you envision merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

Honestly, I don't know. I guess with your ACK on the PCI parts, the vast
majority of what remains is NVMe stuff, so we could look at merging it
through that tree. The block patch and IB patch are pretty small.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-07 23:34     ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-07 23:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Christian König, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig


> How do you envision merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

Honestly, I don't know. I guess with your ACK on the PCI parts, the vast
majority of what remains is NVMe stuff, so we could look at merging it
through that tree. The block patch and IB patch are pretty small.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-07 23:34     ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-07 23:34 UTC (permalink / raw)



> How do you envision merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

Honestly, I don't know. I guess with your ACK on the PCI parts, the vast
majority of what remains is NVMe stuff, so we could look at merging it
through that tree. The block patch and IB patch are pretty small.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-07 23:13     ` Bjorn Helgaas
                         ` (2 preceding siblings ...)
  (?)
@ 2018-05-08  7:17       ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08  7:17 UTC (permalink / raw)
  To: Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Hi Bjorn,

On 08.05.2018 at 01:13, Bjorn Helgaas wrote:
> [+to Alex]
>
> Alex,
>
> Are you happy with this strategy of turning off ACS based on
> CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and
> I don't know if there are other places we would care?

thanks for pointing this out, I totally missed this hack.

AMD APUs require the ACS flag to be set for the GPU integrated in the
CPU when the IOMMU is enabled, otherwise you will break SVM.

Similar problems arise when you do this for a dedicated GPU, but we
haven't upstreamed the support for this yet.

So that is a clear NAK from my side for the approach.

And what exactly is the problem here? I'm currently testing P2P with 
GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
perfectly fine.

Regards,
Christian.

>
> On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
>> For peer-to-peer transactions to work the downstream ports in each
>> switch must not have the ACS flags set. At this time there is no way
>> to dynamically change the flags and update the corresponding IOMMU
>> groups so this is done at enumeration time before the groups are
>> assigned.
>>
>> This effectively means that if CONFIG_PCI_P2PDMA is selected then
>> all devices behind any PCIe switch heirarchy will be in the same IOMMU
>> group. Which implies that individual devices behind any switch
>> heirarchy will not be able to be assigned to separate VMs because
>> there is no isolation between them. Additionally, any malicious PCIe
>> devices will be able to DMA to memory exposed by other EPs in the same
>> domain as TLPs will not be checked by the IOMMU.
>>
>> Given that the intended use case of P2P Memory is for users with
>> custom hardware designed for purpose, we do not expect distributors
>> to ever need to enable this option. Users that want to use P2P
>> must have compiled a custom kernel with this configuration option
>> and understand the implications regarding ACS. They will either
>> not require ACS or will have design the system in such a way that
>> devices that require isolation will be separate from those using P2P
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> ---
>>   drivers/pci/Kconfig        |  9 +++++++++
>>   drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>>   drivers/pci/pci.c          |  6 ++++++
>>   include/linux/pci-p2pdma.h |  5 +++++
>>   4 files changed, 50 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index b2396c22b53e..b6db41d4b708 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>>   	  transations must be between devices behind the same root port.
>>   	  (Typically behind a network of PCIe switches).
>>   
>> +	  Enabling this option will also disable ACS on all ports behind
>> +	  any PCIe switch. This effectively puts all devices behind any
>> +	  switch heirarchy into the same IOMMU group. Which implies that
> s/heirarchy/hierarchy/ (also above in changelog)
>
>> +	  individual devices behind any switch will not be able to be
>> +	  assigned to separate VMs because there is no isolation between
>> +	  them. Additionally, any malicious PCIe devices will be able to
>> +	  DMA to memory exposed by other EPs in the same domain as TLPs
>> +	  will not be checked by the IOMMU.
>> +
>>   	  If unsure, say N.
>>   
>>   config PCI_LABEL
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ed9dce8552a2..e9f43b43acac 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>>   }
>>   
>>   /*
>> - * If a device is behind a switch, we try to find the upstream bridge
>> - * port of the switch. This requires two calls to pci_upstream_bridge():
>> - * one for the upstream port on the switch, one on the upstream port
>> - * for the next level in the hierarchy. Because of this, devices connected
>> - * to the root port will be rejected.
>> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
>> + * @pdev: device to disable ACS flags for
>> + *
>> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
>> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
>> + * up to the RC which is not what we want for P2P.
> s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)
>
>> + *
>> + * This function is called when the devices are first enumerated and
>> + * will result in all devices behind any bridge to be in the same IOMMU
>> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
>> + * on this largish hammer. If you need the devices to be in separate groups
>> + * don't enable CONFIG_PCI_P2PDMA.
>> + *
>> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>>    */
>> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>>   {
>> -	struct pci_dev *up1, *up2;
>> +	int pos;
>> +	u16 ctrl;
>>   
>> -	if (!pdev)
>> -		return NULL;
>> +	if (!pci_is_bridge(pdev))
>> +		return 0;
>>   
>> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
>> -	if (!up1)
>> -		return NULL;
>> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
>> +	if (!pos)
>> +		return 0;
>> +
>> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
>> +
>> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
>> +
>> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>>   
>> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
>> -	pci_dev_put(up1);
>> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>>   
>> -	return up2;
>> +	return 1;
>>   }
>>   
>>   /*
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e597655a5643..7e2f5724ba22 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -16,6 +16,7 @@
>>   #include <linux/of.h>
>>   #include <linux/of_pci.h>
>>   #include <linux/pci.h>
>> +#include <linux/pci-p2pdma.h>
>>   #include <linux/pm.h>
>>   #include <linux/slab.h>
>>   #include <linux/module.h>
>> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>>    */
>>   void pci_enable_acs(struct pci_dev *dev)
>>   {
>> +#ifdef CONFIG_PCI_P2PDMA
>> +	if (pci_p2pdma_disable_acs(dev))
>> +		return;
>> +#endif
>> +
>>   	if (!pci_acs_enable)
>>   		return;
>>   
>> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
>> index 0cde88341eeb..fcb3437a2f3c 100644
>> --- a/include/linux/pci-p2pdma.h
>> +++ b/include/linux/pci-p2pdma.h
>> @@ -18,6 +18,7 @@ struct block_device;
>>   struct scatterlist;
>>   
>>   #ifdef CONFIG_PCI_P2PDMA
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>>   int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>>   		u64 offset);
>>   int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
>> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   			 enum dma_data_direction dir);
>>   #else /* CONFIG_PCI_P2PDMA */
>> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>> +{
>> +	return 0;
>> +}
>>   static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>>   		size_t size, u64 offset)
>>   {
>> -- 
>> 2.11.0
>>

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08  7:17       ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08  7:17 UTC (permalink / raw)
  To: Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt

Hi Bjorn,

On 08.05.2018 at 01:13, Bjorn Helgaas wrote:
> [+to Alex]
>
> Alex,
>
> Are you happy with this strategy of turning off ACS based on
> CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and
> I don't know if there are other places we would care?

thanks for pointing this out, I totally missed this hack.

AMD APUs require the ACS flag to be set for the GPU integrated in the
CPU when the IOMMU is enabled, otherwise you will break SVM.

Similar problems arise when you do this for a dedicated GPU, but we
haven't upstreamed the support for this yet.

So that is a clear NAK from my side for the approach.

And what exactly is the problem here? I'm currently testing P2P with 
GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
perfectly fine.

Regards,
Christian.

>
> On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
>> For peer-to-peer transactions to work the downstream ports in each
>> switch must not have the ACS flags set. At this time there is no way
>> to dynamically change the flags and update the corresponding IOMMU
>> groups so this is done at enumeration time before the groups are
>> assigned.
>>
>> This effectively means that if CONFIG_PCI_P2PDMA is selected then
>> all devices behind any PCIe switch heirarchy will be in the same IOMMU
>> group. Which implies that individual devices behind any switch
>> heirarchy will not be able to be assigned to separate VMs because
>> there is no isolation between them. Additionally, any malicious PCIe
>> devices will be able to DMA to memory exposed by other EPs in the same
>> domain as TLPs will not be checked by the IOMMU.
>>
>> Given that the intended use case of P2P Memory is for users with
>> custom hardware designed for purpose, we do not expect distributors
>> to ever need to enable this option. Users that want to use P2P
>> must have compiled a custom kernel with this configuration option
>> and understand the implications regarding ACS. They will either
>> not require ACS or will have design the system in such a way that
>> devices that require isolation will be separate from those using P2P
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> ---
>>   drivers/pci/Kconfig        |  9 +++++++++
>>   drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>>   drivers/pci/pci.c          |  6 ++++++
>>   include/linux/pci-p2pdma.h |  5 +++++
>>   4 files changed, 50 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index b2396c22b53e..b6db41d4b708 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>>   	  transations must be between devices behind the same root port.
>>   	  (Typically behind a network of PCIe switches).
>>   
>> +	  Enabling this option will also disable ACS on all ports behind
>> +	  any PCIe switch. This effectively puts all devices behind any
>> +	  switch heirarchy into the same IOMMU group. Which implies that
> s/heirarchy/hierarchy/ (also above in changelog)
>
>> +	  individual devices behind any switch will not be able to be
>> +	  assigned to separate VMs because there is no isolation between
>> +	  them. Additionally, any malicious PCIe devices will be able to
>> +	  DMA to memory exposed by other EPs in the same domain as TLPs
>> +	  will not be checked by the IOMMU.
>> +
>>   	  If unsure, say N.
>>   
>>   config PCI_LABEL
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ed9dce8552a2..e9f43b43acac 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>>   }
>>   
>>   /*
>> - * If a device is behind a switch, we try to find the upstream bridge
>> - * port of the switch. This requires two calls to pci_upstream_bridge():
>> - * one for the upstream port on the switch, one on the upstream port
>> - * for the next level in the hierarchy. Because of this, devices connected
>> - * to the root port will be rejected.
>> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
>> + * @pdev: device to disable ACS flags for
>> + *
>> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
>> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
>> + * up to the RC which is not what we want for P2P.
> s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)
>
>> + *
>> + * This function is called when the devices are first enumerated and
>> + * will result in all devices behind any bridge to be in the same IOMMU
>> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
>> + * on this largish hammer. If you need the devices to be in separate groups
>> + * don't enable CONFIG_PCI_P2PDMA.
>> + *
>> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>>    */
>> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>>   {
>> -	struct pci_dev *up1, *up2;
>> +	int pos;
>> +	u16 ctrl;
>>   
>> -	if (!pdev)
>> -		return NULL;
>> +	if (!pci_is_bridge(pdev))
>> +		return 0;
>>   
>> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
>> -	if (!up1)
>> -		return NULL;
>> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
>> +	if (!pos)
>> +		return 0;
>> +
>> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
>> +
>> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
>> +
>> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>>   
>> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
>> -	pci_dev_put(up1);
>> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>>   
>> -	return up2;
>> +	return 1;
>>   }
>>   
>>   /*
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e597655a5643..7e2f5724ba22 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -16,6 +16,7 @@
>>   #include <linux/of.h>
>>   #include <linux/of_pci.h>
>>   #include <linux/pci.h>
>> +#include <linux/pci-p2pdma.h>
>>   #include <linux/pm.h>
>>   #include <linux/slab.h>
>>   #include <linux/module.h>
>> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>>    */
>>   void pci_enable_acs(struct pci_dev *dev)
>>   {
>> +#ifdef CONFIG_PCI_P2PDMA
>> +	if (pci_p2pdma_disable_acs(dev))
>> +		return;
>> +#endif
>> +
>>   	if (!pci_acs_enable)
>>   		return;
>>   
>> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
>> index 0cde88341eeb..fcb3437a2f3c 100644
>> --- a/include/linux/pci-p2pdma.h
>> +++ b/include/linux/pci-p2pdma.h
>> @@ -18,6 +18,7 @@ struct block_device;
>>   struct scatterlist;
>>   
>>   #ifdef CONFIG_PCI_P2PDMA
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>>   int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>>   		u64 offset);
>>   int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
>> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   			 enum dma_data_direction dir);
>>   #else /* CONFIG_PCI_P2PDMA */
>> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>> +{
>> +	return 0;
>> +}
>>   static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>>   		size_t size, u64 offset)
>>   {
>> -- 
>> 2.11.0
>>

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08  7:17       ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08  7:17 UTC (permalink / raw)
  To: Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Hi Bjorn,

On 08.05.2018 at 01:13, Bjorn Helgaas wrote:
> [+to Alex]
>
> Alex,
>
> Are you happy with this strategy of turning off ACS based on
> CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and
> I don't know if there are other places we would care?

thanks for pointing this out, I totally missed this hack.

AMD APUs require the ACS flag to be set for the GPU integrated in the
CPU when the IOMMU is enabled, otherwise you will break SVM.

Similar problems arise when you do this for a dedicated GPU, but we
haven't upstreamed the support for this yet.

So that is a clear NAK from my side for the approach.

And what exactly is the problem here? I'm currently testing P2P with 
GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
perfectly fine.

Regards,
Christian.

>
> On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
>> For peer-to-peer transactions to work the downstream ports in each
>> switch must not have the ACS flags set. At this time there is no way
>> to dynamically change the flags and update the corresponding IOMMU
>> groups so this is done at enumeration time before the groups are
>> assigned.
>>
>> This effectively means that if CONFIG_PCI_P2PDMA is selected then
>> all devices behind any PCIe switch heirarchy will be in the same IOMMU
>> group. Which implies that individual devices behind any switch
>> heirarchy will not be able to be assigned to separate VMs because
>> there is no isolation between them. Additionally, any malicious PCIe
>> devices will be able to DMA to memory exposed by other EPs in the same
>> domain as TLPs will not be checked by the IOMMU.
>>
>> Given that the intended use case of P2P Memory is for users with
>> custom hardware designed for purpose, we do not expect distributors
>> to ever need to enable this option. Users that want to use P2P
>> must have compiled a custom kernel with this configuration option
>> and understand the implications regarding ACS. They will either
>> not require ACS or will have design the system in such a way that
>> devices that require isolation will be separate from those using P2P
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
>> ---
>>   drivers/pci/Kconfig        |  9 +++++++++
>>   drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>>   drivers/pci/pci.c          |  6 ++++++
>>   include/linux/pci-p2pdma.h |  5 +++++
>>   4 files changed, 50 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index b2396c22b53e..b6db41d4b708 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>>   	  transations must be between devices behind the same root port.
>>   	  (Typically behind a network of PCIe switches).
>>   
>> +	  Enabling this option will also disable ACS on all ports behind
>> +	  any PCIe switch. This effectively puts all devices behind any
>> +	  switch heirarchy into the same IOMMU group. Which implies that
> s/heirarchy/hierarchy/ (also above in changelog)
>
>> +	  individual devices behind any switch will not be able to be
>> +	  assigned to separate VMs because there is no isolation between
>> +	  them. Additionally, any malicious PCIe devices will be able to
>> +	  DMA to memory exposed by other EPs in the same domain as TLPs
>> +	  will not be checked by the IOMMU.
>> +
>>   	  If unsure, say N.
>>   
>>   config PCI_LABEL
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ed9dce8552a2..e9f43b43acac 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>>   }
>>   
>>   /*
>> - * If a device is behind a switch, we try to find the upstream bridge
>> - * port of the switch. This requires two calls to pci_upstream_bridge():
>> - * one for the upstream port on the switch, one on the upstream port
>> - * for the next level in the hierarchy. Because of this, devices connected
>> - * to the root port will be rejected.
>> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
>> + * @pdev: device to disable ACS flags for
>> + *
>> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
>> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
>> + * up to the RC which is not what we want for P2P.
> s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)
>
>> + *
>> + * This function is called when the devices are first enumerated and
>> + * will result in all devices behind any bridge to be in the same IOMMU
>> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
>> + * on this largish hammer. If you need the devices to be in separate groups
>> + * don't enable CONFIG_PCI_P2PDMA.
>> + *
>> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>>    */
>> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>>   {
>> -	struct pci_dev *up1, *up2;
>> +	int pos;
>> +	u16 ctrl;
>>   
>> -	if (!pdev)
>> -		return NULL;
>> +	if (!pci_is_bridge(pdev))
>> +		return 0;
>>   
>> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
>> -	if (!up1)
>> -		return NULL;
>> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
>> +	if (!pos)
>> +		return 0;
>> +
>> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
>> +
>> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
>> +
>> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>>   
>> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
>> -	pci_dev_put(up1);
>> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>>   
>> -	return up2;
>> +	return 1;
>>   }
>>   
>>   /*
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e597655a5643..7e2f5724ba22 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -16,6 +16,7 @@
>>   #include <linux/of.h>
>>   #include <linux/of_pci.h>
>>   #include <linux/pci.h>
>> +#include <linux/pci-p2pdma.h>
>>   #include <linux/pm.h>
>>   #include <linux/slab.h>
>>   #include <linux/module.h>
>> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>>    */
>>   void pci_enable_acs(struct pci_dev *dev)
>>   {
>> +#ifdef CONFIG_PCI_P2PDMA
>> +	if (pci_p2pdma_disable_acs(dev))
>> +		return;
>> +#endif
>> +
>>   	if (!pci_acs_enable)
>>   		return;
>>   
>> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
>> index 0cde88341eeb..fcb3437a2f3c 100644
>> --- a/include/linux/pci-p2pdma.h
>> +++ b/include/linux/pci-p2pdma.h
>> @@ -18,6 +18,7 @@ struct block_device;
>>   struct scatterlist;
>>   
>>   #ifdef CONFIG_PCI_P2PDMA
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>>   int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>>   		u64 offset);
>>   int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
>> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   			 enum dma_data_direction dir);
>>   #else /* CONFIG_PCI_P2PDMA */
>> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>> +{
>> +	return 0;
>> +}
>>   static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>>   		size_t size, u64 offset)
>>   {
>> -- 
>> 2.11.0
>>

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08  7:17       ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08  7:17 UTC (permalink / raw)
  To: Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, Stephen Bates, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Dan Williams, Christoph Hellwig

Hi Bjorn,

On 08.05.2018 at 01:13, Bjorn Helgaas wrote:
> [+to Alex]
>
> Alex,
>
> Are you happy with this strategy of turning off ACS based on
> CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and
> I don't know if there are other places we would care?

thanks for pointing this out, I totally missed this hack.

AMD APUs require the ACS flag to be set for the GPU integrated in the
CPU when the IOMMU is enabled, otherwise you will break SVM.

Similar problems arise when you do this for a dedicated GPU, but we
haven't upstreamed the support for this yet.

So that is a clear NAK from my side for the approach.

And what exactly is the problem here? I'm currently testing P2P with 
GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
perfectly fine.

Regards,
Christian.

>
> On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
>> For peer-to-peer transactions to work the downstream ports in each
>> switch must not have the ACS flags set. At this time there is no way
>> to dynamically change the flags and update the corresponding IOMMU
>> groups so this is done at enumeration time before the groups are
>> assigned.
>>
>> This effectively means that if CONFIG_PCI_P2PDMA is selected then
>> all devices behind any PCIe switch heirarchy will be in the same IOMMU
>> group. Which implies that individual devices behind any switch
>> heirarchy will not be able to be assigned to separate VMs because
>> there is no isolation between them. Additionally, any malicious PCIe
>> devices will be able to DMA to memory exposed by other EPs in the same
>> domain as TLPs will not be checked by the IOMMU.
>>
>> Given that the intended use case of P2P Memory is for users with
>> custom hardware designed for purpose, we do not expect distributors
>> to ever need to enable this option. Users that want to use P2P
>> must have compiled a custom kernel with this configuration option
>> and understand the implications regarding ACS. They will either
>> not require ACS or will have design the system in such a way that
>> devices that require isolation will be separate from those using P2P
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> ---
>>   drivers/pci/Kconfig        |  9 +++++++++
>>   drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>>   drivers/pci/pci.c          |  6 ++++++
>>   include/linux/pci-p2pdma.h |  5 +++++
>>   4 files changed, 50 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index b2396c22b53e..b6db41d4b708 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>>   	  transations must be between devices behind the same root port.
>>   	  (Typically behind a network of PCIe switches).
>>   
>> +	  Enabling this option will also disable ACS on all ports behind
>> +	  any PCIe switch. This effectively puts all devices behind any
>> +	  switch heirarchy into the same IOMMU group. Which implies that
> s/heirarchy/hierarchy/ (also above in changelog)
>
>> +	  individual devices behind any switch will not be able to be
>> +	  assigned to separate VMs because there is no isolation between
>> +	  them. Additionally, any malicious PCIe devices will be able to
>> +	  DMA to memory exposed by other EPs in the same domain as TLPs
>> +	  will not be checked by the IOMMU.
>> +
>>   	  If unsure, say N.
>>   
>>   config PCI_LABEL
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ed9dce8552a2..e9f43b43acac 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>>   }
>>   
>>   /*
>> - * If a device is behind a switch, we try to find the upstream bridge
>> - * port of the switch. This requires two calls to pci_upstream_bridge():
>> - * one for the upstream port on the switch, one on the upstream port
>> - * for the next level in the hierarchy. Because of this, devices connected
>> - * to the root port will be rejected.
>> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
>> + * @pdev: device to disable ACS flags for
>> + *
>> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
>> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
>> + * up to the RC which is not what we want for P2P.
> s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)
>
>> + *
>> + * This function is called when the devices are first enumerated and
>> + * will result in all devices behind any bridge to be in the same IOMMU
>> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
>> + * on this largish hammer. If you need the devices to be in separate groups
>> + * don't enable CONFIG_PCI_P2PDMA.
>> + *
>> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>>    */
>> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>>   {
>> -	struct pci_dev *up1, *up2;
>> +	int pos;
>> +	u16 ctrl;
>>   
>> -	if (!pdev)
>> -		return NULL;
>> +	if (!pci_is_bridge(pdev))
>> +		return 0;
>>   
>> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
>> -	if (!up1)
>> -		return NULL;
>> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
>> +	if (!pos)
>> +		return 0;
>> +
>> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
>> +
>> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
>> +
>> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>>   
>> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
>> -	pci_dev_put(up1);
>> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>>   
>> -	return up2;
>> +	return 1;
>>   }
>>   
>>   /*
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e597655a5643..7e2f5724ba22 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -16,6 +16,7 @@
>>   #include <linux/of.h>
>>   #include <linux/of_pci.h>
>>   #include <linux/pci.h>
>> +#include <linux/pci-p2pdma.h>
>>   #include <linux/pm.h>
>>   #include <linux/slab.h>
>>   #include <linux/module.h>
>> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>>    */
>>   void pci_enable_acs(struct pci_dev *dev)
>>   {
>> +#ifdef CONFIG_PCI_P2PDMA
>> +	if (pci_p2pdma_disable_acs(dev))
>> +		return;
>> +#endif
>> +
>>   	if (!pci_acs_enable)
>>   		return;
>>   
>> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
>> index 0cde88341eeb..fcb3437a2f3c 100644
>> --- a/include/linux/pci-p2pdma.h
>> +++ b/include/linux/pci-p2pdma.h
>> @@ -18,6 +18,7 @@ struct block_device;
>>   struct scatterlist;
>>   
>>   #ifdef CONFIG_PCI_P2PDMA
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>>   int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>>   		u64 offset);
>>   int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
>> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   			 enum dma_data_direction dir);
>>   #else /* CONFIG_PCI_P2PDMA */
>> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>> +{
>> +	return 0;
>> +}
>>   static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>>   		size_t size, u64 offset)
>>   {
>> -- 
>> 2.11.0
>>


^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08  7:17       ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08  7:17 UTC (permalink / raw)


Hi Bjorn,

On 08.05.2018 at 01:13, Bjorn Helgaas wrote:
> [+to Alex]
>
> Alex,
>
> Are you happy with this strategy of turning off ACS based on
> CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and
> I don't know if there are other places we would care?

thanks for pointing this out, I totally missed this hack.

AMD APUs require the ACS flag to be set for the GPU integrated in the
CPU when the IOMMU is enabled, otherwise you will break SVM.

Similar problems arise when you do this for a dedicated GPU, but we
haven't upstreamed the support for this yet.

So that is a clear NAK from my side for the approach.

And what exactly is the problem here? I'm currently testing P2P with 
GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
perfectly fine.

Regards,
Christian.

>
> On Mon, Apr 23, 2018@05:30:36PM -0600, Logan Gunthorpe wrote:
>> For peer-to-peer transactions to work the downstream ports in each
>> switch must not have the ACS flags set. At this time there is no way
>> to dynamically change the flags and update the corresponding IOMMU
>> groups so this is done at enumeration time before the groups are
>> assigned.
>>
>> This effectively means that if CONFIG_PCI_P2PDMA is selected then
>> all devices behind any PCIe switch heirarchy will be in the same IOMMU
>> group. Which implies that individual devices behind any switch
>> heirarchy will not be able to be assigned to separate VMs because
>> there is no isolation between them. Additionally, any malicious PCIe
>> devices will be able to DMA to memory exposed by other EPs in the same
>> domain as TLPs will not be checked by the IOMMU.
>>
>> Given that the intended use case of P2P Memory is for users with
>> custom hardware designed for purpose, we do not expect distributors
>> to ever need to enable this option. Users that want to use P2P
>> must have compiled a custom kernel with this configuration option
>> and understand the implications regarding ACS. They will either
>> not require ACS or will have design the system in such a way that
>> devices that require isolation will be separate from those using P2P
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe <logang at deltatee.com>
>> ---
>>   drivers/pci/Kconfig        |  9 +++++++++
>>   drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>>   drivers/pci/pci.c          |  6 ++++++
>>   include/linux/pci-p2pdma.h |  5 +++++
>>   4 files changed, 50 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index b2396c22b53e..b6db41d4b708 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>>   	  transations must be between devices behind the same root port.
>>   	  (Typically behind a network of PCIe switches).
>>   
>> +	  Enabling this option will also disable ACS on all ports behind
>> +	  any PCIe switch. This effectively puts all devices behind any
>> +	  switch heirarchy into the same IOMMU group. Which implies that
> s/heirarchy/hierarchy/ (also above in changelog)
>
>> +	  individual devices behind any switch will not be able to be
>> +	  assigned to separate VMs because there is no isolation between
>> +	  them. Additionally, any malicious PCIe devices will be able to
>> +	  DMA to memory exposed by other EPs in the same domain as TLPs
>> +	  will not be checked by the IOMMU.
>> +
>>   	  If unsure, say N.
>>   
>>   config PCI_LABEL
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ed9dce8552a2..e9f43b43acac 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>>   }
>>   
>>   /*
>> - * If a device is behind a switch, we try to find the upstream bridge
>> - * port of the switch. This requires two calls to pci_upstream_bridge():
>> - * one for the upstream port on the switch, one on the upstream port
>> - * for the next level in the hierarchy. Because of this, devices connected
>> - * to the root port will be rejected.
>> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
>> + * @pdev: device to disable ACS flags for
>> + *
>> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
>> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
>> + * up to the RC which is not what we want for P2P.
> s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)
>
>> + *
>> + * This function is called when the devices are first enumerated and
>> + * will result in all devices behind any bridge to be in the same IOMMU
>> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
>> + * on this largish hammer. If you need the devices to be in separate groups
>> + * don't enable CONFIG_PCI_P2PDMA.
>> + *
>> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>>    */
>> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>>   {
>> -	struct pci_dev *up1, *up2;
>> +	int pos;
>> +	u16 ctrl;
>>   
>> -	if (!pdev)
>> -		return NULL;
>> +	if (!pci_is_bridge(pdev))
>> +		return 0;
>>   
>> -	up1 = pci_dev_get(pci_upstream_bridge(pdev));
>> -	if (!up1)
>> -		return NULL;
>> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
>> +	if (!pos)
>> +		return 0;
>> +
>> +	pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
>> +
>> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
>> +
>> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>>   
>> -	up2 = pci_dev_get(pci_upstream_bridge(up1));
>> -	pci_dev_put(up1);
>> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
>>   
>> -	return up2;
>> +	return 1;
>>   }
>>   
>>   /*
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e597655a5643..7e2f5724ba22 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -16,6 +16,7 @@
>>   #include <linux/of.h>
>>   #include <linux/of_pci.h>
>>   #include <linux/pci.h>
>> +#include <linux/pci-p2pdma.h>
>>   #include <linux/pm.h>
>>   #include <linux/slab.h>
>>   #include <linux/module.h>
>> @@ -2835,6 +2836,11 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>>    */
>>   void pci_enable_acs(struct pci_dev *dev)
>>   {
>> +#ifdef CONFIG_PCI_P2PDMA
>> +	if (pci_p2pdma_disable_acs(dev))
>> +		return;
>> +#endif
>> +
>>   	if (!pci_acs_enable)
>>   		return;
>>   
>> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
>> index 0cde88341eeb..fcb3437a2f3c 100644
>> --- a/include/linux/pci-p2pdma.h
>> +++ b/include/linux/pci-p2pdma.h
>> @@ -18,6 +18,7 @@ struct block_device;
>>   struct scatterlist;
>>   
>>   #ifdef CONFIG_PCI_P2PDMA
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>>   int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>>   		u64 offset);
>>   int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
>> @@ -40,6 +41,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>>   			 enum dma_data_direction dir);
>>   #else /* CONFIG_PCI_P2PDMA */
>> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>> +{
>> +	return 0;
>> +}
>>   static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>>   		size_t size, u64 offset)
>>   {
>> -- 
>> 2.11.0
>>

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08  7:17       ` Christian König
                           ` (3 preceding siblings ...)
  (?)
@ 2018-05-08 14:25         ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:25 UTC (permalink / raw)
  To: Christian König
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

    
Hi Christian

> AMD APUs require the ACS flag to be set for the GPU integrated in the
> CPU when the IOMMU is enabled, otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

> Similar problems arise when you do this for a dedicated GPU, but we
> haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit. 
    
> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
    
> And what exactly is the problem here?
 
We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings, and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains, so if you do need a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).
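
For what it's worth, the resulting grouping is easy to observe from userspace after boot. Here is a minimal sketch (purely illustrative, not part of this series) that walks /sys/kernel/iommu_groups and prints which IOMMU group each device landed in; with CONFIG_PCI_P2PDMA enabled you would expect every EP behind the switch to show up under the same group number:

/* Illustrative only: list the IOMMU group each device was placed in. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *groups = opendir("/sys/kernel/iommu_groups");
	struct dirent *g;

	if (!groups) {
		perror("opendir");
		return 1;
	}

	while ((g = readdir(groups))) {
		char path[512];
		DIR *devs;
		struct dirent *d;

		if (g->d_name[0] == '.')
			continue;

		snprintf(path, sizeof(path),
			 "/sys/kernel/iommu_groups/%s/devices", g->d_name);
		devs = opendir(path);
		if (!devs)
			continue;

		while ((d = readdir(devs)))
			if (d->d_name[0] != '.')
				printf("group %s: %s\n", g->d_name, d->d_name);
		closedir(devs);
	}
	closedir(groups);
	return 0;
}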

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is that if new devices are added and the p2p topology needs to change, there would be no way to inform the VMs of any IOMMU group change.
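
To make that concrete, whether a given downstream port has the P2P Request Redirect (RR) and Completion Redirect (CR) bits set is visible directly in its ACS Control register. Here is a rough userspace sketch (again, illustrative only and not part of this series; it assumes root so the full 4K config space under sysfs is readable) that walks the extended capability list and dumps those bits:

/* Illustrative only: dump the ACS Control register of a PCIe port so the
 * P2P Request Redirect (RR) and Completion Redirect (CR) bits can be seen. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PCI_EXT_CAP_ID_ACS	0x000d
#define PCI_ACS_RR		0x0004	/* P2P Request Redirect */
#define PCI_ACS_CR		0x0008	/* P2P Completion Redirect */

static uint32_t cfg_read32(int fd, unsigned int off)
{
	uint32_t v = 0;

	if (pread(fd, &v, sizeof(v), off) != sizeof(v))
		return 0;
	return v;
}

int main(int argc, char **argv)
{
	unsigned int off = 0x100;	/* extended capabilities start here */
	char path[128];
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <domain:bus:dev.fn>\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/config", argv[1]);
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror(path);
		return 1;
	}

	while (off) {
		uint32_t hdr = cfg_read32(fd, off);

		if (hdr == 0 || hdr == 0xffffffff)
			break;

		if ((hdr & 0xffff) == PCI_EXT_CAP_ID_ACS) {
			/* +4: ACS Capability (low 16 bits), ACS Control (high 16) */
			uint16_t ctrl = cfg_read32(fd, off + 4) >> 16;

			printf("%s: ACS ctrl=0x%04x RR=%d CR=%d\n", argv[1],
			       ctrl, !!(ctrl & PCI_ACS_RR), !!(ctrl & PCI_ACS_CR));
			close(fd);
			return 0;
		}
		off = hdr >> 20;	/* next capability offset */
	}

	printf("%s: no ACS capability found\n", argv[1]);
	close(fd);
	return 0;
}

If RR and CR read back as 0 on the downstream ports between the peers, the P2P TLPs are being routed directly through the switch rather than redirected up to the RC/IOMMU, which is exactly the trade-off under discussion here.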

Cheers

Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:25         ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:25 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt

Hi Christian

> AMD APUs require the ACS flag to be set for the GPU integrated in the
> CPU when the IOMMU is enabled, otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

> Similar problems arise when you do this for a dedicated GPU, but we
> haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.

> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?

> And what exactly is the problem here?

We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings, and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains, so if you do need a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is that if new devices are added and the p2p topology needs to change, there would be no way to inform the VMs of any IOMMU group change.

Cheers

Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:25         ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:25 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

    
Hi Christian

> AMD APUs require the ACS flag to be set for the GPU integrated in the
> CPU when the IOMMU is enabled, otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

> Similar problems arise when you do this for a dedicated GPU, but we
> haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit. 
    
> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
    
> And what exactly is the problem here?
 
We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings, and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains, so if you do need a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is that if new devices are added and the p2p topology needs to change, there would be no way to inform the VMs of any IOMMU group change.

Cheers

Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:25         ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:25 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt

    
Hi Christian

> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

> Similar problems arise when you do this for dedicated GPU, but we 
> haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports, any P2P-enabled DMA will be routed to the IOMMU, which removes a lot of the benefit.
    
> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
    
> And what exactly is the problem here?
 
We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug), we had to be cognizant of the fact that ACS settings could change. Since there is currently no way to handle changing ACS settings, and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains, so if you do want a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all of the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And then the issue is that if new devices are added and the p2p topology needs to change, there would be no way to inform the VMs of any IOMMU group change.

Cheers

Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:25         ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:25 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Dan Williams, Christoph Hellwig

    
Hi Christian

> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

> Similar problems arise when you do this for dedicated GPU, but we 
> haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports, any P2P-enabled DMA will be routed to the IOMMU, which removes a lot of the benefit.
    
> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
    
> And what exactly is the problem here?
 
We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug), we had to be cognizant of the fact that ACS settings could change. Since there is currently no way to handle changing ACS settings, and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains, so if you do want a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all of the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And then the issue is that if new devices are added and the p2p topology needs to change, there would be no way to inform the VMs of any IOMMU group change.

Cheers

Stephen
    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:25         ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:25 UTC (permalink / raw)


    
Hi Christian

> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

> Similar problems arise when you do this for dedicated GPU, but we 
> haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports, any P2P-enabled DMA will be routed to the IOMMU, which removes a lot of the benefit.
    
> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
    
> And what exactly is the problem here?
 
We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug), we had to be cognizant of the fact that ACS settings could change. Since there is currently no way to handle changing ACS settings, and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains, so if you do want a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all of the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And then the issue is that if new devices are added and the p2p topology needs to change, there would be no way to inform the VMs of any IOMMU group change.

Cheers

Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-04-23 23:30   ` Logan Gunthorpe
                       ` (2 preceding siblings ...)
  (?)
@ 2018-05-08 14:31     ` Dan Williams
  -1 siblings, 0 replies; 460+ messages in thread
From: Dan Williams @ 2018-05-08 14:31 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	linux-block, Jérôme Glisse, Jason Gunthorpe,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König, Christoph Hellwig

On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.

>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>           transations must be between devices behind the same root port.
>           (Typically behind a network of PCIe switches).
>
> +         Enabling this option will also disable ACS on all ports behind
> +         any PCIe switch. This effectively puts all devices behind any
> +         switch heirarchy into the same IOMMU group. Which implies that
> +         individual devices behind any switch will not be able to be
> +         assigned to separate VMs because there is no isolation between
> +         them. Additionally, any malicious PCIe devices will be able to
> +         DMA to memory exposed by other EPs in the same domain as TLPs
> +         will not be checked by the IOMMU.
> +
>           If unsure, say N.

It seems unwieldy that this is a compile-time option and not a runtime
option. Can't we have a kernel command line option to opt in to this
behavior rather than require a wholly separate kernel image?

Why is this text added in a follow on patch and not the patch that
introduced the config option?

I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.
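
As a rough illustration of the run-time opt-in being suggested, a boot parameter could gate the ACS change instead of the Kconfig symbol alone. The parameter name and helpers below are hypothetical and are not part of the posted series:

#include <linux/init.h>
#include <linux/pci.h>

/* Hypothetical opt-in flag, e.g. booting with "pci_p2pdma_acs_override". */
static bool p2pdma_acs_override;

static int __init parse_p2pdma_acs_override(char *str)
{
        p2pdma_acs_override = true;
        return 0;
}
early_param("pci_p2pdma_acs_override", parse_p2pdma_acs_override);

/* The enumeration-time hook would then act only when the user opted in. */
static bool p2pdma_may_disable_acs(struct pci_dev *bridge)
{
        return p2pdma_acs_override && pci_is_bridge(bridge);
}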

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:31     ` Dan Williams
  0 siblings, 0 replies; 460+ messages in thread
From: Dan Williams @ 2018-05-08 14:31 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Linux Kernel Mailing List, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Stephen Bates, Christoph Hellwig,
	Jens Axboe, Keith Busch, Sagi Grimberg, Bjorn Helgaas,
	Jason Gunthorpe, Max Gurtovoy, Jérôme Glisse,
	Benjamin Herrenschmidt, Alex Williamson, Christian König

On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.

>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>           transations must be between devices behind the same root port.
>           (Typically behind a network of PCIe switches).
>
> +         Enabling this option will also disable ACS on all ports behind
> +         any PCIe switch. This effectively puts all devices behind any
> +         switch heirarchy into the same IOMMU group. Which implies that
> +         individual devices behind any switch will not be able to be
> +         assigned to separate VMs because there is no isolation between
> +         them. Additionally, any malicious PCIe devices will be able to
> +         DMA to memory exposed by other EPs in the same domain as TLPs
> +         will not be checked by the IOMMU.
> +
>           If unsure, say N.

It seems unwieldy that this is a compile-time option and not a runtime
option. Can't we have a kernel command line option to opt in to this
behavior rather than require a wholly separate kernel image?

Why is this text added in a follow on patch and not the patch that
introduced the config option?

I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:31     ` Dan Williams
  0 siblings, 0 replies; 460+ messages in thread
From: Dan Williams @ 2018-05-08 14:31 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci-u79uwXL29TY76Z2rM5mHXA,
	Linux Kernel Mailing List,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christian König, Christoph Hellwig

On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org> wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.

>
> Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>           transations must be between devices behind the same root port.
>           (Typically behind a network of PCIe switches).
>
> +         Enabling this option will also disable ACS on all ports behind
> +         any PCIe switch. This effectively puts all devices behind any
> +         switch heirarchy into the same IOMMU group. Which implies that
> +         individual devices behind any switch will not be able to be
> +         assigned to separate VMs because there is no isolation between
> +         them. Additionally, any malicious PCIe devices will be able to
> +         DMA to memory exposed by other EPs in the same domain as TLPs
> +         will not be checked by the IOMMU.
> +
>           If unsure, say N.

It seems unwieldy that this is a compile-time option and not a runtime
option. Can't we have a kernel command line option to opt in to this
behavior rather than require a wholly separate kernel image?

Why is this text added in a follow on patch and not the patch that
introduced the config option?

I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:31     ` Dan Williams
  0 siblings, 0 replies; 460+ messages in thread
From: Dan Williams @ 2018-05-08 14:31 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, Sagi Grimberg,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, Stephen Bates, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christian König, Christoph Hellwig

On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.

>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>           transations must be between devices behind the same root port.
>           (Typically behind a network of PCIe switches).
>
> +         Enabling this option will also disable ACS on all ports behind
> +         any PCIe switch. This effectively puts all devices behind any
> +         switch heirarchy into the same IOMMU group. Which implies that
> +         individual devices behind any switch will not be able to be
> +         assigned to separate VMs because there is no isolation between
> +         them. Additionally, any malicious PCIe devices will be able to
> +         DMA to memory exposed by other EPs in the same domain as TLPs
> +         will not be checked by the IOMMU.
> +
>           If unsure, say N.

It seems unwieldy that this is a compile-time option and not a runtime
option. Can't we have a kernel command line option to opt in to this
behavior rather than require a wholly separate kernel image?

Why is this text added in a follow on patch and not the patch that
introduced the config option?

I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.


^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:31     ` Dan Williams
  0 siblings, 0 replies; 460+ messages in thread
From: Dan Williams @ 2018-05-08 14:31 UTC (permalink / raw)


On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.

>
> Signed-off-by: Logan Gunthorpe <logang at deltatee.com>
> ---
>  drivers/pci/Kconfig        |  9 +++++++++
>  drivers/pci/p2pdma.c       | 45 ++++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.c          |  6 ++++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>           transations must be between devices behind the same root port.
>           (Typically behind a network of PCIe switches).
>
> +         Enabling this option will also disable ACS on all ports behind
> +         any PCIe switch. This effectively puts all devices behind any
> +         switch heirarchy into the same IOMMU group. Which implies that
> +         individual devices behind any switch will not be able to be
> +         assigned to separate VMs because there is no isolation between
> +         them. Additionally, any malicious PCIe devices will be able to
> +         DMA to memory exposed by other EPs in the same domain as TLPs
> +         will not be checked by the IOMMU.
> +
>           If unsure, say N.

It seems unwieldy that this is a compile-time option and not a runtime
option. Can't we have a kernel command line option to opt in to this
behavior rather than require a wholly separate kernel image?

Why is this text added in a follow on patch and not the patch that
introduced the config option?

I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 14:31     ` Dan Williams
                         ` (3 preceding siblings ...)
  (?)
@ 2018-05-08 14:44       ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:44 UTC (permalink / raw)
  To: Dan Williams, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Christian König, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Hi Dan

>    It seems unwieldy that this is a compile time option and not a runtime
>    option. Can't we have a kernel command line option to opt-in to this
>    behavior rather than require a wholly separate kernel image?
  
I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
  
> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?

Because the ACS section was added later in the series and this information is associated with that additional functionality.
    
> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.

By this you mean the address of either an RP, DSP, USP or MF EP below which we disable ACS? We could do that, but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to affecting a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
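
As a sketch of the "limit the scope to one switch" idea, assuming the bridge named on the command line has already been resolved to a struct pci_dev (scope_bridge below), a hypothetical check could restrict the ACS change to ports underneath it. This is illustrative only and not part of the series:

#include <linux/pci.h>

/* Return true if 'port' sits anywhere below 'scope_bridge'. */
static bool port_is_below_scope_bridge(struct pci_dev *port,
                                       struct pci_dev *scope_bridge)
{
        struct pci_dev *up;

        for (up = pci_upstream_bridge(port); up; up = pci_upstream_bridge(up))
                if (up == scope_bridge)
                        return true;

        return false;
}

As Stephen notes, though, this only shrinks the affected region; the IOMMU-grouping problem remains within it.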
    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:44       ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:44 UTC (permalink / raw)
  To: Dan Williams, Logan Gunthorpe
  Cc: Linux Kernel Mailing List, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson, Christian König

Hi Dan

>    It seems unwieldy that this is a compile time option and not a runtime
>    option. Can't we have a kernel command line option to opt-in to this
>    behavior rather than require a wholly separate kernel image?

I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.

> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?

Because the ACS section was added later in the series and this information is associated with that additional functionality.

> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.

By this you mean the address of either an RP, DSP, USP or MF EP below which we disable ACS? We could do that, but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to affecting a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:44       ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:44 UTC (permalink / raw)
  To: Dan Williams, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci-u79uwXL29TY76Z2rM5mHXA,
	Linux Kernel Mailing List,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Christian König, linux-block-u79uwXL29TY76Z2rM5mHXA,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Hi Dan

>    It seems unwieldy that this is a compile time option and not a runtime
>    option. Can't we have a kernel command line option to opt-in to this
>    behavior rather than require a wholly separate kernel image?
  
I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
  
> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?

Because the ACS section was added later in the series and this information is associated with that additional functionality.
    
> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.

By this you mean the address of either an RP, DSP, USP or MF EP below which we disable ACS? We could do that, but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to affecting a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:44       ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:44 UTC (permalink / raw)
  To: Dan Williams, Logan Gunthorpe
  Cc: Linux Kernel Mailing List, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson, Christian König

Hi Dan

>    It seems unwieldy that this is a compile time option and not a runtime
>    option. Can't we have a kernel command line option to opt-in to this
>    behavior rather than require a wholly separate kernel image?
  
I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
  
> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?

Because the ACS section was added later in the series and this information is associated with that additional functionality.
    
> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.

By this you mean the address of either an RP, DSP, USP or MF EP below which we disable ACS? We could do that, but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to affecting a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:44       ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:44 UTC (permalink / raw)
  To: Dan Williams, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, Sagi Grimberg,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, Christian König, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Hi Dan

>    It seems unwieldy that this is a compile time option and not a runtime
>    option. Can't we have a kernel command line option to opt-in to this
>    behavior rather than require a wholly separate kernel image?
  
I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
  
> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?

Because the ACS section was added later in the series and this information is associated with that additional functionality.
    
> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.

By this you mean the address of either an RP, DSP, USP or MF EP below which we disable ACS? We could do that, but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to affecting a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 14:44       ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 14:44 UTC (permalink / raw)


Hi Dan

>    It seems unwieldy that this is a compile time option and not a runtime
>    option. Can't we have a kernel command line option to opt-in to this
>    behavior rather than require a wholly separate kernel image?
  
I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
  
> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?

Because the ACS section was added later in the series and this information is associated with that additional functionality.
    
> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.

By this you mean the address of either an RP, DSP, USP or MF EP below which we disable ACS? We could do that, but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to affecting a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08  7:17       ` Christian König
  (?)
  (?)
@ 2018-05-08 16:27         ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 16:27 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 08/05/18 01:17 AM, Christian König wrote:
> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges
(previous versions were only on switches) this shouldn't be an issue for
integrated devices. We do not disable ACS flags globally.

> And what exactly is the problem here? I'm currently testing P2P with 
> GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
> perfectly fine.

In addition to Stephen's comments, seeing as we've established a general
need to avoid the root complex (until we have a whitelist, at least), we
must have ACS disabled along the path between the devices. Otherwise,
all TLPs will go through the root complex and, if there is no support
there, they will fail.

If the consensus is that we want a command line option, then so be it. But
we'll have to deny pretty much all P2P transactions unless the user
correctly disables ACS along the path using the command line option, and
it is really annoying for users of this functionality to work out
how to do that correctly.

Logan
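
A rough sketch of the check Logan is describing, assuming it is enough to look at the bridges directly above one device: if any of them still has the ACS redirect bits set, peer-to-peer TLPs from below it will be routed through the root complex. Illustrative only; the real distance and ACS handling in the series lives in drivers/pci/p2pdma.c.

#include <linux/pci.h>

static bool acs_redirects_above(struct pci_dev *dev)
{
        struct pci_dev *up;
        int pos;
        u16 ctrl;

        for (up = pci_upstream_bridge(dev); up; up = pci_upstream_bridge(up)) {
                pos = pci_find_ext_capability(up, PCI_EXT_CAP_ID_ACS);
                if (!pos)
                        continue;       /* no ACS capability here */

                pci_read_config_word(up, pos + PCI_ACS_CTRL, &ctrl);
                if (ctrl & (PCI_ACS_RR | PCI_ACS_CR))
                        return true;    /* TLPs will be redirected upstream */
        }

        return false;
}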

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 16:27         ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 16:27 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas, Alex Williamson
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt



On 08/05/18 01:17 AM, Christian König wrote:
> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges
(previous versions were only on switches) this shouldn't be an issue for
integrated devices. We do not disable ACS flags globally.

> And what exactly is the problem here? I'm currently testing P2P with 
> GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
> perfectly fine.

In addition to Stephen's comments, seeing as we've established a general
need to avoid the root complex (until we have a whitelist, at least), we
must have ACS disabled along the path between the devices. Otherwise,
all TLPs will go through the root complex and, if there is no support
there, they will fail.

If the consensus is that we want a command line option, then so be it. But
we'll have to deny pretty much all P2P transactions unless the user
correctly disables ACS along the path using the command line option, and
it is really annoying for users of this functionality to work out
how to do that correctly.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 16:27         ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 16:27 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 08/05/18 01:17 AM, Christian König wrote:
> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges
(previous versions were only on switches) this shouldn't be an issue for
integrated devices. We do not disable ACS flags globally.

> And what exactly is the problem here? I'm currently testing P2P with 
> GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
> perfectly fine.

In addition to Stephen's comments, seeing as we've established a general
need to avoid the root complex (until we have a whitelist, at least), we
must have ACS disabled along the path between the devices. Otherwise,
all TLPs will go through the root complex and, if there is no support
there, they will fail.

If the consensus is that we want a command line option, then so be it. But
we'll have to deny pretty much all P2P transactions unless the user
correctly disables ACS along the path using the command line option, and
it is really annoying for users of this functionality to work out
how to do that correctly.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 16:27         ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 16:27 UTC (permalink / raw)




On 08/05/18 01:17 AM, Christian König wrote:
> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges
(previous versions were only on switches) this shouldn't be an issue for
integrated devices. We do not disable ACS flags globally.

> And what exactly is the problem here? I'm currently testing P2P with 
> GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
> perfectly fine.

In addition to Stephen's comments, seeing as we've established a general
need to avoid the root complex (until we have a whitelist, at least), we
must have ACS disabled along the path between the devices. Otherwise,
all TLPs will go through the root complex and, if there is no support
there, they will fail.

If the consensus is that we want a command line option, then so be it. But
we'll have to deny pretty much all P2P transactions unless the user
correctly disables ACS along the path using the command line option, and
it is really annoying for users of this functionality to work out
how to do that correctly.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 14:25         ` Stephen  Bates
                             ` (2 preceding siblings ...)
  (?)
@ 2018-05-08 16:37           ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08 16:37 UTC (permalink / raw)
  To: Stephen Bates, Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Am 08.05.2018 um 16:25 schrieb Stephen Bates:
>      
> Hi Christian
>
>> AMD APUs mandatory need the ACS flag set for the GPU integrated in the
>> CPU when IOMMU is enabled or otherwise you will break SVM.
> OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

Well, I'm not an expert on this, but I think that is an incorrect
assumption you guys are making here.

At least in the default configuration, even with the IOMMU enabled, P2P
transactions do NOT necessarily travel up to the root complex for
translation.

It's already late here, but if nobody beats me to it, I'm going to dig up
the necessary documentation tomorrow.

Regards,
Christian.

>
>> Similar problems arise when you do this for dedicated GPU, but we
>> haven't upstreamed the support for this yet.
> Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.
>      
>> So that is a clear NAK from my side for the approach.
> Do you have an alternative? This is the approach we arrived it after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
>      
>> And what exactly is the problem here?
>   
> We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings and hence IOMMU groupings the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains so if you do not a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).
>
>> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.
> Yup that should work though again I have to ask are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not you are not getting all the performance benefit (due to IOMMU routing), if you are then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is if new devices are added and the p2p topology needed to change there would be no way to inform the VMs of any IOMMU group change.
>
> Cheers
>
> Stephen
>      
>


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 16:37           ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08 16:37 UTC (permalink / raw)
  To: Stephen Bates, Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt

Am 08.05.2018 um 16:25 schrieb Stephen Bates:
>      
> Hi Christian
>
>> AMD APUs mandatory need the ACS flag set for the GPU integrated in the
>> CPU when IOMMU is enabled or otherwise you will break SVM.
> OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

Well, I'm not an expert on this, but I think that is an incorrect
assumption you guys are making here.

At least in the default configuration, even with the IOMMU enabled, P2P
transactions do NOT necessarily travel up to the root complex for
translation.

It's already late here, but if nobody beats me to it, I'm going to dig up
the necessary documentation tomorrow.

Regards,
Christian.

>
>> Similar problems arise when you do this for dedicated GPU, but we
>> haven't upstreamed the support for this yet.
> Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.
>      
>> So that is a clear NAK from my side for the approach.
> Do you have an alternative? This is the approach we arrived it after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
>      
>> And what exactly is the problem here?
>   
> We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings and hence IOMMU groupings the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains so if you do not a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).
>
>> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.
> Yup that should work though again I have to ask are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not you are not getting all the performance benefit (due to IOMMU routing), if you are then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is if new devices are added and the p2p topology needed to change there would be no way to inform the VMs of any IOMMU group change.
>
> Cheers
>
> Stephen
>      
>

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 16:37           ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08 16:37 UTC (permalink / raw)
  To: Stephen Bates, Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Am 08.05.2018 um 16:25 schrieb Stephen Bates:
>      
> Hi Christian
>
>> AMD APUs mandatory need the ACS flag set for the GPU integrated in the
>> CPU when IOMMU is enabled or otherwise you will break SVM.
> OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

Well, I'm not an expert on this, but I think that is an incorrect
assumption you guys are making here.

At least in the default configuration, even with the IOMMU enabled, P2P
transactions do NOT necessarily travel up to the root complex for
translation.

It's already late here, but if nobody beats me to it, I'm going to dig up
the necessary documentation tomorrow.

Regards,
Christian.

>
>> Similar problems arise when you do this for dedicated GPU, but we
>> haven't upstreamed the support for this yet.
> Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.
>      
>> So that is a clear NAK from my side for the approach.
> Do you have an alternative? This is the approach we arrived it after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
>      
>> And what exactly is the problem here?
>   
> We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings and hence IOMMU groupings the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI d
 omains so if you do not a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).
>
>> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.
> Yup that should work though again I have to ask are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not you are not getting all the performance benefit (due to IOMMU routing), if you are then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is if new devices are added and the p2p topology needed to change there would be no way to inform the VMs of any IOMMU group change.
>
> Cheers
>
> Stephen
>      
>

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 16:37           ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08 16:37 UTC (permalink / raw)
  To: Stephen Bates, Bjorn Helgaas, Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Dan Williams, Christoph Hellwig

On 08.05.2018 at 16:25, Stephen Bates wrote:
>      
> Hi Christian
>
>> AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
>> CPU when the IOMMU is enabled, or otherwise you will break SVM.
> OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

Well I'm not an expert on this, but I think that is an incorrect 
assumption you guys use here.

At least in the default configuration, even with the IOMMU enabled, P2P
transactions do NOT necessarily travel up to the root complex for
translation.

It's already late here, but if nobody beats me I'm going to dig up the 
necessary documentation tomorrow.

Regards,
Christian.

>
>> Similar problems arise when you do this for dedicated GPU, but we
>> haven't upstreamed the support for this yet.
> Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.
>      
>> So that is a clear NAK from my side for the approach.
> Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?
>      
>> And what exactly is the problem here?
>   
> We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is currently no way to handle changing ACS settings, and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains, so if you do need a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).
>
>> I'm currently testing P2P with  GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.
> Yup, that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is that if new devices are added and the p2p topology needed to change, there would be no way to inform the VMs of any IOMMU group change.
>
> Cheers
>
> Stephen
>      
>


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 16:50           ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-08 16:50 UTC (permalink / raw)
  To: Logan Gunthorpe, Bjorn Helgaas, Alex Williamson
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt

On 08.05.2018 at 18:27, Logan Gunthorpe wrote:
>
> On 08/05/18 01:17 AM, Christian König wrote:
>> AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
>> CPU when the IOMMU is enabled, or otherwise you will break SVM.
> Well, given that the current set only disables ACS bits on bridges
> (previous versions were only on switches) this shouldn't be an issue for
> integrated devices. We do not disable ACS flags globally.

Ok, that is at least a step in the right direction. But I think we 
seriously need to test that for side effects.

>
>> And what exactly is the problem here? I'm currently testing P2P with
>> GPUs in different IOMMU domains and at least with AMD IOMMUs that works
>> perfectly fine.
> In addition to Stephen's comments, seeing we've established a general
> need to avoid the root complex (until we have a whitelist at least) we
> must have ACS disabled along the path between the devices. Otherwise,
> all TLPs will go through the root complex and if there is no support it
> will fail.

Well I'm not an expert on this, but if I'm not completely mistaken that 
is not correct.

E.g. transactions are initially sent to the root complex for
translation, that's for sure. But at least for AMD GPUs the root complex 
answers with the translated address which is then cached in the device.

So further transactions for the same address range then go directly to 
the destination.

What you don't want is device isolation, because in this case the root
complex handles the transactions itself. IIRC there were also
something like "force_isolation" and "nobypass" parameters for the IOMMU 
to control that behavior.

It's already late here, but I'm going to dig up the documentation for that
tomorrow and/or contact a hardware engineer involved in the ACS spec.

Regards,
Christian.

>
> If the consensus is we want a command line option, then so be it. But
> we'll have to deny pretty much all P2P transactions unless the user
> correctly disables ACS along the path using the command line option and
> this is really annoying for users of this functionality to understand
> how to do that correctly.
>
> Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread
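Christian's description of translations being answered by the root complex
and then cached in the device is the ATS model that comes up in the replies
below. A small sketch of how one could check, per device, whether ATS is
implemented and enabled, assuming the standard PCI capability helpers; the
function name is illustrative:

#include <linux/pci.h>

/* Return true if the device implements ATS and has it enabled, i.e. it
 * may cache IOMMU translations and issue pre-translated requests. */
static bool example_ats_enabled(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ATS);
	u16 ctrl;

	if (!pos)
		return false;

	pci_read_config_word(pdev, pos + PCI_ATS_CTRL, &ctrl);
	return ctrl & PCI_ATS_CTRL_ENABLE;
}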

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-08 16:57     ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 16:57 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Stephen Bates, Christoph Hellwig,
	Jens Axboe, Keith Busch, Sagi Grimberg, Bjorn Helgaas,
	Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt,
	Christian König

On Mon, 7 May 2018 18:23:46 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> > Hi Everyone,
> > 
> > Here's v4 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.17-rc2. A git repo
> > is here:
> > 
> > https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> > ...  
> 
> > Logan Gunthorpe (14):
> >   PCI/P2PDMA: Support peer-to-peer memory
> >   PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >   docs-rst: Add a new directory for PCI documentation
> >   PCI/P2PDMA: Add P2P DMA driver writer's documentation
> >   block: Introduce PCI P2P flags for request and request queue
> >   IB/core: Ensure we map P2P memory correctly in
> >     rdma_rw_ctx_[init|destroy]()
> >   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >   nvme-pci: Add support for P2P memory in requests
> >   nvme-pci: Add a quirk for a pseudo CMB
> >   nvmet: Introduce helper functions to allocate and free request SGLs
> >   nvmet-rdma: Use new SGL alloc/free helper for requests
> >   nvmet: Optionally use PCI P2P memory
> > 
> >  Documentation/ABI/testing/sysfs-bus-pci    |  25 +
> >  Documentation/PCI/index.rst                |  14 +
> >  Documentation/driver-api/index.rst         |   2 +-
> >  Documentation/driver-api/pci/index.rst     |  20 +
> >  Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
> >  Documentation/driver-api/{ => pci}/pci.rst |   0
> >  Documentation/index.rst                    |   3 +-
> >  block/blk-core.c                           |   3 +
> >  drivers/infiniband/core/rw.c               |  13 +-
> >  drivers/nvme/host/core.c                   |   4 +
> >  drivers/nvme/host/nvme.h                   |   8 +
> >  drivers/nvme/host/pci.c                    | 118 +++--
> >  drivers/nvme/target/configfs.c             |  67 +++
> >  drivers/nvme/target/core.c                 | 143 ++++-
> >  drivers/nvme/target/io-cmd.c               |   3 +
> >  drivers/nvme/target/nvmet.h                |  15 +
> >  drivers/nvme/target/rdma.c                 |  22 +-
> >  drivers/pci/Kconfig                        |  26 +
> >  drivers/pci/Makefile                       |   1 +
> >  drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
> >  drivers/pci/pci.c                          |   6 +
> >  include/linux/blk_types.h                  |  18 +-
> >  include/linux/blkdev.h                     |   3 +
> >  include/linux/memremap.h                   |  19 +
> >  include/linux/pci-p2pdma.h                 | 118 +++++
> >  include/linux/pci.h                        |   4 +
> >  26 files changed, 1579 insertions(+), 56 deletions(-)
> >  create mode 100644 Documentation/PCI/index.rst
> >  create mode 100644 Documentation/driver-api/pci/index.rst
> >  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> >  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> >  create mode 100644 drivers/pci/p2pdma.c
> >  create mode 100644 include/linux/pci-p2pdma.h  
> 
> How do you envison merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

AIUI from previously questioning this, the change is hidden behind a
build-time config option and only custom kernels or distros optimized
for this sort of support would enable that build option.  I'm more than
a little dubious though that we're not going to have a wave of distros
enabling this only to get user complaints that they can no longer make
effective use of their devices for assignment due to the resulting span
of the IOMMU groups; nor is there any sort of compromise: configure
the kernel for p2p or device assignment, not both.  Is this really such
a unique feature that distro users aren't going to be asking for both
features?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread
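The span Alex is concerned about shows up directly in the IOMMU group
assignment. A sketch of how a driver could report which group a function
landed in, assuming the in-kernel iommu_group_get()/iommu_group_id()
helpers (the function name is illustrative); the same information is
visible from userspace under /sys/kernel/iommu_groups/:

#include <linux/iommu.h>
#include <linux/pci.h>

/* Log the IOMMU group a PCI function belongs to; with the ACS redirect
 * bits cleared on the bridges above it, peers behind those bridges end
 * up sharing this group. */
static void example_report_iommu_group(struct pci_dev *pdev)
{
	struct iommu_group *group = iommu_group_get(&pdev->dev);

	if (!group) {
		dev_info(&pdev->dev, "no IOMMU group\n");
		return;
	}

	dev_info(&pdev->dev, "IOMMU group %d\n", iommu_group_id(group));
	iommu_group_put(group);
}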

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 19:13             ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 19:13 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas, Alex Williamson
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt



On 08/05/18 10:50 AM, Christian König wrote:
> E.g. transactions are initially send to the root complex for 
> translation, that's for sure. But at least for AMD GPUs the root complex 
> answers with the translated address which is then cached in the device.
> 
> So further transactions for the same address range then go directly to 
> the destination.

Sounds like you are referring to Address Translation Services (ATS).
This is quite separate from ACS and, to my knowledge, isn't widely
supported by switch hardware.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-05-08 19:14       ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 19:14 UTC (permalink / raw)
  To: Alex Williamson, Bjorn Helgaas
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt, Christian König



On 08/05/18 10:57 AM, Alex Williamson wrote:
> AIUI from previously questioning this, the change is hidden behind a
> build-time config option and only custom kernels or distros optimized
> for this sort of support would enable that build option.  I'm more than
> a little dubious though that we're not going to have a wave of distros
> enabling this only to get user complaints that they can no longer make
> effective use of their devices for assignment due to the resulting span
> of the IOMMU groups, nor is there any sort of compromise, configure
> the kernel for p2p or device assignment, not both.  Is this really such
> a unique feature that distro users aren't going to be asking for both
> features?  Thanks,

I think it is. But it sounds like the majority want this to be a command
line option. So we will look at doing that for v5.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread
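For reference, one way such a command-line control could be wired up; the
parameter name and semantics here are assumptions made purely for
illustration and are not what v5 actually implements:

#include <linux/init.h>

/* Hypothetical boot parameter listing the bridges on which the ACS
 * redirect bits may be cleared, instead of doing it unconditionally
 * for every bridge when CONFIG_PCI_P2PDMA is enabled. */
static char *example_acs_disable_list;

static int __init example_disable_acs_setup(char *str)
{
	/* e.g. booting with example_disable_acs=0000:03:00.0;0000:04:00.0 */
	example_acs_disable_list = str;
	return 1;
}
__setup("example_disable_acs=", example_disable_acs_setup);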

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 19:34               ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 19:34 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christian König, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block, Stephen Bates,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt

On Tue, 8 May 2018 13:13:40 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 10:50 AM, Christian König wrote:
> > E.g. transactions are initially send to the root complex for 
> > translation, that's for sure. But at least for AMD GPUs the root complex 
> > answers with the translated address which is then cached in the device.
> > 
> > So further transactions for the same address range then go directly to 
> > the destination.  
> 
> Sounds like you are referring to Address Translation Services (ATS).
> This is quite separate from ACS and, to my knowledge, isn't widely
> supported by switch hardware.

They are not so unrelated, see the ACS Direct Translated P2P
capability, which in fact must be implemented by switch downstream
ports implementing ACS and works specifically with ATS.  This appears to
be the way the PCI SIG would intend for P2P to occur within an IOMMU
managed topology, routing pre-translated DMA directly between peer
devices while requiring non-translated requests to bounce through the
IOMMU.  Really, what's the value of having an I/O virtual address space
provided by an IOMMU if we're going to allow physical DMA between
downstream devices, couldn't we just turn off the IOMMU altogether?  Of
course ATS is not without holes itself, basically in that we trust the
endpoint's implementation of ATS implicitly.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread
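A sketch of how the ACS Direct Translated P2P capability Alex refers to
could be probed on a downstream port, assuming the standard capability
helpers; the function name is illustrative:

#include <linux/pci.h>

/* True if the port advertises and has enabled Direct Translated P2P,
 * i.e. it may route pre-translated (ATS) requests directly to a peer
 * while still redirecting untranslated requests to the IOMMU. */
static bool example_acs_direct_translated(struct pci_dev *bridge)
{
	int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
	u16 cap, ctrl;

	if (!pos)
		return false;

	pci_read_config_word(bridge, pos + PCI_ACS_CAP, &cap);
	pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
	return (cap & PCI_ACS_DT) && (ctrl & PCI_ACS_DT);
}

As Logan notes in the reply below, advertising the bit only obliges the
port to forward requests that carry the AT type; it does not by itself
mean the topology can translate P2P requests through the IOMMU.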

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 19:34               ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 19:34 UTC (permalink / raw)


On Tue, 8 May 2018 13:13:40 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 10:50 AM, Christian K?nig wrote:
> > E.g. transactions are initially send to the root complex for 
> > translation, that's for sure. But at least for AMD GPUs the root complex 
> > answers with the translated address which is then cached in the device.
> > 
> > So further transactions for the same address range then go directly to 
> > the destination.  
> 
> Sounds like you are referring to Address Translation Services (ATS).
> This is quite separate from ACS and, to my knowledge, isn't widely
> supported by switch hardware.

They are not so unrelated, see the ACS Direct Translated P2P
capability, which in fact must be implemented by switch downstream
ports implementing ACS and works specifically with ATS.  This appears to
be the way the PCI SIG would intend for P2P to occur within an IOMMU
managed topology, routing pre-translated DMA directly between peer
devices while requiring non-translated requests to bounce through the
IOMMU.  Really, what's the value of having an I/O virtual address space
provided by an IOMMU if we're going to allow physical DMA between
downstream devices, couldn't we just turn off the IOMMU altogether?  Of
course ATS is not without holes itself, basically that we trust the
endpoint's implementation of ATS implicitly.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 19:34               ` Alex Williamson
  (?)
  (?)
@ 2018-05-08 19:45                 ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 19:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König



On 08/05/18 01:34 PM, Alex Williamson wrote:
> They are not so unrelated, see the ACS Direct Translated P2P
> capability, which in fact must be implemented by switch downstream
> ports implementing ACS and works specifically with ATS.  This appears to
> be the way the PCI SIG would intend for P2P to occur within an IOMMU
> managed topology, routing pre-translated DMA directly between peer
> devices while requiring non-translated requests to bounce through the
> IOMMU.  Really, what's the value of having an I/O virtual address space
> provided by an IOMMU if we're going to allow physical DMA between
> downstream devices, couldn't we just turn off the IOMMU altogether?  Of
> course ATS is not without holes itself, basically that we trust the
> endpoint's implementation of ATS implicitly.  Thanks,

I agree that this is what the SIG intends, but I don't think hardware
fully supports this methodology yet. The Direct Translated capability
just requires switches to forward packets that have the AT request type
set. It does not require them to do the translation or to support ATS
such that P2P requests can be translated by the IOMMU. I expect this is
so that a downstream device can implement ATS and not get messed up by
an upstream switch that doesn't support it.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread
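
For reference, the Direct Translated P2P bit lives in the ACS extended
capability, so whether a given downstream port advertises and enables it
can be checked with the standard config-space helpers. A minimal sketch
(not code from this series; the helper name is made up):

/*
 * Sketch only: does this bridge port advertise *and* enable ACS
 * Direct Translated P2P?  Assumes <linux/pci.h>.
 */
static bool port_has_acs_dt(struct pci_dev *pdev)
{
        int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
        u16 cap, ctrl;

        if (!pos)
                return false;   /* no ACS capability at all */

        pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
        pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);

        return (cap & PCI_ACS_DT) && (ctrl & PCI_ACS_DT);
}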

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 19:45                 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-05-08 20:13                   ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 20:13 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König

On Tue, 8 May 2018 13:45:50 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 01:34 PM, Alex Williamson wrote:
> > They are not so unrelated, see the ACS Direct Translated P2P
> > capability, which in fact must be implemented by switch downstream
> > ports implementing ACS and works specifically with ATS.  This appears to
> > be the way the PCI SIG would intend for P2P to occur within an IOMMU
> > managed topology, routing pre-translated DMA directly between peer
> > devices while requiring non-translated requests to bounce through the
> > IOMMU.  Really, what's the value of having an I/O virtual address space
> > provided by an IOMMU if we're going to allow physical DMA between
> > downstream devices, couldn't we just turn off the IOMMU altogether?  Of
> > course ATS is not without holes itself, basically that we trust the
> > endpoint's implementation of ATS implicitly.  Thanks,  
> 
> I agree that this is what the SIG intends, but I don't think hardware
> fully supports this methodology yet. The Direct Translated capability
> just requires switches to forward packets that have the AT request type
> set. It does not require them to do the translation or to support ATS
> such that P2P requests can be translated by the IOMMU. I expect this is
> so that a downstream device can implement ATS and not get messed up by
> an upstream switch that doesn't support it.

Well, I'm a bit confused, this patch series is specifically disabling
ACS on switches, but per the spec downstream switch ports implementing
ACS MUST implement direct translated P2P.  So it seems the only
potential gap here is the endpoint, which must support ATS or else
there's nothing for direct translated P2P to do.  The switch port plays
no part in the actual translation of the request, ATS on the endpoint
has already cached the translation and is now attempting to use it.
For the switch port, this only becomes a routing decision, the request
is already translated, therefore ACS RR and EC can be ignored to
perform "normal" (direct) routing, as if ACS were not present.  It would
be a shame to go to all the trouble of creating this no-ACS mode to find
out the target hardware supports ATS and should have simply used it, or
we should have disabled the IOMMU altogether, which leaves ACS disabled.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread
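
For context, clearing the P2P-related ACS bits on a bridge port is a
read-modify-write of the ACS control register; a rough sketch of that
operation (not the series' actual code) looks like:

/* Rough sketch: clear the ACS P2P redirect/egress bits on one bridge. */
static void clear_acs_p2p_bits(struct pci_dev *bridge)
{
        int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
        u16 ctrl;

        if (!pos)
                return;

        pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
        /* P2P Request Redirect, Completion Redirect, Egress Control */
        ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC);
        pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);
}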

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 20:13                   ` Alex Williamson
  (?)
  (?)
@ 2018-05-08 20:19                     ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 20:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König



On 08/05/18 02:13 PM, Alex Williamson wrote:
> Well, I'm a bit confused, this patch series is specifically disabling
> ACS on switches, but per the spec downstream switch ports implementing
> ACS MUST implement direct translated P2P.  So it seems the only
> potential gap here is the endpoint, which must support ATS or else
> there's nothing for direct translated P2P to do.  The switch port plays
> no part in the actual translation of the request, ATS on the endpoint
> has already cached the translation and is now attempting to use it.
> For the switch port, this only becomes a routing decision, the request
> is already translated, therefore ACS RR and EC can be ignored to
> perform "normal" (direct) routing, as if ACS were not present.  It would
> be a shame to go to all the trouble of creating this no-ACS mode to find
> out the target hardware supports ATS and should have simply used it, or
> we should have disabled the IOMMU altogether, which leaves ACS disabled.

Ah, ok, I didn't think it was the endpoint that had to implement ATS.
But in that case, for our application, we need NVMe cards and RDMA NICs
to all have ATS support and I expect that is just as unlikely. At least
none of the endpoints on my system support it. Maybe only certain GPUs
have this support.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread
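
Whether an endpoint advertises ATS at all is visible from the ATS
extended capability (lspci -vvv shows it as well); a one-line sketch
using the in-kernel helper:

/* Sketch: does this endpoint advertise the ATS extended capability? */
static bool endpoint_has_ats(struct pci_dev *pdev)
{
        return pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ATS) != 0;
}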

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 20:19                     ` Logan Gunthorpe
  (?)
  (?)
@ 2018-05-08 20:43                       ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 20:43 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König

On Tue, 8 May 2018 14:19:05 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.  
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

Yes, GPUs seem to be leading the pack in implementing ATS.  So now the
dumb question, why not simply turn off the IOMMU and thus ACS?  The
argument of using the IOMMU for security is rather diminished if we're
specifically enabling devices to poke one another directly and clearly
this isn't favorable for device assignment either.  Are there target
systems where this is not a simple kernel commandline option?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 20:43                       ` Alex Williamson
  (?)
  (?)
@ 2018-05-08 20:49                         ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 20:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König



On 08/05/18 02:43 PM, Alex Williamson wrote:
> Yes, GPUs seem to be leading the pack in implementing ATS.  So now the
> dumb question, why not simply turn off the IOMMU and thus ACS?  The
> argument of using the IOMMU for security is rather diminished if we're
> specifically enabling devices to poke one another directly and clearly
> this isn't favorable for device assignment either.  Are there target
> systems where this is not a simple kernel commandline option?  Thanks,

Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run
into some BIOSes that set the bits on boot (which is annoying).

I also don't expect people will respond well to making the IOMMU and P2P
exclusive. The IOMMU is often used for more than just security and on
many platforms it's enabled by default. I'd much rather allow IOMMU use
but have fewer isolation groups in much the same way as if you had PCI
bridges that didn't support ACS.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread
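
One way to see exactly which ACS bits the firmware left enabled at boot
is simply to dump each port's ACS control register; a small sketch,
assuming the same config-space helpers as above:

/* Sketch: log the ACS control bits a port came up with after boot. */
static void dump_acs_ctrl(struct pci_dev *pdev)
{
        int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
        u16 ctrl;

        if (!pos)
                return;

        pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
        dev_info(&pdev->dev,
                 "ACS ctrl: SV=%d RR=%d CR=%d UF=%d EC=%d DT=%d\n",
                 !!(ctrl & PCI_ACS_SV), !!(ctrl & PCI_ACS_RR),
                 !!(ctrl & PCI_ACS_CR), !!(ctrl & PCI_ACS_UF),
                 !!(ctrl & PCI_ACS_EC), !!(ctrl & PCI_ACS_DT));
}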

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 20:19                     ` Logan Gunthorpe
  (?)
  (?)
@ 2018-05-08 20:50                       ` Jerome Glisse
  -1 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-08 20:50 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König

On Tue, May 08, 2018 at 02:19:05PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

I think there is confusion here; Alex properly explained the scheme:
the PCIe device does an ATS request to the IOMMU, which returns a valid
translation for a virtual address. The device can then use that address
directly without going through the IOMMU for translation.

ATS is implemented by the IOMMU, not by the device (well, the device
implements the client side of it). Also, ATS is meaningless without
something like PASID as far as I know.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 20:50                       ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-08 20:50 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Alex Williamson, Christian König, Bjorn Helgaas,
	linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Tue, May 08, 2018 at 02:19:05PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

I think there is confusion here, Alex properly explained the scheme
PCIE-device do a ATS request to the IOMMU which returns a valid
translation for a virtual address. Device can then use that address
directly without going through IOMMU for translation.

ATS is implemented by the IOMMU not by the device (well device implement
the client side of it). Also ATS is meaningless without something like
PASID as far as i know.

Cheers,
J�r�me

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 20:50                       ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-08 20:50 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Alex Williamson, Christian König, Bjorn Helgaas,
	linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Tue, May 08, 2018 at 02:19:05PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

I think there is confusion here, Alex properly explained the scheme
PCIE-device do a ATS request to the IOMMU which returns a valid
translation for a virtual address. Device can then use that address
directly without going through IOMMU for translation.

ATS is implemented by the IOMMU not by the device (well device implement
the client side of it). Also ATS is meaningless without something like
PASID as far as i know.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 20:50                       ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-08 20:50 UTC (permalink / raw)


On Tue, May 08, 2018@02:19:05PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

I think there is confusion here, Alex properly explained the scheme
PCIE-device do a ATS request to the IOMMU which returns a valid
translation for a virtual address. Device can then use that address
directly without going through IOMMU for translation.

ATS is implemented by the IOMMU not by the device (well device implement
the client side of it). Also ATS is meaningless without something like
PASID as far as i know.

Cheers,
J?r?me

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 14:44       ` Stephen  Bates
                           ` (2 preceding siblings ...)
  (?)
@ 2018-05-08 21:04         ` Don Dutile
  -1 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-08 21:04 UTC (permalink / raw)
  To: Stephen Bates, Dan Williams, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Christian König, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On 05/08/2018 10:44 AM, Stephen  Bates wrote:
> Hi Dan
> 
>>     It seems unwieldy that this is a compile time option and not a runtime
>>     option. Can't we have a kernel command line option to opt-in to this
>>     behavior rather than require a wholly separate kernel image?
>    
> I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.
> 
It is clear either way, whether it is a kernel command-line option or a CONFIG option.
One does not have access to the kernel command-line w/o a few privs.
A CONFIG option prevents a distribution from having a default, locked-down kernel _and_ the ability to be 'unlocked' if the customer/site is 'secure' via other means.
A run/boot-time option is more flexible and achieves the best of both.
    
>> Why is this text added in a follow on patch and not the patch that
>>   introduced the config option?
> 
> Because the ACS section was added later in the series and this information is associated with that additional functionality.
>      
>> I'm also wondering if that command line option can take a 'bus device
>> function' address of a switch to limit the scope of where ACS is
>> disabled.
> 
Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
I recommend doing so via a sysfs method.

That way, the system can limit the 'unsecure' space btwn two devices, likely configured on a separate switch, from the rest of the still-secured/ACS-enabled PCIe tree.
PCIe is pt-to-pt, effectively; maybe one would have multiple nics/fabrics p2p to/from NVME, but one could look at it as a list of pairs (nic1<->nvme1; nic2<->nvme2; ....).
A pair-listing would be optimal, allowing the kernel to figure out the ACS path, and not making it endpoint-switch-switch...-switch-endpt error-entry prone.
Additionally, systems that can/prefer to do so via an RP's IOMMU (not optimal, but better than going all the way to/from memory, and a security/IOVA check is possible)
can modify the pt-to-pt ACS algorithm to accommodate that over time (e.g., cap bits, be they hw- or device-driver/extension/quirk-defined, for each bridge/RP in a PCI domain).

Kernels that never want to support P2P can be built without it enabled... the cmdline option is moot.
Kernels built with it on *still* need the cmdline option, to be blunt that the kernel is enabling a feature that could render the entire (IO sub)system insecure.

> By this you mean the address for either an RP, DSP, USP or MF EP below which we disable ACS? We could do that but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.
> 
As devices are added, they start in ACS-enabled, secured mode.
As the sysfs entry modifies p2p ability, the IOMMU group is modified as well.


btw -- IOMMU grouping is a host/HV control issue, not a VM control/knowledge issue.
        So I don't understand the comments about why VMs would need to know.
        -- Configure p2p _before_ assigning devices to VMs. ... iommu groups are checked at assignment time.
           -- So even on hot-add: separate iommu group, then enable p2p, it becomes the same IOMMU group, and then it can only be assigned to the same VM.
        -- VMs don't know IOMMUs & ACS are involved now, and won't later, even if devices are dynamically added/removed.

Is there a thread I need to read up on to explain/clear up the thoughts above?

> Stephen
> 
> [1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
>      
> 

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-08 16:57     ` Alex Williamson
  (?)
  (?)
@ 2018-05-08 21:25       ` Don Dutile
  -1 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-08 21:25 UTC (permalink / raw)
  To: Alex Williamson, Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Christian König, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On 05/08/2018 12:57 PM, Alex Williamson wrote:
> On Mon, 7 May 2018 18:23:46 -0500
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> 
>> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
>>> Hi Everyone,
>>>
>>> Here's v4 of our series to introduce P2P based copy offload to NVMe
>>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
>>> is here:
>>>
>>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
>>> ...
>>
>>> Logan Gunthorpe (14):
>>>    PCI/P2PDMA: Support peer-to-peer memory
>>>    PCI/P2PDMA: Add sysfs group to display p2pmem stats
>>>    PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>>>    PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>>>    docs-rst: Add a new directory for PCI documentation
>>>    PCI/P2PDMA: Add P2P DMA driver writer's documentation
>>>    block: Introduce PCI P2P flags for request and request queue
>>>    IB/core: Ensure we map P2P memory correctly in
>>>      rdma_rw_ctx_[init|destroy]()
>>>    nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>>>    nvme-pci: Add support for P2P memory in requests
>>>    nvme-pci: Add a quirk for a pseudo CMB
>>>    nvmet: Introduce helper functions to allocate and free request SGLs
>>>    nvmet-rdma: Use new SGL alloc/free helper for requests
>>>    nvmet: Optionally use PCI P2P memory
>>>
>>>   Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>>>   Documentation/PCI/index.rst                |  14 +
>>>   Documentation/driver-api/index.rst         |   2 +-
>>>   Documentation/driver-api/pci/index.rst     |  20 +
>>>   Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
>>>   Documentation/driver-api/{ => pci}/pci.rst |   0
>>>   Documentation/index.rst                    |   3 +-
>>>   block/blk-core.c                           |   3 +
>>>   drivers/infiniband/core/rw.c               |  13 +-
>>>   drivers/nvme/host/core.c                   |   4 +
>>>   drivers/nvme/host/nvme.h                   |   8 +
>>>   drivers/nvme/host/pci.c                    | 118 +++--
>>>   drivers/nvme/target/configfs.c             |  67 +++
>>>   drivers/nvme/target/core.c                 | 143 ++++-
>>>   drivers/nvme/target/io-cmd.c               |   3 +
>>>   drivers/nvme/target/nvmet.h                |  15 +
>>>   drivers/nvme/target/rdma.c                 |  22 +-
>>>   drivers/pci/Kconfig                        |  26 +
>>>   drivers/pci/Makefile                       |   1 +
>>>   drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
>>>   drivers/pci/pci.c                          |   6 +
>>>   include/linux/blk_types.h                  |  18 +-
>>>   include/linux/blkdev.h                     |   3 +
>>>   include/linux/memremap.h                   |  19 +
>>>   include/linux/pci-p2pdma.h                 | 118 +++++
>>>   include/linux/pci.h                        |   4 +
>>>   26 files changed, 1579 insertions(+), 56 deletions(-)
>>>   create mode 100644 Documentation/PCI/index.rst
>>>   create mode 100644 Documentation/driver-api/pci/index.rst
>>>   create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>>>   rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>>>   create mode 100644 drivers/pci/p2pdma.c
>>>   create mode 100644 include/linux/pci-p2pdma.h
>>
>> How do you envison merging this?  There's a big chunk in drivers/pci, but
>> really no opportunity for conflicts there, and there's significant stuff in
>> block and nvme that I don't really want to merge.
>>
>> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
>> merge it elsewhere?
> 
> AIUI from previously questioning this, the change is hidden behind a
> build-time config option and only custom kernels or distros optimized
> for this sort of support would enable that build option.  I'm more than
> a little dubious though that we're not going to have a wave of distros
> enabling this only to get user complaints that they can no longer make
> effective use of their devices for assignment due to the resulting span
> of the IOMMU groups, nor is there any sort of compromise, configure
> the kernel for p2p or device assignment, not both.  Is this really such
> a unique feature that distro users aren't going to be asking for both
> features?  Thanks,
> 
> Alex
At least half the cases presented to me by existing customers want it in a tunable kernel,
and tunable btwn two points, if the hw allows it to be 'contained' in that manner, which
a (layer of) switch(ing) provides.
To me, that means a kernel cmdline parameter to _enable_, and another sysfs (configfs? ... I'm not enough of a configfs aficionado to say which is best)
method to make two points p2p DMA capable.
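
As a rough sketch of the boot-time half of that (the parameter name and the
helper below are assumptions, not anything in this series), the p2pdma code
could gate its ACS changes on a command-line switch in addition to the Kconfig
option:

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    /* Hypothetical: boot with "pci_p2pdma.enable=1" to permit ACS changes. */
    static bool p2pdma_enable;
    module_param_named(enable, p2pdma_enable, bool, 0444);
    MODULE_PARM_DESC(enable,
                     "Allow disabling ACS P2P redirect for peer-to-peer DMA (default: off)");

    static bool pci_p2pdma_acs_allowed(void)
    {
            /* Require both the build-time option and the explicit opt-in. */
            return IS_ENABLED(CONFIG_PCI_P2PDMA) && p2pdma_enable;
    }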

Worst case, the whole system is one large IOMMU group (the current mindset of this static or run-time config option);
best case (over time, more hw), a secure set of the primary system with p2p-enabled sections that are deemed 'safe' or 'self-inflicting-unsecure',
the latter being the case of today's VM with an assigned device: it can scribble all over the VM, but no other VM and not the host/HV.


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 20:49                         ` Logan Gunthorpe
  (?)
  (?)
@ 2018-05-08 21:26                           ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 21:26 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König

On Tue, 8 May 2018 14:49:23 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 02:43 PM, Alex Williamson wrote:
> > Yes, GPUs seem to be leading the pack in implementing ATS.  So now the
> > dumb question, why not simply turn off the IOMMU and thus ACS?  The
> > argument of using the IOMMU for security is rather diminished if we're
> > specifically enabling devices to poke one another directly and clearly
> > this isn't favorable for device assignment either.  Are there target
> > systems where this is not a simple kernel commandline option?  Thanks,  
> 
> Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run
> into some BIOSes that set the bits on boot (which is annoying).

But it would be a much easier proposal to disable ACS when the IOMMU is
not enabled, ACS has no real purpose in that case.

> I also don't expect people will respond well to making the IOMMU and P2P
> exclusive. The IOMMU is often used for more than just security and on
> many platforms it's enabled by default. I'd much rather allow IOMMU use
> but have fewer isolation groups in much the same way as if you had PCI
> bridges that didn't support ACS.

The IOMMU and P2P are already not exclusive, we can bounce off the
IOMMU or make use of ATS as we've previously discussed.  We were
previously talking about a build time config option that you didn't
expect distros to use, so I don't think intervention for the user to
disable the IOMMU if it's enabled by default is a serious concern
either.

What you're trying to do is enable direct peer-to-peer for endpoints
which do not support ATS when the IOMMU is enabled, which is not
something that necessarily makes sense to me.  As I mentioned in a
previous reply, the IOMMU provides us with an I/O virtual address space
for devices, ACS is meant to fill the topology based gaps in that
virtual address space, making transactions follow IOMMU compliant
routing rules to avoid aliases between the IOVA and physical address
spaces.  But this series specifically wants to leave those gaps open
for direct P2P access.

So we compromise the P2P aspect of security, still protecting RAM, but
potentially only to the extent that a device cannot hop through or
interfere with other devices to do its bidding.  Device assignment is
mostly tossed out the window because not only are bigger groups more
difficult to deal with, the IOVA space is riddled with gaps, which is
not really a solved problem.  So that leaves avoiding bounce buffers as
the remaining IOMMU feature, but we're dealing with native express
devices and relatively high end devices that are probably installed in
modern systems, so that seems like a non-issue.

Are there other uses I'm forgetting?  We can enable interrupt remapping
separate from DMA translation, so we can exclude that one.  I'm still
not seeing why it's terribly undesirable to require devices to support
ATS if they want to do direct P2P with an IOMMU enabled.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 21:04         ` Don Dutile
                             ` (3 preceding siblings ...)
  (?)
@ 2018-05-08 21:27           ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 21:27 UTC (permalink / raw)
  To: Don Dutile, Dan Williams, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Christian König, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Hi Don

>Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>    That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
>    I recommend doing so via a sysfs method.

Yes, we looked at something like this in the past, but it hits the IOMMU grouping issue I discussed earlier today, which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs, we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.

    
>            So I don't understand the comments why VMs should need to know.

As I understand it, VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact that the OS running in the VM may not even support hot-plug of PCI devices.
    
> Is there a thread I need to read up to explain /clear-up the thoughts above?

If you search for p2pdma you should find the previous discussions. Thanks for the input!

Stephen
    
    


^ permalink raw reply	[flat|nested] 460+ messages in thread
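
(For reference, "removing/circumventing ACS" between two endpoints
ultimately comes down to clearing the P2P redirect bits in the ACS
control register of each bridge port on the path between them.  A
minimal per-bridge sketch, using only existing config-space accessors;
clear_acs_p2p_redirect() is a hypothetical name, and walking the actual
path between two endpoints is left out:)

#include <linux/pci.h>

/* Sketch: stop this bridge port redirecting peer-to-peer TLPs upstream. */
static void clear_acs_p2p_redirect(struct pci_dev *bridge)
{
	int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return;		/* no ACS capability, nothing to do */

	pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC);
	pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);
}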

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 20:50                       ` Jerome Glisse
                                           ` (3 preceding siblings ...)
  (?)
@ 2018-05-08 21:35                         ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 21:35 UTC (permalink / raw)
  To: Jerome Glisse, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

Hi Jerome

>    I think there is confusion here, Alex properly explained the scheme
>   PCIE-device do a ATS request to the IOMMU which returns a valid
>    translation for a virtual address. Device can then use that address
>    directly without going through IOMMU for translation.

This makes sense and to be honest I now understand ATS and its interaction with ACS a lot better than I did 24 hours ago ;-).

>    ATS is implemented by the IOMMU not by the device (well device implement
>    the client side of it). Also ATS is meaningless without something like
>    PASID as far as i know.
    
I think it's the client side that is important to us. Not many EPs support ATS today and it's not clear that many will in the future.  So assuming we want to do p2pdma between devices (some of) which do NOT support ATS, how best do we handle the ACS issue? Disabling the IOMMU seems a bit strong to me given that this impacts all the PCI domains in the system and not just the domain we wish to do P2P on.

Stephen


^ permalink raw reply	[flat|nested] 460+ messages in thread
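
(On the client side Stephen mentions: for endpoints that do expose the
capability, ATS is switched on through the existing PCI core helper,
typically by the IOMMU driver.  A hedged sketch only, not code from this
series; p2p_try_enable_ats() is a hypothetical wrapper:)

#include <linux/errno.h>
#include <linux/pci.h>
#include <linux/pci-ats.h>

/* Sketch: opt an ATS-capable endpoint into translated requests. */
static int p2p_try_enable_ats(struct pci_dev *pdev)
{
	if (!pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ATS))
		return -ENODEV;	/* no ATS: requests stay untranslated */

	/* Smallest translation unit the device may request (page shift). */
	return pci_enable_ats(pdev, PAGE_SHIFT);
}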

* Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-05-08 21:25       ` Don Dutile
  (?)
  (?)
@ 2018-05-08 21:40         ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 21:40 UTC (permalink / raw)
  To: Don Dutile
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Christian König,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Tue, 8 May 2018 17:25:24 -0400
Don Dutile <ddutile@redhat.com> wrote:

> On 05/08/2018 12:57 PM, Alex Williamson wrote:
> > On Mon, 7 May 2018 18:23:46 -0500
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> >   
> >> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:  
> >>> Hi Everyone,
> >>>
> >>> Here's v4 of our series to introduce P2P based copy offload to NVMe
> >>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> >>> is here:
> >>>
> >>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> >>> ...  
> >>  
> >>> Logan Gunthorpe (14):
> >>>    PCI/P2PDMA: Support peer-to-peer memory
> >>>    PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >>>    PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >>>    PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >>>    docs-rst: Add a new directory for PCI documentation
> >>>    PCI/P2PDMA: Add P2P DMA driver writer's documentation
> >>>    block: Introduce PCI P2P flags for request and request queue
> >>>    IB/core: Ensure we map P2P memory correctly in
> >>>      rdma_rw_ctx_[init|destroy]()
> >>>    nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >>>    nvme-pci: Add support for P2P memory in requests
> >>>    nvme-pci: Add a quirk for a pseudo CMB
> >>>    nvmet: Introduce helper functions to allocate and free request SGLs
> >>>    nvmet-rdma: Use new SGL alloc/free helper for requests
> >>>    nvmet: Optionally use PCI P2P memory
> >>>
> >>>   Documentation/ABI/testing/sysfs-bus-pci    |  25 +
> >>>   Documentation/PCI/index.rst                |  14 +
> >>>   Documentation/driver-api/index.rst         |   2 +-
> >>>   Documentation/driver-api/pci/index.rst     |  20 +
> >>>   Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
> >>>   Documentation/driver-api/{ => pci}/pci.rst |   0
> >>>   Documentation/index.rst                    |   3 +-
> >>>   block/blk-core.c                           |   3 +
> >>>   drivers/infiniband/core/rw.c               |  13 +-
> >>>   drivers/nvme/host/core.c                   |   4 +
> >>>   drivers/nvme/host/nvme.h                   |   8 +
> >>>   drivers/nvme/host/pci.c                    | 118 +++--
> >>>   drivers/nvme/target/configfs.c             |  67 +++
> >>>   drivers/nvme/target/core.c                 | 143 ++++-
> >>>   drivers/nvme/target/io-cmd.c               |   3 +
> >>>   drivers/nvme/target/nvmet.h                |  15 +
> >>>   drivers/nvme/target/rdma.c                 |  22 +-
> >>>   drivers/pci/Kconfig                        |  26 +
> >>>   drivers/pci/Makefile                       |   1 +
> >>>   drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
> >>>   drivers/pci/pci.c                          |   6 +
> >>>   include/linux/blk_types.h                  |  18 +-
> >>>   include/linux/blkdev.h                     |   3 +
> >>>   include/linux/memremap.h                   |  19 +
> >>>   include/linux/pci-p2pdma.h                 | 118 +++++
> >>>   include/linux/pci.h                        |   4 +
> >>>   26 files changed, 1579 insertions(+), 56 deletions(-)
> >>>   create mode 100644 Documentation/PCI/index.rst
> >>>   create mode 100644 Documentation/driver-api/pci/index.rst
> >>>   create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> >>>   rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> >>>   create mode 100644 drivers/pci/p2pdma.c
> >>>   create mode 100644 include/linux/pci-p2pdma.h  
> >>
> >> How do you envison merging this?  There's a big chunk in drivers/pci, but
> >> really no opportunity for conflicts there, and there's significant stuff in
> >> block and nvme that I don't really want to merge.
> >>
> >> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> >> merge it elsewhere?  
> > 
> > AIUI from previously questioning this, the change is hidden behind a
> > build-time config option and only custom kernels or distros optimized
> > for this sort of support would enable that build option.  I'm more than
> > a little dubious though that we're not going to have a wave of distros
> > enabling this only to get user complaints that they can no longer make
> > effective use of their devices for assignment due to the resulting span
> > of the IOMMU groups, nor is there any sort of compromise, configure
> > the kernel for p2p or device assignment, not both.  Is this really such
> > a unique feature that distro users aren't going to be asking for both
> > features?  Thanks,
> > 
> > Alex  
> At least 1/2 the cases presented to me by existing customers want it in a tunable kernel,
> and tunable btwn two points, if the hw allows it to be 'contained' in that manner, which
> a (layer of) switch(ing) provides.
> To me, that means a kernel cmdline parameter to _enable_, and another sysfs (configfs? ... i'm not a configfs afficionato to say which is best),
> method to make two points p2p dma capable.

That's not what's done here AIUI.  There are also some complications to
making IOMMU groups dynamic; for instance, could a downstream endpoint
already be in use by a userspace tool as ACS is being twiddled in
sysfs?  Probably the easiest solution would be that all devices
affected by the ACS change are soft-unplugged before and re-added after
the ACS change.  Note that "affected" is not necessarily only the
downstream devices if the downstream port at which we're playing with
ACS is part of a multifunction device.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread
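
(To make the "agent requests p2p between two points" idea concrete: the
request could be as simple as a write-only sysfs attribute on one
endpoint that names its intended peer.  A sketch only; the attribute
name and the p2pdma_allow_path() helper are hypothetical, and it
ignores the unplug/re-add problem Alex raises above:)

#include <linux/device.h>
#include <linux/pci.h>

/* Hypothetical: clear ACS redirects on every bridge between the two devices. */
int p2pdma_allow_path(struct pci_dev *a, struct pci_dev *b);

/* Sketch: echo "0000:03:00.0" > /sys/bus/pci/devices/.../p2p_peer */
static ssize_t p2p_peer_store(struct device *dev,
			      struct device_attribute *attr,
			      const char *buf, size_t count)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	struct pci_dev *peer;
	unsigned int domain, bus, slot, fn;
	int ret;

	if (sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &slot, &fn) != 4)
		return -EINVAL;

	peer = pci_get_domain_bus_and_slot(domain, bus, PCI_DEVFN(slot, fn));
	if (!peer)
		return -ENODEV;

	ret = p2pdma_allow_path(pdev, peer);
	pci_dev_put(peer);

	return ret ? ret : count;
}
static DEVICE_ATTR_WO(p2p_peer);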

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 21:26                           ` Alex Williamson
                                               ` (2 preceding siblings ...)
  (?)
@ 2018-05-08 21:42                             ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 21:42 UTC (permalink / raw)
  To: Alex Williamson, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

Hi Alex

>    But it would be a much easier proposal to disable ACS when the IOMMU is
>    not enabled, ACS has no real purpose in that case.

I guess one issue I have with this is that it disables IOMMU groups for all Root Ports and not just the one(s) we wish to do p2pdma on. 
    
>    The IOMMU and P2P are already not exclusive, we can bounce off the
>    IOMMU or make use of ATS as we've previously discussed.  We were
>    previously talking about a build time config option that you didn't
>    expect distros to use, so I don't think intervention for the user to
>    disable the IOMMU if it's enabled by default is a serious concern
>    either.

ATS definitely makes things more interesting for the cases where the EPs support it. However, I don't really have a handle on how common ATS support is going to be in the kinds of devices we have been focused on (mostly NVMe SSDs and RDMA NICs).
    
> What you're trying to do is enabled direct peer-to-peer for endpoints
>  which do not support ATS when the IOMMU is enabled, which is not
>  something that necessarily makes sense to me. 

As above, the advantage of leaving the IOMMU on is that it allows for both p2pdma PCI domains and IOMMU-grouping PCI domains in the same system. It is just that these domains will be separate from each other.

>  So that leaves avoiding bounce buffers as the remaining IOMMU feature

I agree with you here that the devices we will want to use for p2p will probably not require a bounce buffer and will support 64-bit DMA addressing.
    
> I'm still not seeing why it's terribly undesirable to require devices to support
> ATS if they want to do direct P2P with an IOMMU enabled.

I think the one reason is the use-case above: allowing IOMMU groupings in one domain and p2pdma in another.
    
Stephen
    


^ permalink raw reply	[flat|nested] 460+ messages in thread
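
(The practical fallout Stephen describes is visible in the IOMMU group a
device ends up in.  A small sketch for logging it from a driver, using
existing iommu/PCI helpers; purely illustrative and not part of the
series:)

#include <linux/iommu.h>
#include <linux/pci.h>

/* Sketch: report which IOMMU group a prospective P2P client belongs to. */
static void p2p_log_iommu_group(struct pci_dev *pdev)
{
	struct iommu_group *group = iommu_group_get(&pdev->dev);

	if (!group) {
		pci_info(pdev, "not in an IOMMU group (IOMMU disabled?)\n");
		return;
	}

	pci_info(pdev, "IOMMU group %d\n", iommu_group_id(group));
	iommu_group_put(group);
}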

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 21:42                             ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 21:42 UTC (permalink / raw)
  To: Alex Williamson, Logan Gunthorpe
  Cc: Christian König, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt

SGkgQWxleA0KDQo+ICAgIEJ1dCBpdCB3b3VsZCBiZSBhIG11Y2ggZWFzaWVyIHByb3Bvc2FsIHRv
IGRpc2FibGUgQUNTIHdoZW4gdGhlIElPTU1VIGlzDQo+ICAgIG5vdCBlbmFibGVkLCBBQ1MgaGFz
IG5vIHJlYWwgcHVycG9zZSBpbiB0aGF0IGNhc2UuDQoNCkkgZ3Vlc3Mgb25lIGlzc3VlIEkgaGF2
ZSB3aXRoIHRoaXMgaXMgdGhhdCBpdCBkaXNhYmxlcyBJT01NVSBncm91cHMgZm9yIGFsbCBSb290
IFBvcnRzIGFuZCBub3QganVzdCB0aGUgb25lKHMpIHdlIHdpc2ggdG8gZG8gcDJwZG1hIG9uLiAN
CiAgICANCj4gICAgVGhlIElPTU1VIGFuZCBQMlAgYXJlIGFscmVhZHkgbm90IGV4Y2x1c2l2ZSwg
d2UgY2FuIGJvdW5jZSBvZmYgdGhlDQo+ICAgIElPTU1VIG9yIG1ha2UgdXNlIG9mIEFUUyBhcyB3
ZSd2ZSBwcmV2aW91c2x5IGRpc2N1c3NlZC4gIFdlIHdlcmUNCj4gICAgcHJldmlvdXNseSB0YWxr
aW5nIGFib3V0IGEgYnVpbGQgdGltZSBjb25maWcgb3B0aW9uIHRoYXQgeW91IGRpZG4ndA0KPiAg
ICBleHBlY3QgZGlzdHJvcyB0byB1c2UsIHNvIEkgZG9uJ3QgdGhpbmsgaW50ZXJ2ZW50aW9uIGZv
ciB0aGUgdXNlciB0bw0KPiAgICBkaXNhYmxlIHRoZSBJT01NVSBpZiBpdCdzIGVuYWJsZWQgYnkg
ZGVmYXVsdCBpcyBhIHNlcmlvdXMgY29uY2Vybg0KPiAgICBlaXRoZXIuDQoNCkFUUyBkZWZpbml0
ZWx5IG1ha2VzIHRoaW5ncyBtb3JlIGludGVyZXN0aW5nIGZvciB0aGUgY2FzZXMgd2hlcmUgdGhl
IEVQcyBzdXBwb3J0IGl0LiBIb3dldmVyIEkgZG9uJ3QgcmVhbGx5IGhhdmUgYSBoYW5kbGUgb24g
aG93IGNvbW1vbiBBVFMgc3VwcG9ydCBpcyBnb2luZyB0byBiZSBpbiB0aGUga2luZHMgb2YgZGV2
aWNlcyB3ZSBoYXZlIGJlZW4gZm9jdXNlZCBvbiAoTlZNZSBTU0RzIGFuZCBSRE1BIE5JQ3MgbW9z
dGx5KS4gDQogICAgDQo+IFdoYXQgeW91J3JlIHRyeWluZyB0byBkbyBpcyBlbmFibGVkIGRpcmVj
dCBwZWVyLXRvLXBlZXIgZm9yIGVuZHBvaW50cw0KPiAgd2hpY2ggZG8gbm90IHN1cHBvcnQgQVRT
IHdoZW4gdGhlIElPTU1VIGlzIGVuYWJsZWQsIHdoaWNoIGlzIG5vdA0KPiAgc29tZXRoaW5nIHRo
YXQgbmVjZXNzYXJpbHkgbWFrZXMgc2Vuc2UgdG8gbWUuIA0KDQpBcyBhYm92ZSB0aGUgYWR2YW50
YWdlIG9mIGxlYXZpbmcgdGhlIElPTU1VIG9uIGlzIHRoYXQgaXQgYWxsb3dzIGZvciBib3RoIHAy
cGRtYSBQQ0kgZG9tYWlucyBhbmQgSU9NTVUgZ3JvdXBpbmdzIFBDSSBkb21haW5zIGluIHRoZSBz
YW1lIHN5c3RlbS4gSXQgaXMganVzdCB0aGF0IHRoZXNlIGRvbWFpbnMgd2lsbCBiZSBzZXBhcmF0
ZSB0byBlYWNoIG90aGVyLg0KDQo+ICBTbyB0aGF0IGxlYXZlcyBhdm9pZGluZyBib3VuY2UgYnVm
ZmVycyBhcyB0aGUgcmVtYWluaW5nIElPTU1VIGZlYXR1cmUNCg0KSSBhZ3JlZSB3aXRoIHlvdSBo
ZXJlIHRoYXQgdGhlIGRldmljZXMgd2Ugd2lsbCB3YW50IHRvIHVzZSBmb3IgcDJwIHdpbGwgcHJv
YmFibHkgbm90IHJlcXVpcmUgYSBib3VuY2UgYnVmZmVyIGFuZCB3aWxsIHN1cHBvcnQgNjQgYml0
IERNQSBhZGRyZXNzaW5nLg0KICAgIA0KPiBJJ20gc3RpbGwgbm90IHNlZWluZyB3aHkgaXQncyB0
ZXJyaWJseSB1bmRlc2lyYWJsZSB0byByZXF1aXJlIGRldmljZXMgdG8gc3VwcG9ydA0KPiBBVFMg
aWYgdGhleSB3YW50IHRvIGRvIGRpcmVjdCBQMlAgd2l0aCBhbiBJT01NVSBlbmFibGVkLg0KDQpJ
IHRoaW5rIHRoZSBvbmUgcmVhc29uIGlzIGZvciB0aGUgdXNlLWNhc2UgYWJvdmUuIEFsbG93aW5n
IElPTU1VIGdyb3VwaW5ncyBvbiBvbmUgZG9tYWluIGFuZCBwMnBkbWEgb24gYW5vdGhlciBkb21h
aW4uLi4uDQogICAgDQpTdGVwaGVuDQogICAgDQoNCg==

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 21:42                             ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 21:42 UTC (permalink / raw)
  To: Alex Williamson, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

Hi Alex

>    But it would be a much easier proposal to disable ACS when the IOMMU is
>    not enabled, ACS has no real purpose in that case.

I guess one issue I have with this is that it disables IOMMU groups for all Root Ports and not just the one(s) we wish to do p2pdma on. 
    
>    The IOMMU and P2P are already not exclusive, we can bounce off the
>    IOMMU or make use of ATS as we've previously discussed.  We were
>    previously talking about a build time config option that you didn't
>    expect distros to use, so I don't think intervention for the user to
>    disable the IOMMU if it's enabled by default is a serious concern
>    either.

ATS definitely makes things more interesting for the cases where the EPs support it. However I don't really have a handle on how common ATS support is going to be in the kinds of devices we have been focused on (NVMe SSDs and RDMA NICs mostly). 
    
> What you're trying to do is enabled direct peer-to-peer for endpoints
>  which do not support ATS when the IOMMU is enabled, which is not
>  something that necessarily makes sense to me. 

As above the advantage of leaving the IOMMU on is that it allows for both p2pdma PCI domains and IOMMU groupings PCI domains in the same system. It is just that these domains will be separate to each other.

>  So that leaves avoiding bounce buffers as the remaining IOMMU feature

I agree with you here that the devices we will want to use for p2p will probably not require a bounce buffer and will support 64 bit DMA addressing.
    
> I'm still not seeing why it's terribly undesirable to require devices to support
> ATS if they want to do direct P2P with an IOMMU enabled.

I think the one reason is for the use-case above. Allowing IOMMU groupings on one domain and p2pdma on another domain....
    
Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 21:42                             ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 21:42 UTC (permalink / raw)
  To: Alex Williamson, Logan Gunthorpe
  Cc: Christian König, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt

Hi Alex

>    But it would be a much easier proposal to disable ACS when the IOMMU is
>    not enabled, ACS has no real purpose in that case.

I guess one issue I have with this is that it disables IOMMU groups for all Root Ports and not just the one(s) we wish to do p2pdma on. 
    
>    The IOMMU and P2P are already not exclusive, we can bounce off the
>    IOMMU or make use of ATS as we've previously discussed.  We were
>    previously talking about a build time config option that you didn't
>    expect distros to use, so I don't think intervention for the user to
>    disable the IOMMU if it's enabled by default is a serious concern
>    either.

ATS definitely makes things more interesting for the cases where the EPs support it. However, I don't really have a handle on how common ATS support is going to be in the kinds of devices we have been focused on (NVMe SSDs and RDMA NICs mostly).
    
> What you're trying to do is enabled direct peer-to-peer for endpoints
>  which do not support ATS when the IOMMU is enabled, which is not
>  something that necessarily makes sense to me. 

As above, the advantage of leaving the IOMMU on is that it allows for both p2pdma PCI domains and IOMMU-grouping PCI domains in the same system. It is just that these domains will be separate from each other.

>  So that leaves avoiding bounce buffers as the remaining IOMMU feature

I agree with you here that the devices we will want to use for p2p will probably not require a bounce buffer and will support 64-bit DMA addressing.
    
> I'm still not seeing why it's terribly undesirable to require devices to support
> ATS if they want to do direct P2P with an IOMMU enabled.

I think the one reason is the use-case above: allowing IOMMU groupings on one domain and p2pdma on another domain.
    
Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 21:42                             ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 21:42 UTC (permalink / raw)


Hi Alex

>    But it would be a much easier proposal to disable ACS when the IOMMU is
>    not enabled, ACS has no real purpose in that case.

I guess one issue I have with this is that it disables IOMMU groups for all Root Ports and not just the one(s) we wish to do p2pdma on. 
    
>    The IOMMU and P2P are already not exclusive, we can bounce off the
>    IOMMU or make use of ATS as we've previously discussed.  We were
>    previously talking about a build time config option that you didn't
>    expect distros to use, so I don't think intervention for the user to
>    disable the IOMMU if it's enabled by default is a serious concern
>    either.

ATS definitely makes things more interesting for the cases where the EPs support it. However, I don't really have a handle on how common ATS support is going to be in the kinds of devices we have been focused on (NVMe SSDs and RDMA NICs mostly).
    
> What you're trying to do is enabled direct peer-to-peer for endpoints
>  which do not support ATS when the IOMMU is enabled, which is not
>  something that necessarily makes sense to me. 

As above, the advantage of leaving the IOMMU on is that it allows for both p2pdma PCI domains and IOMMU-grouping PCI domains in the same system. It is just that these domains will be separate from each other.

>  So that leaves avoiding bounce buffers as the remaining IOMMU feature

I agree with you here that the devices we will want to use for p2p will probably not require a bounce buffer and will support 64-bit DMA addressing.
    
> I'm still not seeing why it's terribly undesirable to require devices to support
> ATS if they want to do direct P2P with an IOMMU enabled.

I think the one reason is the use-case above: allowing IOMMU groupings on one domain and p2pdma on another domain.
    
Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 21:42                             ` Stephen  Bates
  (?)
  (?)
@ 2018-05-08 22:03                               ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 22:03 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

On Tue, 8 May 2018 21:42:27 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:

> Hi Alex
> 
> >    But it would be a much easier proposal to disable ACS when the
> > IOMMU is not enabled, ACS has no real purpose in that case.  
> 
> I guess one issue I have with this is that it disables IOMMU groups
> for all Root Ports and not just the one(s) we wish to do p2pdma on. 

But as I understand this series, we're not really targeting specific
sets of devices either.  It's more of a shotgun approach where we
disable ACS on downstream switch ports and hope that we get the right
set of devices, but with the indecisiveness that we might later
white-list select root ports to further increase the blast radius.

> >    The IOMMU and P2P are already not exclusive, we can bounce off
> > the IOMMU or make use of ATS as we've previously discussed.  We were
> >    previously talking about a build time config option that you
> > didn't expect distros to use, so I don't think intervention for the
> > user to disable the IOMMU if it's enabled by default is a serious
> > concern either.  
> 
> ATS definitely makes things more interesting for the cases where the
> EPs support it. However I don't really have a handle on how common
> ATS support is going to be in the kinds of devices we have been
> focused on (NVMe SSDs and RDMA NICs mostly). 
>
> > What you're trying to do is enabled direct peer-to-peer for
> > endpoints which do not support ATS when the IOMMU is enabled, which
> > is not something that necessarily makes sense to me.   
> 
> As above the advantage of leaving the IOMMU on is that it allows for
> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
> system. It is just that these domains will be separate to each other.

That argument would make sense if we had the ability to select specific
sets of devices, but that's not the case here, right?  With the shotgun
approach, we're clearly favoring one at the expense of the other, and
it's not clear why we don't simply force the needle all the way in that
direction so that the results are at least predictable.

> >  So that leaves avoiding bounce buffers as the remaining IOMMU
> > feature  
> 
> I agree with you here that the devices we will want to use for p2p
> will probably not require a bounce buffer and will support 64 bit DMA
> addressing. 
>
> > I'm still not seeing why it's terribly undesirable to require
> > devices to support ATS if they want to do direct P2P with an IOMMU
> > enabled.  
> 
> I think the one reason is for the use-case above. Allowing IOMMU
> groupings on one domain and p2pdma on another domain.... 

If IOMMU grouping implies device assignment (because nobody else uses
it to the same extent as device assignment), then the build-time option
falls to pieces; we need a single kernel that can do both.  I think we
need to get more clever about allowing the user to specify exactly at
which points in the topology they want to disable isolation.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:03                               ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 22:03 UTC (permalink / raw)
  To: Stephen  Bates
  Cc: Logan Gunthorpe, Christian König, Bjorn Helgaas,
	linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt

On Tue, 8 May 2018 21:42:27 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:

> Hi Alex
> 
> >    But it would be a much easier proposal to disable ACS when the
> > IOMMU is not enabled, ACS has no real purpose in that case.  
> 
> I guess one issue I have with this is that it disables IOMMU groups
> for all Root Ports and not just the one(s) we wish to do p2pdma on. 

But as I understand this series, we're not really targeting specific
sets of devices either.  It's more of a shotgun approach where we
disable ACS on downstream switch ports and hope that we get the right
set of devices, but with the indecisiveness that we might later
white-list select root ports to further increase the blast radius.

> >    The IOMMU and P2P are already not exclusive, we can bounce off
> > the IOMMU or make use of ATS as we've previously discussed.  We were
> >    previously talking about a build time config option that you
> > didn't expect distros to use, so I don't think intervention for the
> > user to disable the IOMMU if it's enabled by default is a serious
> > concern either.  
> 
> ATS definitely makes things more interesting for the cases where the
> EPs support it. However I don't really have a handle on how common
> ATS support is going to be in the kinds of devices we have been
> focused on (NVMe SSDs and RDMA NICs mostly). 
>
> > What you're trying to do is enabled direct peer-to-peer for
> > endpoints which do not support ATS when the IOMMU is enabled, which
> > is not something that necessarily makes sense to me.   
> 
> As above the advantage of leaving the IOMMU on is that it allows for
> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
> system. It is just that these domains will be separate to each other.

That argument would make sense if we had the ability to select specific
sets of devices, but that's not the case here, right?  With the shotgun
approach, we're clearly favoring one at the expense of the other, and
it's not clear why we don't simply force the needle all the way in that
direction so that the results are at least predictable.

> >  So that leaves avoiding bounce buffers as the remaining IOMMU
> > feature  
> 
> I agree with you here that the devices we will want to use for p2p
> will probably not require a bounce buffer and will support 64 bit DMA
> addressing. 
>
> > I'm still not seeing why it's terribly undesirable to require
> > devices to support ATS if they want to do direct P2P with an IOMMU
> > enabled.  
> 
> I think the one reason is for the use-case above. Allowing IOMMU
> groupings on one domain and p2pdma on another domain.... 

If IOMMU grouping implies device assignment (because nobody else uses
it to the same extent as device assignment), then the build-time option
falls to pieces; we need a single kernel that can do both.  I think we
need to get more clever about allowing the user to specify exactly at
which points in the topology they want to disable isolation.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:03                               ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 22:03 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

On Tue, 8 May 2018 21:42:27 +0000
"Stephen  Bates" <sbates-pv7U853sEMVWk0Htik3J/w@public.gmane.org> wrote:

> Hi Alex
> 
> >    But it would be a much easier proposal to disable ACS when the
> > IOMMU is not enabled, ACS has no real purpose in that case.  
> 
> I guess one issue I have with this is that it disables IOMMU groups
> for all Root Ports and not just the one(s) we wish to do p2pdma on. 

But as I understand this series, we're not really targeting specific
sets of devices either.  It's more of a shotgun approach where we
disable ACS on downstream switch ports and hope that we get the right
set of devices, but with the indecisiveness that we might later
white-list select root ports to further increase the blast radius.

> >    The IOMMU and P2P are already not exclusive, we can bounce off
> > the IOMMU or make use of ATS as we've previously discussed.  We were
> >    previously talking about a build time config option that you
> > didn't expect distros to use, so I don't think intervention for the
> > user to disable the IOMMU if it's enabled by default is a serious
> > concern either.  
> 
> ATS definitely makes things more interesting for the cases where the
> EPs support it. However I don't really have a handle on how common
> ATS support is going to be in the kinds of devices we have been
> focused on (NVMe SSDs and RDMA NICs mostly). 
>
> > What you're trying to do is enabled direct peer-to-peer for
> > endpoints which do not support ATS when the IOMMU is enabled, which
> > is not something that necessarily makes sense to me.   
> 
> As above the advantage of leaving the IOMMU on is that it allows for
> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
> system. It is just that these domains will be separate to each other.

That argument would make sense if we had the ability to select specific
sets of devices, but that's not the case here, right?  With the shotgun
approach, we're clearly favoring one at the expense of the other, and
it's not clear why we don't simply force the needle all the way in that
direction so that the results are at least predictable.

> >  So that leaves avoiding bounce buffers as the remaining IOMMU
> > feature  
> 
> I agree with you here that the devices we will want to use for p2p
> will probably not require a bounce buffer and will support 64 bit DMA
> addressing. 
>
> > I'm still not seeing why it's terribly undesirable to require
> > devices to support ATS if they want to do direct P2P with an IOMMU
> > enabled.  
> 
> I think the one reason is for the use-case above. Allowing IOMMU
> groupings on one domain and p2pdma on another domain.... 

If IOMMU grouping implies device assignment (because nobody else uses
it to the same extent as device assignment), then the build-time option
falls to pieces; we need a single kernel that can do both.  I think we
need to get more clever about allowing the user to specify exactly at
which points in the topology they want to disable isolation.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:03                               ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 22:03 UTC (permalink / raw)


On Tue, 8 May 2018 21:42:27 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:

> Hi Alex
> 
> >    But it would be a much easier proposal to disable ACS when the
> > IOMMU is not enabled, ACS has no real purpose in that case.  
> 
> I guess one issue I have with this is that it disables IOMMU groups
> for all Root Ports and not just the one(s) we wish to do p2pdma on. 

But as I understand this series, we're not really targeting specific
sets of devices either.  It's more of a shotgun approach where we
disable ACS on downstream switch ports and hope that we get the right
set of devices, but with the indecisiveness that we might later
white-list select root ports to further increase the blast radius.

> >    The IOMMU and P2P are already not exclusive, we can bounce off
> > the IOMMU or make use of ATS as we've previously discussed.  We were
> >    previously talking about a build time config option that you
> > didn't expect distros to use, so I don't think intervention for the
> > user to disable the IOMMU if it's enabled by default is a serious
> > concern either.  
> 
> ATS definitely makes things more interesting for the cases where the
> EPs support it. However I don't really have a handle on how common
> ATS support is going to be in the kinds of devices we have been
> focused on (NVMe SSDs and RDMA NICs mostly). 
>
> > What you're trying to do is enabled direct peer-to-peer for
> > endpoints which do not support ATS when the IOMMU is enabled, which
> > is not something that necessarily makes sense to me.   
> 
> As above the advantage of leaving the IOMMU on is that it allows for
> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
> system. It is just that these domains will be separate to each other.

That argument would make sense if we had the ability to select specific
sets of devices, but that's not the case here, right?  With the shotgun
approach, we're clearly favoring one at the expense of the other, and
it's not clear why we don't simply force the needle all the way in that
direction so that the results are at least predictable.

> >  So that leaves avoiding bounce buffers as the remaining IOMMU
> > feature  
> 
> I agree with you here that the devices we will want to use for p2p
> will probably not require a bounce buffer and will support 64 bit DMA
> addressing. 
>
> > I'm still not seeing why it's terribly undesirable to require
> > devices to support ATS if they want to do direct P2P with an IOMMU
> > enabled.  
> 
> I think the one reason is for the use-case above. Allowing IOMMU
> groupings on one domain and p2pdma on another domain.... 

If IOMMU grouping implies device assignment (because nobody else uses
it to the same extent as device assignment), then the build-time option
falls to pieces; we need a single kernel that can do both.  I think we
need to get more clever about allowing the user to specify exactly at
which points in the topology they want to disable isolation.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 22:03                               ` Alex Williamson
  (?)
  (?)
@ 2018-05-08 22:10                                 ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 22:10 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König



On 08/05/18 04:03 PM, Alex Williamson wrote:
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,


Yeah, so based on the discussion I'm leaning toward just having a
command line option that takes a list of BDFs and disables ACS for them.
(Essentially as Dan has suggested.) This avoids the shotgun.

Then, the pci_p2pdma_distance command needs to check that ACS is
disabled for all bridges between the two devices. If this is not the
case, it returns -1. Future work can check if the EP has ATS support, in
which case it has to check for the ACS direct translated bit.

A user then needs to disable the IOMMU and/or add the command
line option to disable ACS for the specific downstream ports in the PCI
hierarchy. This means the IOMMU groups will be less granular, but
presumably the person adding the command line argument understands this.

We may also want to do some work so that there are informative dmesg
messages indicating which BDFs need to be specified on the command line,
so it's not so difficult for the user to figure out.

Logan
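
(For illustration only, not part of the posted series: a minimal sketch
of the kind of path check described above, assuming the existing
pci_upstream_bridge() and config-space helpers. The function name, and
any boot-parameter syntax such as "pci_p2pdma_acs=<BDF>,<BDF>", are
hypothetical.)

	#include <linux/pci.h>

	/*
	 * Sketch: walk from a client device toward the root and report
	 * whether any intervening bridge still redirects peer-to-peer
	 * requests.  A pci_p2pdma_distance()-style caller would treat a
	 * non-zero result as "no P2P possible".
	 */
	static int p2pdma_acs_path_clear(struct pci_dev *client)
	{
		struct pci_dev *bridge;

		for (bridge = pci_upstream_bridge(client); bridge;
		     bridge = pci_upstream_bridge(bridge)) {
			int pos;
			u16 ctrl;

			pos = pci_find_ext_capability(bridge,
						      PCI_EXT_CAP_ID_ACS);
			if (!pos)
				continue;	/* no ACS capability here */

			pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);

			/*
			 * Request/Completion Redirect bounce P2P TLPs up to
			 * the root complex.  ATS-capable endpoints would
			 * additionally need the Direct Translated P2P bit
			 * (PCI_ACS_DT) taken into account.
			 */
			if (ctrl & (PCI_ACS_RR | PCI_ACS_CR))
				return -1;
		}

		return 0;
	}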

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:10                                 ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 22:10 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Christian König, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt



On 08/05/18 04:03 PM, Alex Williamson wrote:
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,


Yeah, so based on the discussion I'm leaning toward just having a
command line option that takes a list of BDFs and disables ACS for them.
(Essentially as Dan has suggested.) This avoids the shotgun.

Then, the pci_p2pdma_distance command needs to check that ACS is
disabled for all bridges between the two devices. If this is not the
case, it returns -1. Future work can check if the EP has ATS support, in
which case it has to check for the ACS direct translated bit.

A user then needs to disable the IOMMU and/or add the command
line option to disable ACS for the specific downstream ports in the PCI
hierarchy. This means the IOMMU groups will be less granular, but
presumably the person adding the command line argument understands this.

We may also want to do some work so that there are informative dmesg
messages indicating which BDFs need to be specified on the command line,
so it's not so difficult for the user to figure out.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:10                                 ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 22:10 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König



On 08/05/18 04:03 PM, Alex Williamson wrote:
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,


Yeah, so based on the discussion I'm leaning toward just having a
command line option that takes a list of BDFs and disables ACS for them.
(Essentially as Dan has suggested.) This avoids the shotgun.

Then, the pci_p2pdma_distance command needs to check that ACS is
disabled for all bridges between the two devices. If this is not the
case, it returns -1. Future work can check if the EP has ATS support, in
which case it has to check for the ACS direct translated bit.

A user then needs to disable the IOMMU and/or add the command
line option to disable ACS for the specific downstream ports in the PCI
hierarchy. This means the IOMMU groups will be less granular, but
presumably the person adding the command line argument understands this.

We may also want to do some work so that there are informative dmesg
messages indicating which BDFs need to be specified on the command line,
so it's not so difficult for the user to figure out.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:10                                 ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 22:10 UTC (permalink / raw)




On 08/05/18 04:03 PM, Alex Williamson wrote:
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,


Yeah, so based on the discussion I'm leaning toward just having a
command line option that takes a list of BDFs and disables ACS for them.
(Essentially as Dan has suggested.) This avoids the shotgun.

Then, the pci_p2pdma_distance command needs to check that ACS is
disabled for all bridges between the two devices. If this is not the
case, it returns -1. Future work can check if the EP has ATS support, in
which case it has to check for the ACS direct translated bit.

A user then needs to disable the IOMMU and/or add the command
line option to disable ACS for the specific downstream ports in the PCI
hierarchy. This means the IOMMU groups will be less granular, but
presumably the person adding the command line argument understands this.

We may also want to do some work so that there are informative dmesg
messages indicating which BDFs need to be specified on the command line,
so it's not so difficult for the user to figure out.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 22:03                               ` Alex Williamson
  (?)
  (?)
@ 2018-05-08 22:21                                 ` Don Dutile
  -1 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-08 22:21 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

On 05/08/2018 06:03 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 21:42:27 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:
> 
>> Hi Alex
>>
>>>     But it would be a much easier proposal to disable ACS when the
>>> IOMMU is not enabled, ACS has no real purpose in that case.
>>
>> I guess one issue I have with this is that it disables IOMMU groups
>> for all Root Ports and not just the one(s) we wish to do p2pdma on.
> 
> But as I understand this series, we're not really targeting specific
> sets of devices either.  It's more of a shotgun approach that we
> disable ACS on downstream switch ports and hope that we get the right
> set of devices, but with the indecisiveness that we might later
> white-list select root ports to further increase the blast radius.
> 
>>>     The IOMMU and P2P are already not exclusive, we can bounce off
>>> the IOMMU or make use of ATS as we've previously discussed.  We were
>>>     previously talking about a build time config option that you
>>> didn't expect distros to use, so I don't think intervention for the
>>> user to disable the IOMMU if it's enabled by default is a serious
>>> concern either.
>>
>> ATS definitely makes things more interesting for the cases where the
>> EPs support it. However I don't really have a handle on how common
>> ATS support is going to be in the kinds of devices we have been
>> focused on (NVMe SSDs and RDMA NICs mostly).
>>
>>> What you're trying to do is enabled direct peer-to-peer for
>>> endpoints which do not support ATS when the IOMMU is enabled, which
>>> is not something that necessarily makes sense to me.
>>
>> As above the advantage of leaving the IOMMU on is that it allows for
>> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
>> system. It is just that these domains will be separate to each other.
> 
> That argument makes sense if we had the ability to select specific sets
> of devices, but that's not the case here, right?  With the shotgun
> approach, we're clearly favoring one at the expense of the other and
> it's not clear why we don't simple force the needle all the way in that
> direction such that the results are at least predictable.
> 
>>>   So that leaves avoiding bounce buffers as the remaining IOMMU
>>> feature
>>
>> I agree with you here that the devices we will want to use for p2p
>> will probably not require a bounce buffer and will support 64 bit DMA
>> addressing.
>>
>>> I'm still not seeing why it's terribly undesirable to require
>>> devices to support ATS if they want to do direct P2P with an IOMMU
>>> enabled.
>>
>> I think the one reason is for the use-case above. Allowing IOMMU
>> groupings on one domain and p2pdma on another domain....
> 
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,
> 
> Alex

+1/ack

RDMA VFs lend themselves to NVMe-oF with device assignment.... we need a way to
put NVMe 'resources' into an assignable/manageable object for 'IOMMU grouping',
which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.




^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:21                                 ` Don Dutile
  0 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-08 22:21 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Logan Gunthorpe, Christian König, Bjorn Helgaas,
	linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt

On 05/08/2018 06:03 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 21:42:27 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:
> 
>> Hi Alex
>>
>>>     But it would be a much easier proposal to disable ACS when the
>>> IOMMU is not enabled, ACS has no real purpose in that case.
>>
>> I guess one issue I have with this is that it disables IOMMU groups
>> for all Root Ports and not just the one(s) we wish to do p2pdma on.
> 
> But as I understand this series, we're not really targeting specific
> sets of devices either.  It's more of a shotgun approach that we
> disable ACS on downstream switch ports and hope that we get the right
> set of devices, but with the indecisiveness that we might later
> white-list select root ports to further increase the blast radius.
> 
>>>     The IOMMU and P2P are already not exclusive, we can bounce off
>>> the IOMMU or make use of ATS as we've previously discussed.  We were
>>>     previously talking about a build time config option that you
>>> didn't expect distros to use, so I don't think intervention for the
>>> user to disable the IOMMU if it's enabled by default is a serious
>>> concern either.
>>
>> ATS definitely makes things more interesting for the cases where the
>> EPs support it. However I don't really have a handle on how common
>> ATS support is going to be in the kinds of devices we have been
>> focused on (NVMe SSDs and RDMA NICs mostly).
>>
>>> What you're trying to do is enabled direct peer-to-peer for
>>> endpoints which do not support ATS when the IOMMU is enabled, which
>>> is not something that necessarily makes sense to me.
>>
>> As above the advantage of leaving the IOMMU on is that it allows for
>> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
>> system. It is just that these domains will be separate to each other.
> 
> That argument makes sense if we had the ability to select specific sets
> of devices, but that's not the case here, right?  With the shotgun
> approach, we're clearly favoring one at the expense of the other and
> it's not clear why we don't simple force the needle all the way in that
> direction such that the results are at least predictable.
> 
>>>   So that leaves avoiding bounce buffers as the remaining IOMMU
>>> feature
>>
>> I agree with you here that the devices we will want to use for p2p
>> will probably not require a bounce buffer and will support 64 bit DMA
>> addressing.
>>
>>> I'm still not seeing why it's terribly undesirable to require
>>> devices to support ATS if they want to do direct P2P with an IOMMU
>>> enabled.
>>
>> I think the one reason is for the use-case above. Allowing IOMMU
>> groupings on one domain and p2pdma on another domain....
> 
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,
> 
> Alex

+1/ack

RDMA VFs lend themselves to NVMe-oF with device assignment.... we need a way to
put NVMe 'resources' into an assignable/manageable object for 'IOMMU grouping',
which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.



^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:21                                 ` Don Dutile
  0 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-08 22:21 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

On 05/08/2018 06:03 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 21:42:27 +0000
> "Stephen  Bates" <sbates-pv7U853sEMVWk0Htik3J/w@public.gmane.org> wrote:
> 
>> Hi Alex
>>
>>>     But it would be a much easier proposal to disable ACS when the
>>> IOMMU is not enabled, ACS has no real purpose in that case.
>>
>> I guess one issue I have with this is that it disables IOMMU groups
>> for all Root Ports and not just the one(s) we wish to do p2pdma on.
> 
> But as I understand this series, we're not really targeting specific
> sets of devices either.  It's more of a shotgun approach that we
> disable ACS on downstream switch ports and hope that we get the right
> set of devices, but with the indecisiveness that we might later
> white-list select root ports to further increase the blast radius.
> 
>>>     The IOMMU and P2P are already not exclusive, we can bounce off
>>> the IOMMU or make use of ATS as we've previously discussed.  We were
>>>     previously talking about a build time config option that you
>>> didn't expect distros to use, so I don't think intervention for the
>>> user to disable the IOMMU if it's enabled by default is a serious
>>> concern either.
>>
>> ATS definitely makes things more interesting for the cases where the
>> EPs support it. However I don't really have a handle on how common
>> ATS support is going to be in the kinds of devices we have been
>> focused on (NVMe SSDs and RDMA NICs mostly).
>>
>>> What you're trying to do is enabled direct peer-to-peer for
>>> endpoints which do not support ATS when the IOMMU is enabled, which
>>> is not something that necessarily makes sense to me.
>>
>> As above the advantage of leaving the IOMMU on is that it allows for
>> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
>> system. It is just that these domains will be separate to each other.
> 
> That argument makes sense if we had the ability to select specific sets
> of devices, but that's not the case here, right?  With the shotgun
> approach, we're clearly favoring one at the expense of the other and
> it's not clear why we don't simple force the needle all the way in that
> direction such that the results are at least predictable.
> 
>>>   So that leaves avoiding bounce buffers as the remaining IOMMU
>>> feature
>>
>> I agree with you here that the devices we will want to use for p2p
>> will probably not require a bounce buffer and will support 64 bit DMA
>> addressing.
>>
>>> I'm still not seeing why it's terribly undesirable to require
>>> devices to support ATS if they want to do direct P2P with an IOMMU
>>> enabled.
>>
>> I think the one reason is for the use-case above. Allowing IOMMU
>> groupings on one domain and p2pdma on another domain....
> 
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,
> 
> Alex

+1/ack

RDMA VFs lend themselves to NVMe-oF with device assignment.... we need a way to
put NVMe 'resources' into an assignable/manageable object for 'IOMMU grouping',
which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.



^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:21                                 ` Don Dutile
  0 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-08 22:21 UTC (permalink / raw)


On 05/08/2018 06:03 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 21:42:27 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:
> 
>> Hi Alex
>>
>>>     But it would be a much easier proposal to disable ACS when the
>>> IOMMU is not enabled, ACS has no real purpose in that case.
>>
>> I guess one issue I have with this is that it disables IOMMU groups
>> for all Root Ports and not just the one(s) we wish to do p2pdma on.
> 
> But as I understand this series, we're not really targeting specific
> sets of devices either.  It's more of a shotgun approach that we
> disable ACS on downstream switch ports and hope that we get the right
> set of devices, but with the indecisiveness that we might later
> white-list select root ports to further increase the blast radius.
> 
>>>     The IOMMU and P2P are already not exclusive, we can bounce off
>>> the IOMMU or make use of ATS as we've previously discussed.  We were
>>>     previously talking about a build time config option that you
>>> didn't expect distros to use, so I don't think intervention for the
>>> user to disable the IOMMU if it's enabled by default is a serious
>>> concern either.
>>
>> ATS definitely makes things more interesting for the cases where the
>> EPs support it. However I don't really have a handle on how common
>> ATS support is going to be in the kinds of devices we have been
>> focused on (NVMe SSDs and RDMA NICs mostly).
>>
>>> What you're trying to do is enabled direct peer-to-peer for
>>> endpoints which do not support ATS when the IOMMU is enabled, which
>>> is not something that necessarily makes sense to me.
>>
>> As above the advantage of leaving the IOMMU on is that it allows for
>> both p2pdma PCI domains and IOMMU groupings PCI domains in the same
>> system. It is just that these domains will be separate to each other.
> 
> That argument makes sense if we had the ability to select specific sets
> of devices, but that's not the case here, right?  With the shotgun
> approach, we're clearly favoring one at the expense of the other and
> it's not clear why we don't simple force the needle all the way in that
> direction such that the results are at least predictable.
> 
>>>   So that leaves avoiding bounce buffers as the remaining IOMMU
>>> feature
>>
>> I agree with you here that the devices we will want to use for p2p
>> will probably not require a bounce buffer and will support 64 bit DMA
>> addressing.
>>
>>> I'm still not seeing why it's terribly undesirable to require
>>> devices to support ATS if they want to do direct P2P with an IOMMU
>>> enabled.
>>
>> I think the one reason is for the use-case above. Allowing IOMMU
>> groupings on one domain and p2pdma on another domain....
> 
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,
> 
> Alex

+1/ack

RDMA VFs lend themselves to NVMe-oF with device assignment.... we need a way to
put NVMe 'resources' into an assignable/manageable object for 'IOMMU grouping',
which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.



^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 22:10                                 ` Logan Gunthorpe
                                                     ` (2 preceding siblings ...)
  (?)
@ 2018-05-08 22:25                                   ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 22:25 UTC (permalink / raw)
  To: Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

>    Yeah, so based on the discussion I'm leaning toward just having a
>    command line option that takes a list of BDFs and disables ACS for them.
>    (Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - Before we go do this can you provide input on the approach? I don't want to re-spin only to find we are still not converging on the ACS issue....

Thanks

Stephen
    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:25                                   ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 22:25 UTC (permalink / raw)
  To: Logan Gunthorpe, Alex Williamson
  Cc: Christian König, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt

>    Yeah, so based on the discussion I'm leaning toward just having a
>    command line option that takes a list of BDFs and disables ACS for them.
>    (Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - Before we go do this can you provide input on the approach? I don't want to re-spin only to find we are still not converging on the ACS issue....

Thanks

Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:25                                   ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 22:25 UTC (permalink / raw)
  To: Logan Gunthorpe, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

>    Yeah, so based on the discussion I'm leaning toward just having a
>    command line option that takes a list of BDFs and disables ACS for them.
>    (Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - Before we go do this can you provide input on the approach? I don't want to re-spin only to find we are still not converging on the ACS issue....

Thanks

Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:25                                   ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 22:25 UTC (permalink / raw)
  To: Logan Gunthorpe, Alex Williamson
  Cc: Christian König, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt

>    Yeah, so based on the discussion I'm leaning toward just having a
>    command line option that takes a list of BDFs and disables ACS for them.
>    (Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - Before we go do this can you provide input on the approach? I don't want to re-spin only to find we are still not converging on the ACS issue....

Thanks

Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:25                                   ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-08 22:25 UTC (permalink / raw)


>    Yeah, so based on the discussion I'm leaning toward just having a
>    command line option that takes a list of BDFs and disables ACS for them.
>    (Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - Before we go do this can you provide input on the approach? I don't want to re-spin only to find we are still not converging on the ACS issue....

Thanks

Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 22:10                                 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-05-08 22:32                                   ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 22:32 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König

On Tue, 8 May 2018 16:10:19 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 04:03 PM, Alex Williamson wrote:
> > If IOMMU grouping implies device assignment (because nobody else uses
> > it to the same extent as device assignment) then the build-time option
> > falls to pieces, we need a single kernel that can do both.  I think we
> > need to get more clever about allowing the user to specify exactly at
> > which points in the topology they want to disable isolation.  Thanks,  
> 
> 
> Yeah, so based on the discussion I'm leaning toward just having a
> command line option that takes a list of BDFs and disables ACS for them.
> (Essentially as Dan has suggested.) This avoids the shotgun.
> 
> Then, the pci_p2pdma_distance command needs to check that ACS is
> disabled for all bridges between the two devices. If this is not the
> case, it returns -1. Future work can check if the EP has ATS support, in
> which case it has to check for the ACS direct translated bit.
> 
> A user then needs to either disable the IOMMU and/or add the command
> line option to disable ACS for the specific downstream ports in the PCI
> hierarchy. This means the IOMMU groups will be less granular but
> presumably the person adding the command line argument understands this.
> 
> We may also want to do some work so that there's informative dmesgs on
> which BDFs need to be specified on the command line so it's not so
> difficult for the user to figure out.

I'd advise caution with a user-supplied BDF approach; we have no
guaranteed persistence for a device's PCI address.  Adding a device
might renumber the buses, replacing a device with one that consumes
more/less bus numbers can renumber the buses, motherboard firmware
updates could renumber the buses, pci=assign-buses can renumber the
buses, etc.  This is why the VT-d spec makes use of device paths when
describing PCI hierarchies; firmware can't know what bus number will be
assigned to a device, but it does know the base bus number and the path
of devfns needed to get to it.  I don't know how we come up with an
option that's easy enough for a user to understand, but reasonably
robust against hardware changes.  Thanks,

Alex
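
(Purely illustrative sketch of the device-path idea mentioned above, in
the spirit of the VT-d scope entries: anchor the option at a segment and
root bus, then follow a chain of devfns rather than relying on an
absolute BDF.  The function name and parameters are made up for
illustration; pci_find_bus(), pci_get_slot() and ->subordinate are the
existing kernel interfaces assumed here.)

	#include <linux/pci.h>

	/*
	 * Resolve "segment:bus" plus a devfn chain to a pci_dev.  Returns
	 * a referenced device on success; the caller does pci_dev_put().
	 */
	static struct pci_dev *resolve_devfn_path(int segment, int busnr,
						  const u8 *devfns, int depth)
	{
		struct pci_bus *bus = pci_find_bus(segment, busnr);
		struct pci_dev *pdev = NULL;
		int i;

		if (!bus || depth <= 0)
			return NULL;

		for (i = 0; i < depth; i++) {
			pdev = pci_get_slot(bus, devfns[i]);
			if (!pdev)
				return NULL;
			if (i == depth - 1)
				break;

			bus = pdev->subordinate;	/* descend through the bridge */
			pci_dev_put(pdev);
			if (!bus)
				return NULL;
		}

		return pdev;
	}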

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:32                                   ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 22:32 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Stephen Bates, Christian König, Bjorn Helgaas, linux-kernel,
	linux-pci, linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt

On Tue, 8 May 2018 16:10:19 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 04:03 PM, Alex Williamson wrote:
> > If IOMMU grouping implies device assignment (because nobody else uses
> > it to the same extent as device assignment) then the build-time option
> > falls to pieces, we need a single kernel that can do both.  I think we
> > need to get more clever about allowing the user to specify exactly at
> > which points in the topology they want to disable isolation.  Thanks,  
> 
> 
> Yeah, so based on the discussion I'm leaning toward just having a
> command line option that takes a list of BDFs and disables ACS for them.
> (Essentially as Dan has suggested.) This avoids the shotgun.
> 
> Then, the pci_p2pdma_distance command needs to check that ACS is
> disabled for all bridges between the two devices. If this is not the
> case, it returns -1. Future work can check if the EP has ATS support, in
> which case it has to check for the ACS direct translated bit.
> 
> A user then needs to either disable the IOMMU and/or add the command
> line option to disable ACS for the specific downstream ports in the PCI
> hierarchy. This means the IOMMU groups will be less granular but
> presumably the person adding the command line argument understands this.
> 
> We may also want to do some work so that there's informative dmesgs on
> which BDFs need to be specified on the command line so it's not so
> difficult for the user to figure out.

I'd advise caution with a user-supplied BDF approach; we have no
guaranteed persistence for a device's PCI address.  Adding a device
might renumber the buses, replacing a device with one that consumes
more/less bus numbers can renumber the buses, motherboard firmware
updates could renumber the buses, pci=assign-buses can renumber the
buses, etc.  This is why the VT-d spec makes use of device paths when
describing PCI hierarchies; firmware can't know what bus number will be
assigned to a device, but it does know the base bus number and the path
of devfns needed to get to it.  I don't know how we come up with an
option that's easy enough for a user to understand, but reasonably
robust against hardware changes.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:32                                   ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 22:32 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König

On Tue, 8 May 2018 16:10:19 -0600
Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org> wrote:

> On 08/05/18 04:03 PM, Alex Williamson wrote:
> > If IOMMU grouping implies device assignment (because nobody else uses
> > it to the same extent as device assignment) then the build-time option
> > falls to pieces, we need a single kernel that can do both.  I think we
> > need to get more clever about allowing the user to specify exactly at
> > which points in the topology they want to disable isolation.  Thanks,  
> 
> 
> Yeah, so based on the discussion I'm leaning toward just having a
> command line option that takes a list of BDFs and disables ACS for them.
> (Essentially as Dan has suggested.) This avoids the shotgun.
> 
> Then, the pci_p2pdma_distance command needs to check that ACS is
> disabled for all bridges between the two devices. If this is not the
> case, it returns -1. Future work can check if the EP has ATS support, in
> which case it has to check for the ACS direct translated bit.
> 
> A user then needs to either disable the IOMMU and/or add the command
> line option to disable ACS for the specific downstream ports in the PCI
> hierarchy. This means the IOMMU groups will be less granular but
> presumably the person adding the command line argument understands this.
> 
> We may also want to do some work so that there's informative dmesgs on
> which BDFs need to be specified on the command line so it's not so
> difficult for the user to figure out.

I'd advise caution with a user-supplied BDF approach; we have no
guaranteed persistence for a device's PCI address.  Adding a device
might renumber the buses, replacing a device with one that consumes
more/less bus numbers can renumber the buses, motherboard firmware
updates could renumber the buses, pci=assign-buses can renumber the
buses, etc.  This is why the VT-d spec makes use of device paths when
describing PCI hierarchies, firmware can't know what bus number will be
assigned to a device, but it does know the base bus number and the path
of devfns needed to get to it.  I don't know how we come up with an
option that's easy enough for a user to understand, but reasonably
robust against hardware changes.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 22:32                                   ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 22:32 UTC (permalink / raw)


On Tue, 8 May 2018 16:10:19 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 08/05/18 04:03 PM, Alex Williamson wrote:
> > If IOMMU grouping implies device assignment (because nobody else uses
> > it to the same extent as device assignment) then the build-time option
> > falls to pieces, we need a single kernel that can do both.  I think we
> > need to get more clever about allowing the user to specify exactly at
> > which points in the topology they want to disable isolation.  Thanks,  
> 
> 
> Yeah, so based on the discussion I'm leaning toward just having a
> command line option that takes a list of BDFs and disables ACS for them.
> (Essentially as Dan has suggested.) This avoids the shotgun.
> 
> Then, the pci_p2pdma_distance command needs to check that ACS is
> disabled for all bridges between the two devices. If this is not the
> case, it returns -1. Future work can check if the EP has ATS support, in
> which case it has to check for the ACS direct translated bit.
> 
> A user then needs to either disable the IOMMU and/or add the command
> line option to disable ACS for the specific downstream ports in the PCI
> hierarchy. This means the IOMMU groups will be less granular but
> presumably the person adding the command line argument understands this.
> 
> We may also want to do some work so that there's informative dmesgs on
> which BDFs need to be specified on the command line so it's not so
> difficult for the user to figure out.

I'd advise caution with a user supplied BDF approach, we have no
guaranteed persistence for a device's PCI address.  Adding a device
might renumber the buses, replacing a device with one that consumes
more/less bus numbers can renumber the buses, motherboard firmware
updates could renumber the buses, pci=assign-buses can renumber the
buses, etc.  This is why the VT-d spec makes use of device paths when
describing PCI hierarchies, firmware can't know what bus number will be
assigned to a device, but it does know the base bus number and the path
of devfns needed to get to it.  I don't know how we come up with an
option that's easy enough for a user to understand, but reasonably
robust against hardware changes.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 22:32                                   ` Alex Williamson
@ 2018-05-08 23:00                                     ` Dan Williams
  -1 siblings, 0 replies; 460+ messages in thread
From: Dan Williams @ 2018-05-08 23:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	Christoph Hellwig, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

On Tue, May 8, 2018 at 3:32 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Tue, 8 May 2018 16:10:19 -0600
> Logan Gunthorpe <logang@deltatee.com> wrote:
>
>> On 08/05/18 04:03 PM, Alex Williamson wrote:
>> > If IOMMU grouping implies device assignment (because nobody else uses
>> > it to the same extent as device assignment) then the build-time option
>> > falls to pieces, we need a single kernel that can do both.  I think we
>> > need to get more clever about allowing the user to specify exactly at
>> > which points in the topology they want to disable isolation.  Thanks,
>>
>>
>> Yeah, so based on the discussion I'm leaning toward just having a
>> command line option that takes a list of BDFs and disables ACS for them.
>> (Essentially as Dan has suggested.) This avoids the shotgun.
>>
>> Then, the pci_p2pdma_distance command needs to check that ACS is
>> disabled for all bridges between the two devices. If this is not the
>> case, it returns -1. Future work can check if the EP has ATS support, in
>> which case it has to check for the ACS direct translated bit.
>>
>> A user then needs to either disable the IOMMU and/or add the command
>> line option to disable ACS for the specific downstream ports in the PCI
>> hierarchy. This means the IOMMU groups will be less granular but
>> presumably the person adding the command line argument understands this.
>>
>> We may also want to do some work so that there's informative dmesgs on
>> which BDFs need to be specified on the command line so it's not so
>> difficult for the user to figure out.
>
> I'd advise caution with a user supplied BDF approach, we have no
> guaranteed persistence for a device's PCI address.  Adding a device
> might renumber the buses, replacing a device with one that consumes
> more/less bus numbers can renumber the buses, motherboard firmware
> updates could renumber the buses, pci=assign-buses can renumber the
> buses, etc.  This is why the VT-d spec makes use of device paths when
> describing PCI hierarchies, firmware can't know what bus number will be
> assigned to a device, but it does know the base bus number and the path
> of devfns needed to get to it.  I don't know how we come up with an
> option that's easy enough for a user to understand, but reasonably
> robust against hardware changes.  Thanks,

True, but at the same time this feature is for "users with custom
hardware designed for purpose"; I assume they would be willing to take
on the bus renumbering risk. It's already the case that
/sys/bus/pci/drivers/<x>/bind takes a BDF, which is why it seemed
reasonable to make a similar interface for the command line. Ideally we
could later get something into ACPI or other platform firmware to
arrange for bridges to disable ACS by default if we see p2p becoming a
common off-the-shelf feature, i.e. a BIOS switch to enable p2p in a
given PCIe sub-domain.
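
As a rough illustration of the command line side, a ';'-separated list
of BDFs could be collected at boot and consulted later when ACS is
configured.  The parameter name and storage below are made up for
illustration and are not part of this series:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/string.h>

#define MAX_ACS_DISABLE 8

static struct {
        u16 domain;
        u8 bus;
        u8 devfn;
} acs_disable_list[MAX_ACS_DISABLE];
static int acs_disable_count;

/* e.g. pci_p2p_acs_disable=0000:03:00.0;0000:03:04.0 */
static int __init parse_acs_disable(char *str)
{
        unsigned int dom, bus, dev, fn;
        char *entry;

        while ((entry = strsep(&str, ";")) != NULL &&
               acs_disable_count < MAX_ACS_DISABLE) {
                if (sscanf(entry, "%x:%x:%x.%x", &dom, &bus, &dev, &fn) != 4)
                        continue;       /* skip malformed entries */
                acs_disable_list[acs_disable_count].domain = dom;
                acs_disable_list[acs_disable_count].bus = bus;
                acs_disable_list[acs_disable_count].devfn = PCI_DEVFN(dev, fn);
                acs_disable_count++;
        }
        return 1;
}
__setup("pci_p2p_acs_disable=", parse_acs_disable);

The same BDF persistence caveat from above applies, of course.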

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 21:27           ` Stephen  Bates
@ 2018-05-08 23:06             ` Don Dutile
  -1 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-08 23:06 UTC (permalink / raw)
  To: Stephen Bates, Dan Williams, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Christian König, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On 05/08/2018 05:27 PM, Stephen  Bates wrote:
> Hi Don
> 
>> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>>     That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
>>     I recommend doing so via a sysfs method.
> 
> Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.
> 
>      
>>             So I don't understand the comments why VMs should need to know.
> 
> As I understand it, VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact that the OS running in the VM may not even support hot-plug of PCI devices.
Alex:
Really? IOMMU groups are created by the kernel, so I don't know how they would be passed into the VMs, unless indirectly via the PCI(e) layout.
At best, twiddling with ACS enablement (emulation) would cause VMs to see different IOMMU groups, but again, VMs are not the security point/level; the hosts/HVs are.

>      
>> Is there a thread I need to read up to explain /clear-up the thoughts above?
> 
> If you search for p2pdma you should find the previous discussions. Thanks for the input!
> 
> Stephen
>      
>      
> 


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 22:25                                   ` Stephen  Bates
@ 2018-05-08 23:11                                     ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-08 23:11 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

On Tue, 8 May 2018 22:25:06 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:

> >    Yeah, so based on the discussion I'm leaning toward just having a
> >    command line option that takes a list of BDFs and disables ACS
> > for them. (Essentially as Dan has suggested.) This avoids the
> > shotgun.  
> 
> I concur that this seems to be where the conversation is taking us.
> 
> @Alex - Before we go do this can you provide input on the approach? I
> don't want to re-spin only to find we are still not converging on the
> ACS issue....

I can envision numerous implementation details that make this less
trivial than it sounds, but it seems like the thing we need to decide
first is whether intentionally leaving windows between devices, to be
exploited for direct P2P DMA in an otherwise IOMMU-managed address
space, is something we want to do.  From a security perspective, we
already handle this with IOMMU groups because many devices do not
support ACS; the new thing is embracing this rather than working around
it.  It makes me a little twitchy, but so long as the IOMMU groups
match the expected worst-case routing between devices, it's really no
different than if we could wipe the ACS capability from the device.

On to the implementation details... I already mentioned the BDF issue
in my other reply.  If we had a way to persistently identify a device,
would we specify the downstream points at which we want to disable ACS
or the endpoints that we want to connect?  The latter has a problem
that the grouping upstream of an endpoint is already set by the time we
discover the endpoint, so we might need to unwind to get the grouping
correct.  The former might be more difficult for users to find the
necessary nodes, but easier for the kernel to deal with during
discovery.  A runtime, sysfs approach has some benefits here,
especially in identifying the device assuming we're ok with leaving
the persistence problem to userspace tools.  I'm still a little fond of
the idea of exposing an acs_flags attribute for devices in sysfs where
a write would do a soft unplug and re-add of all affected devices to
automatically recreate the proper grouping.  Any dynamic change in
routing and grouping would require all DMA be re-established anyway and
a soft hotplug seems like an elegant way of handling it.  Thanks,

Alex
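
A very rough sketch of the store side of such an acs_flags attribute
(the name, semantics and wiring are hypothetical, not something this
series implements).  The soft unplug/re-add that would rebuild the
IOMMU groups is only marked with a TODO here:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/pci.h>

/*
 * Hypothetical sysfs attribute on a PCI bridge port: writing a mask
 * clears those bits in the ACS control register.  A real implementation
 * would also have to soft-remove and rescan the affected devices so the
 * IOMMU grouping is recreated.
 */
static ssize_t acs_flags_store(struct device *dev,
                               struct device_attribute *attr,
                               const char *buf, size_t count)
{
        struct pci_dev *pdev = to_pci_dev(dev);
        u16 ctrl, clear;
        int pos;

        if (kstrtou16(buf, 0, &clear))
                return -EINVAL;

        pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
        if (!pos)
                return -ENODEV;

        pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
        pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl & ~clear);

        /* TODO: soft unplug/re-add of the devices below this port */

        return count;
}
static DEVICE_ATTR_WO(acs_flags);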

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 23:00                                     ` Dan Williams
@ 2018-05-08 23:15                                       ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 23:15 UTC (permalink / raw)
  To: Dan Williams, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König



On 08/05/18 05:00 PM, Dan Williams wrote:
>> I'd advise caution with a user supplied BDF approach, we have no
>> guaranteed persistence for a device's PCI address.  Adding a device
>> might renumber the buses, replacing a device with one that consumes
>> more/less bus numbers can renumber the buses, motherboard firmware
>> updates could renumber the buses, pci=assign-buses can renumber the
>> buses, etc.  This is why the VT-d spec makes use of device paths when
>> describing PCI hierarchies, firmware can't know what bus number will be
>> assigned to a device, but it does know the base bus number and the path
>> of devfns needed to get to it.  I don't know how we come up with an
>> option that's easy enough for a user to understand, but reasonably
>> robust against hardware changes.  Thanks,
> 
> True, but at the same time this feature is for "users with custom
> hardware designed for purpose", I assume they would be willing to take
> on the bus renumbering risk. It's already the case that
> /sys/bus/pci/drivers/<x>/bind takes BDF, which is why it seemed to
> make a similar interface for the command line. Ideally we could later
> get something into ACPI or other platform firmware to arrange for
> bridges to disable ACS by default if we see p2p becoming a
> common-off-the-shelf feature. I.e. a BIOS switch to enable p2p in a
> given PCI-E sub-domain.

Yeah, I'm having a hard time coming up with an easy enough solution for
the user. I agree with Dan, though: the bus renumbering risk would be
fairly low in the custom hardware, seeing as the switches are likely
going to be soldered directly to the same board as the CPU.

That being said, I suppose we could allow the command line to take
either a BDF or a BaseBus/DF/DF/DF path. Though implementing this
sounds like a bit of a challenge.

Logan
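
A minimal sketch of parsing that BaseBus/DF/DF/DF form (the exact
format and helper name are hypothetical); the resulting devfn chain
could then be walked after enumeration with pci_find_bus() and
pci_get_slot():

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/string.h>

/* e.g. "0000:03/00.0/04.0" -> domain 0, base bus 3, devfns {00.0, 04.0} */
static int parse_bus_path(char *str, u16 *domain, u8 *base_bus,
                          u8 *devfns, int max_depth)
{
        unsigned int dom, bus, dev, fn;
        char *tok;
        int depth = 0;

        tok = strsep(&str, "/");
        if (!tok || sscanf(tok, "%x:%x", &dom, &bus) != 2)
                return -EINVAL;
        *domain = dom;
        *base_bus = bus;

        while ((tok = strsep(&str, "/")) != NULL) {
                if (depth >= max_depth)
                        return -E2BIG;
                if (sscanf(tok, "%x.%x", &dev, &fn) != 2)
                        return -EINVAL;
                devfns[depth++] = PCI_DEVFN(dev, fn);
        }
        return depth;   /* number of hops below the base bus */
}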

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 23:11                                     ` Alex Williamson
@ 2018-05-08 23:31                                       ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 23:31 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König



On 08/05/18 05:11 PM, Alex Williamson wrote:
> On to the implementation details... I already mentioned the BDF issue
> in my other reply.  If we had a way to persistently identify a device,
> would we specify the downstream points at which we want to disable ACS
> or the endpoints that we want to connect?  The latter has a problem
> that the grouping upstream of an endpoint is already set by the time we
> discover the endpoint, so we might need to unwind to get the grouping
> correct.  The former might be more difficult for users to find the
> necessary nodes, but easier for the kernel to deal with during
> discovery.  

I was envisioning the former, with the kernel helping by printing a dmesg
in certain circumstances to help with figuring out which devices need to be
specified. Specifying a list of endpoints on the command line and having
the kernel try to figure out which downstream ports need to be adjusted
while we are in the middle of enumerating the bus is, like you said, a
nightmare.

> A runtime, sysfs approach has some benefits here,
> especially in identifying the device assuming we're ok with leaving
> the persistence problem to userspace tools.  I'm still a little fond of
> the idea of exposing an acs_flags attribute for devices in sysfs where
> a write would do a soft unplug and re-add of all affected devices to
> automatically recreate the proper grouping.  Any dynamic change in
> routing and grouping would require all DMA be re-established anyway and
> a soft hotplug seems like an elegant way of handling it.  Thanks,

This approach sounds like it has a lot more issues to contend with:

For starters, a soft unplug/re-add of all the devices behind a switch is
going to be difficult if a lot of those devices have had drivers
installed and their respective resources are now mounted or otherwise in
use.

Then, do we have to redo the soft-replace every time we change the ACS
bit for a downstream port? That could mean you have to do dozens of
soft-replaces before you have all the ACS bits set, which means you have
a storm of drivers being added and removed.

This would require some kind of fancy custom setup software that runs at
just the right time in the boot sequence, or a lot of work on the user's
part to unbind all the resources, set up the ACS bits, and then rebind
everything (assuming the soft re-add doesn't rebind it every time you
adjust one ACS bit). Ugly.

IMO, if we need to do the sysfs approach then we need to be able to
adjust the groups dynamically in a sensible way and not through the
large hammer that is soft-replaces. I think this would be great but I
don't think we will be tackling that with this patch set.

Logan
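
For reference, the soft-replace being discussed boils down to roughly
the following (a hypothetical helper around existing remove/rescan
primitives); every device below the port is torn down and re-probed,
which is exactly why bound drivers and in-use resources make it
painful:

#include <linux/list.h>
#include <linux/pci.h>

/*
 * Hypothetical sketch: soft-replace everything below a downstream port.
 * Any driver bound to a device below the port is unbound and re-probed
 * as a side effect of the remove/rescan.
 */
static void soft_replace_below(struct pci_dev *port)
{
        struct pci_dev *child, *tmp;

        if (!port->subordinate)
                return;

        pci_lock_rescan_remove();
        list_for_each_entry_safe(child, tmp,
                                 &port->subordinate->devices, bus_list)
                pci_stop_and_remove_bus_device(child);
        pci_rescan_bus(port->subordinate);
        pci_unlock_rescan_remove();
}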



^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-08 23:31                                       ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-08 23:31 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Christian König, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt



On 08/05/18 05:11 PM, Alex Williamson wrote:
> On to the implementation details... I already mentioned the BDF issue
> in my other reply.  If we had a way to persistently identify a device,
> would we specify the downstream points at which we want to disable ACS
> or the endpoints that we want to connect?  The latter has a problem
> that the grouping upstream of an endpoint is already set by the time we
> discover the endpoint, so we might need to unwind to get the grouping
> correct.  The former might be more difficult for users to find the
> necessary nodes, but easier for the kernel to deal with during
> discovery.  

I was envisioning the former with kernel helping by printing a dmesg in
certain circumstances to help with figuring out which devices need to be
specified. Specifying a list of endpoints on the command line and having
the kernel try to figure out which downstream ports need to be adjusted
while we are in the middle of enumerating the bus is, like you said, a
nightmare.

> A runtime, sysfs approach has some benefits here,
> especially in identifying the device assuming we're ok with leaving
> the persistence problem to userspace tools.  I'm still a little fond of
> the idea of exposing an acs_flags attribute for devices in sysfs where
> a write would do a soft unplug and re-add of all affected devices to
> automatically recreate the proper grouping.  Any dynamic change in
> routing and grouping would require all DMA be re-established anyway and
> a soft hotplug seems like an elegant way of handling it.  Thanks,

This approach sounds like it has a lot more issues to contend with:

For starters, a soft unplug/re-add of all the devices behind a switch is
going to be difficult if a lot of those devices have had drivers
installed and their respective resources are now mounted or otherwise in
use.

Then, do we have to redo a the soft-replace every time we change the ACS
bit for every downstream port? That could mean you have to do dozens
soft-replaces before you have all the ACS bits set which means you have
a storm of drivers being added and removed.

This would require some kind of fancy custom setup software that runs at
just the right time in the boot sequence or a lot of work on the users
part to unbind all the resources, setup the ACS bits and then rebind
everything (assuming the soft re-add doesn't rebind it every time you
adjust one ACS bit). Ugly.

IMO, if we need to do the sysfs approach then we need to be able to
adjust the groups dynamically in a sensible way and not through the
large hammer that is soft-replaces. I think this would be great but I
don't think we will be tackling that with this patch set.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 23:06             ` Don Dutile
                                 ` (2 preceding siblings ...)
  (?)
@ 2018-05-09  0:01               ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-09  0:01 UTC (permalink / raw)
  To: Don Dutile
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	Linux Kernel Mailing List, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Christian König,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Tue, 8 May 2018 19:06:17 -0400
Don Dutile <ddutile@redhat.com> wrote:
> On 05/08/2018 05:27 PM, Stephen  Bates wrote:
> > As I understand it VMs need to know because VFIO passes IOMMU
> > grouping up into the VMs. So if a IOMMU grouping changes the VM's
> > view of its PCIe topology changes. I think we even have to be
> > cognizant of the fact the OS running on the VM may not even support
> > hot-plug of PCI devices.  
> Alex:
> Really? IOMMU groups are created by the kernel, so don't know how
> they would be passed into the VMs, unless indirectly via  PCI(e)
> layout. At best, twiddling w/ACS enablement (emulation) would cause
> VMs to see different IOMMU groups, but again, VMs are not the
> security point/level, the host/HV's are.

Correct, the VM has no concept of the host's IOMMU groups, only the
hypervisor knows about the groups, but really only to the extent of
which device belongs to which group and whether the group is viable.
Any runtime change to grouping though would require DMA mapping
updates, which I don't see how we can reasonably do with drivers,
vfio-pci or native host drivers, bound to the affected devices.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 23:31                                       ` Logan Gunthorpe
  (?)
  (?)
@ 2018-05-09  0:17                                         ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-09  0:17 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig,
	Christian König

On Tue, 8 May 2018 17:31:48 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:
> On 08/05/18 05:11 PM, Alex Williamson wrote:
> > A runtime, sysfs approach has some benefits here,
> > especially in identifying the device assuming we're ok with leaving
> > the persistence problem to userspace tools.  I'm still a little fond of
> > the idea of exposing an acs_flags attribute for devices in sysfs where
> > a write would do a soft unplug and re-add of all affected devices to
> > automatically recreate the proper grouping.  Any dynamic change in
> > routing and grouping would require all DMA be re-established anyway and
> > a soft hotplug seems like an elegant way of handling it.  Thanks,  
> 
> This approach sounds like it has a lot more issues to contend with:
> 
> For starters, a soft unplug/re-add of all the devices behind a switch is
> going to be difficult if a lot of those devices have had drivers
> installed and their respective resources are now mounted or otherwise in
> use.
> 
> Then, do we have to redo the soft-replace every time we change the ACS
> bit for every downstream port? That could mean you have to do dozens of
> soft-replaces before you have all the ACS bits set, which means you have
> a storm of drivers being added and removed.

True, anything requiring tweaking multiple downstream ports would
induce a hot-unplug/replug for each.  A better sysfs interface would
allow multiple downstream ports to be updated in a single shot.
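
For what it's worth, here is a purely hypothetical sketch of such an
interface; the attribute and the helper below are invented and nothing like
them exists today. The idea is that one write names every downstream port,
the ACS bits get adjusted for the whole set, and the soft unplug/replug
then only has to happen once:

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/string.h>

/* store for an imaginary /sys/bus/pci/p2p_acs_disable attribute */
static ssize_t p2p_acs_disable_store(struct bus_type *bus,
				     const char *buf, size_t count)
{
	unsigned int dom, b, dev, fn;
	char *s, *tok, *orig;
	struct pci_dev *pdev;

	orig = s = kstrndup(buf, count, GFP_KERNEL);
	if (!s)
		return -ENOMEM;

	while ((tok = strsep(&s, " \n")) != NULL) {
		if (sscanf(tok, "%x:%x:%x.%x", &dom, &b, &dev, &fn) != 4)
			continue;
		pdev = pci_get_domain_bus_and_slot(dom, b, PCI_DEVFN(dev, fn));
		if (!pdev)
			continue;
		/* ...clear the ACS P2P redirect bits on this port here... */
		pci_dev_put(pdev);
	}

	kfree(orig);
	return count;
}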

> This would require some kind of fancy custom setup software that runs at
> just the right time in the boot sequence, or a lot of work on the user's
> part to unbind all the resources, set up the ACS bits and then rebind
> everything (assuming the soft re-add doesn't rebind it every time you
> adjust one ACS bit). Ugly.
> 
> IMO, if we need to do the sysfs approach then we need to be able to
> adjust the groups dynamically in a sensible way and not through the
> large hammer that is a soft-replace. I think this would be great but I
> don't think we will be tackling that with this patch set.

OTOH, I think the only sensible way to dynamically adjust groups is
through hotplug; we cannot have running drivers attached to downstream
endpoints as we're adjusting the routing.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09  0:01               ` Alex Williamson
                                   ` (3 preceding siblings ...)
  (?)
@ 2018-05-09 12:35                 ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 12:35 UTC (permalink / raw)
  To: Alex Williamson, Don Dutile
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	Linux Kernel Mailing List, linux-nvme, Christian König,
	linux-block, Jérôme Glisse, Jason Gunthorpe,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

Hi Alex and Don

>    Correct, the VM has no concept of the host's IOMMU groups, only the
>   hypervisor knows about the groups, 

But as I understand it, these groups are usually passed through to VMs on a per-group basis by the hypervisor? So IOMMU group 1 might be passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is not aware of IOMMU groupings, but it is impacted by them in the sense that if the groupings change, the PCI topology presented to the VM needs to change too.
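
As a concrete illustration of what the hypervisor side actually has to go
on, here is a minimal userspace sketch (not code from QEMU/VFIO): the host
kernel exposes the grouping as a per-device sysfs symlink, and that group
number is essentially all the hypervisor sees.

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* e.g. host_iommu_group("0000:03:00.0") returns the host group number */
static int host_iommu_group(const char *bdf)
{
	char link[PATH_MAX], target[PATH_MAX];
	ssize_t n;
	char *p;
	int id = -1;

	snprintf(link, sizeof(link),
		 "/sys/bus/pci/devices/%s/iommu_group", bdf);
	n = readlink(link, target, sizeof(target) - 1);
	if (n < 0)
		return -1;
	target[n] = '\0';
	p = strrchr(target, '/');	/* ".../kernel/iommu_groups/<id>" */
	if (p)
		sscanf(p + 1, "%d", &id);
	return id;
}

Two endpoints that report the same number here sit in the same VFIO group
and generally have to be handled together, so an ACS change that alters
these numbers is exactly the kind of topology change described above.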

Stephen


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 23:15                                       ` Logan Gunthorpe
                                                           ` (2 preceding siblings ...)
  (?)
@ 2018-05-09 12:38                                         ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 12:38 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

Hi Logan

>    Yeah, I'm having a hard time coming up with an easy enough solution for
>    the user. I agree with Dan though, the bus renumbering risk would be
>    fairly low in the custom hardware seeing the switches are likely going
>    to be directly soldered to the same board with the CPU.
    
I am afraid that soldered-down assumption may not be valid. More and more PCIe cards with PCIe switches on them are becoming available, and people are using these to connect servers to arrays of NVMe SSDs, which may make the topology more dynamic.

Stephen


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 22:21                                 ` Don Dutile
                                                     ` (3 preceding siblings ...)
  (?)
@ 2018-05-09 12:44                                   ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 12:44 UTC (permalink / raw)
  To: Don Dutile, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

Hi Don

>    RDMA VFs lend themselves to NVMEoF w/device-assignment.... need a way to
>    put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping',
>    which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.
    
Ha, I like your term "DMA Security Domain", which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of a hammer for what we want here, in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter the TLPs by address or ID, though PCI-SIG is having some discussions on extending ACS. That's a long-term solution and won't be applicable to us for some time.

NVMe SSDs that support SR-IOV are coming to market, but we can't assume all NVMe SSDs will support SR-IOV. That will probably be a pretty high-end feature...

Stephen
    
    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 20:50                       ` Jerome Glisse
                                           ` (2 preceding siblings ...)
  (?)
@ 2018-05-09 13:12                         ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 13:12 UTC (permalink / raw)
  To: Jerome Glisse, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

Jerome and Christian
    
> I think there is confusion here, Alex properly explained the scheme
> PCIE-device do a ATS request to the IOMMU which returns a valid
> translation for a virtual address. Device can then use that address
> directly without going through IOMMU for translation.

So I went through ATS in version 4.0r1 of the PCI spec. It looks like even an ATS-translated TLP is still impacted by ACS, though there is a separate control knob for translated-address TLPs (see section 7.7.7.2 of 4.0r1 of the spec). So even if your device supports ATS, a P2P DMA will still be routed to the associated RP of the domain and down again unless we disable ACS DT P2P on all bridges between the two devices involved in the P2P DMA. 

So we still don't get fine-grained control with ATS, and I guess we still have security issues because a rogue or malfunctioning EP could just as easily issue TLPs with the AT field set vs. not set.

> Also ATS is meaningless without something like PASID as far as i know.
    
ATS is still somewhat valuable without PASID in the sense that you can cache IOMMU address translations at the EP. This avoids hammering on the IOMMU as much in certain workloads.

Interestingly, Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can implement P2P between root ports should advertise the "ACS Direct Translated P2P (T)" capability. This ties into the discussion around P2P between root ports we had a few weeks ago...
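
For reference, the knob in question is the "Direct Translated P2P" enable
bit in the ACS control register; per my reading of the spec, setting it on
a bridge lets requests whose AT field says "translated" be routed directly,
while untranslated requests keep getting redirected. A minimal sketch of
flipping it (the helper name is made up and this is not code from the
patch set):

#include <linux/pci.h>

static void acs_allow_translated_p2p(struct pci_dev *bridge)
{
	int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return;

	pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
	ctrl |= PCI_ACS_DT;	/* ACS Direct Translated P2P Enable */
	pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);
}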

Stephen    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 13:12                         ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 13:12 UTC (permalink / raw)


Jerome and Christian
    
> I think there is confusion here, Alex properly explained the scheme
> PCIE-device do a ATS request to the IOMMU which returns a valid
> translation for a virtual address. Device can then use that address
> directly without going through IOMMU for translation.

So I went through ATS in version 4.0r1 of the PCI spec. It looks like even a ATS translated TLP is still impacted by ACS though it has a separate control knob for translated address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if your device supports ATS a P2P DMA will still be routed to the associated RP of the domain and down again unless we disable ACS DT P2P on all bridges between the two devices involved in the P2P DMA. 

So we still don't get fine grained control with ATS and I guess we still have security issues because a rogue or malfunctioning EP could just as easily issue TLPs with TA set vs not set.

> Also ATS is meaningless without something like PASID as far as i know.
    
ATS is still somewhat valuable without PSAID in the sense you can cache IOMMU address translations at the EP. This saves hammering on the IOMMU as much in certain workloads.

Interestingly Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can implement P2P between root ports should advertise "ACS Direct Translated P2P (T)" capability. This ties into the discussion around P2P between route ports we had a few weeks ago...

Stephen    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 13:12                         ` Stephen  Bates
                                             ` (2 preceding siblings ...)
  (?)
@ 2018-05-09 13:40                           ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-09 13:40 UTC (permalink / raw)
  To: Stephen Bates, Jerome Glisse, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On 09.05.2018 at 15:12, Stephen Bates wrote:
> Jerome and Christian
>      
>> I think there is confusion here, Alex properly explained the scheme
>> PCIE-device do a ATS request to the IOMMU which returns a valid
>> translation for a virtual address. Device can then use that address
>> directly without going through IOMMU for translation.
> So I went through ATS in version 4.0r1 of the PCI spec. It looks like even a ATS translated TLP is still impacted by ACS though it has a separate control knob for translated address TLPs (see 7.7.7.2 of 4.0r1 of the spec). So even if your device supports ATS a P2P DMA will still be routed to the associated RP of the domain and down again unless we disable ACS DT P2P on all bridges between the two devices involved in the P2P DMA.
>
> So we still don't get fine grained control with ATS and I guess we still have security issues because a rogue or malfunctioning EP could just as easily issue TLPs with TA set vs not set.

I still need to double-check the specification (I had a busy morning today),
but that sounds about correct.

The key takeaway is that when any device has ATS enabled, you can't
disable ACS without breaking it (even if you unplug and replug it).

>> Also ATS is meaningless without something like PASID as far as i know.
>      
> ATS is still somewhat valuable without PSAID in the sense you can cache IOMMU address translations at the EP. This saves hammering on the IOMMU as much in certain workloads.
>
> Interestingly Section 7.7.7.2 almost mentions that Root Ports that support ATS AND can implement P2P between root ports should advertise "ACS Direct Translated P2P (T)" capability. This ties into the discussion around P2P between route ports we had a few weeks ago...

Interesting point, give me a moment to check that. That finally makes 
all the hardware I have standing around here valuable :)

Christian.

>
> Stephen
>
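
As an aside, a minimal sketch (not from this series) of checking whether a given function has ATS present and enabled, by reading its ATS extended capability directly. The helper name is made up for illustration; in-tree code generally tracks this state in struct pci_dev instead.

#include <linux/pci.h>

static bool dev_ats_enabled(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ATS);
	u16 ctrl;

	if (!pos)
		return false;		/* device does not implement ATS */

	pci_read_config_word(pdev, pos + PCI_ATS_CTRL, &ctrl);
	return !!(ctrl & PCI_ATS_CTRL_ENABLE);
}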

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 12:35                 ` Stephen  Bates
                                     ` (2 preceding siblings ...)
  (?)
@ 2018-05-09 14:44                   ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-09 14:44 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	Linux Kernel Mailing List, linux-nvme, Christian König,
	linux-block, Jérôme Glisse, Jason Gunthorpe,
	Don Dutile, Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Wed, 9 May 2018 12:35:56 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:

> Hi Alex and Don
> 
> >    Correct, the VM has no concept of the host's IOMMU groups, only
> > the hypervisor knows about the groups,   
> 
> But as I understand it these groups are usually passed through to VMs
> on a pre-group basis by the hypervisor? So IOMMU group 1 might be
> passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is
> not aware of IOMMU groupings but it is impacted by them in the sense
> that if the groupings change the PCI topology presented to the VM
> needs to change too.

Hypervisors don't currently expose any topology based on the grouping,
the only case where such a concept even makes sense is when a vIOMMU is
present, as devices within the same group cannot have separate address
spaces.  Our options for exposing such information are also limited; our
only real option would seem to be placing devices within the same group
together on a conventional PCI bus to denote the address space
granularity.  Currently we strongly recommend singleton groups for this
case and leave any additional configuration constraints to the admin.

The case you note of a group passed to VM A and another passed to VM B
is exactly an example of why any sort of dynamic routing change needs to
have the groups fully released, such as via hot-unplug.  For instance,
a routing change at a shared node above groups 1 & 2 could result in
the merging of these groups and there is absolutely no way to handle
that with portions of the group being owned by two separate VMs after
the merge.  Thanks,

Alex
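
For context, a small userspace sketch (not from this series) that lists IOMMU groups and their member devices the way host tooling sees them, by walking /sys/kernel/iommu_groups/<group>/devices/. The program and its output format are invented for illustration.

#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *groups = opendir("/sys/kernel/iommu_groups");
	struct dirent *g;

	if (!groups)
		return 1;

	while ((g = readdir(groups))) {
		char path[512];
		DIR *devs;
		struct dirent *d;

		if (g->d_name[0] == '.')
			continue;

		snprintf(path, sizeof(path),
			 "/sys/kernel/iommu_groups/%s/devices", g->d_name);
		devs = opendir(path);
		if (!devs)
			continue;

		printf("group %s:\n", g->d_name);
		while ((d = readdir(devs)))
			if (d->d_name[0] != '.')
				printf("  %s\n", d->d_name);	/* e.g. 0000:03:00.0 */
		closedir(devs);
	}
	closedir(groups);
	return 0;
}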

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 13:40                           ` Christian König
                                               ` (2 preceding siblings ...)
  (?)
@ 2018-05-09 15:41                             ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 15:41 UTC (permalink / raw)
  To: Christian König
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Christian

>    Interesting point, give me a moment to check that. That finally makes 
>    all the hardware I have standing around here valuable :)
    
Yes. At the very least it provides an initial standards-based path for P2P DMAs across RPs, which is something we have discussed on this list in the past as being desirable.

BTW I am trying to understand how an ATS-capable EP function determines when to perform an ATS Translation Request (ATS TR). Is there an upstream example of the driver for your APU that uses ATS? If so, can you provide a pointer to it? Do you provide some type of entry in the submission queues for commands going to the APU to indicate whether the address associated with a specific command should be translated using ATS or not? Or do you simply enable ATS, and then all addresses passed to your APU that miss the local cache result in an ATS TR?

Your feedback would be useful as I initiate discussions within the NVMe community on where we might go with ATS...

Thanks

Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09  0:01               ` Alex Williamson
  (?)
  (?)
@ 2018-05-09 15:47                 ` Don Dutile
  -1 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-09 15:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	Linux Kernel Mailing List, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Christian König,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On 05/08/2018 08:01 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 19:06:17 -0400
> Don Dutile <ddutile@redhat.com> wrote:
>> On 05/08/2018 05:27 PM, Stephen  Bates wrote:
>>> As I understand it VMs need to know because VFIO passes IOMMU
>>> grouping up into the VMs. So if a IOMMU grouping changes the VM's
>>> view of its PCIe topology changes. I think we even have to be
>>> cognizant of the fact the OS running on the VM may not even support
>>> hot-plug of PCI devices.
>> Alex:
>> Really? IOMMU groups are created by the kernel, so don't know how
>> they would be passed into the VMs, unless indirectly via  PCI(e)
>> layout. At best, twiddling w/ACS enablement (emulation) would cause
>> VMs to see different IOMMU groups, but again, VMs are not the
>> security point/level, the host/HV's are.
> 
> Correct, the VM has no concept of the host's IOMMU groups, only the
> hypervisor knows about the groups, but really only to the extent of
> which device belongs to which group and whether the group is viable.
> Any runtime change to grouping though would require DMA mapping
> updates, which I don't see how we can reasonably do with drivers,
> vfio-pci or native host drivers, bound to the affected devices.  Thanks,
> 
> Alex
> 
A change in IOMMU groups would/could require a device remove/add cycle to get an updated DMA mapping (yet another overused term: IOMMU 'domain').

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 14:44                   ` Alex Williamson
  (?)
  (?)
@ 2018-05-09 15:52                     ` Don Dutile
  -1 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-09 15:52 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	Linux Kernel Mailing List, linux-nvme, Christian König,
	linux-block, Jérôme Glisse, Jason Gunthorpe,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On 05/09/2018 10:44 AM, Alex Williamson wrote:
> On Wed, 9 May 2018 12:35:56 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:
> 
>> Hi Alex and Don
>>
>>>     Correct, the VM has no concept of the host's IOMMU groups, only
>>> the hypervisor knows about the groups,
>>
>> But as I understand it these groups are usually passed through to VMs
>> on a pre-group basis by the hypervisor? So IOMMU group 1 might be
>> passed to VM A and IOMMU group 2 passed to VM B. So I agree the VM is
>> not aware of IOMMU groupings but it is impacted by them in the sense
>> that if the groupings change the PCI topology presented to the VM
>> needs to change too.
> 
> Hypervisors don't currently expose any topology based on the grouping,
> the only case where such a concept even makes sense is when a vIOMMU is
> present as devices within the same group cannot have separate address
> spaces.  Our options for exposing such information is also limited, our
> only real option would seem to be placing devices within the same group
> together on a conventional PCI bus to denote the address space
> granularity.  Currently we strongly recommend singleton groups for this
> case and leave any additional configuration constraints to the admin.
> 
> The case you note of a group passed to VM A and another passed to VM B
> is exactly an example of why any sort of dynamic routing change needs to
> have the groups fully released, such as via hot-unplug.  For instance,
> a routing change at a shared node above groups 1 & 2 could result in
> the merging of these groups and there is absolutely no way to handle
> that with portions of the group being owned by two separate VMs after
> the merge.  Thanks,
> 
> Alex
> 
The above is why I stated that the host/HV has to do p2p setup *before* device assignment
is done.
That could be done at boot time (with a mod.conf-like config in the host/HV, before VM startup)
as well.
Doing it dynamically, if such a feature is needed, requires a hot-unplug/plug cycle, as Alex states.
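
Purely as a hypothetical sketch of the boot-time option mentioned above (nothing like this exists in the posted series or upstream; the parameter name, helper and behaviour are invented): a module parameter naming two endpoints the host would pair for P2P before any VM starts, leaving the actual ACS adjustment to real code.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/string.h>

static char *p2p_pair;	/* e.g. p2p_pair=0000:03:00.0,0000:04:00.0 */
module_param(p2p_pair, charp, 0444);
MODULE_PARM_DESC(p2p_pair, "Pair of PCI devices to allow P2P DMA between");

static int __init p2p_pair_init(void)
{
	char *a = p2p_pair, *b;

	if (!a)
		return 0;	/* nothing requested */

	b = strchr(a, ',');
	if (!b)
		return -EINVAL;
	*b++ = '\0';

	/*
	 * A real implementation would look up both pci_devs here and relax
	 * ACS on every bridge on the path between them, before any of the
	 * devices are handed to a guest.
	 */
	pr_info("p2p_pair: would enable P2P between %s and %s\n", a, b);
	return 0;
}
module_init(p2p_pair_init);
MODULE_LICENSE("GPL");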

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-08 21:27           ` Stephen  Bates
  (?)
@ 2018-05-09 15:53             ` Don Dutile
  -1 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-09 15:53 UTC (permalink / raw)
  To: Stephen Bates, Dan Williams, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Christian König, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On 05/08/2018 05:27 PM, Stephen  Bates wrote:
> Hi Don
> 
>> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>>     That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
>>     I recommend doing so via a sysfs method.
> 
> Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.
> 
>      
>>             So I don't understand the comments why VMs should need to know.
> 
> As I understand it VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if a IOMMU grouping changes the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact the OS running on the VM may not even support hot-plug of PCI devices.
>      
>> Is there a thread I need to read up to explain /clear-up the thoughts above?
> 
> If you search for p2pdma you should find the previous discussions. Thanks for the input!
> 
under linux-pci I'm assuming...
you cc'd a number of upstream lists; I picked this thread up via rdma-list.

> Stephen
>      
>      
> 


^ permalink raw reply	[flat|nested] 460+ messages in thread
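
For illustration only, a minimal sketch of what the sysfs knob suggested above could look like on the client device, assuming the managing agent writes the peer's BDF into it. The attribute name "p2p_allow" and the helper pci_p2p_relax_acs_path() are invented here and are not part of the posted series:

#include <linux/device.h>
#include <linux/pci.h>

/* Hypothetical helper: relax ACS on every bridge between 'client' and 'peer'. */
int pci_p2p_relax_acs_path(struct pci_dev *client, struct pci_dev *peer);

static ssize_t p2p_allow_store(struct device *dev,
			       struct device_attribute *attr,
			       const char *buf, size_t count)
{
	struct pci_dev *client = to_pci_dev(dev);
	struct pci_dev *peer;
	unsigned int domain, bus, slot, fn;
	int ret;

	/* Expect a peer BDF such as "0000:03:00.0" from the managing agent. */
	if (sscanf(buf, "%x:%x:%x.%x", &domain, &bus, &slot, &fn) != 4)
		return -EINVAL;

	peer = pci_get_domain_bus_and_slot(domain, bus, PCI_DEVFN(slot, fn));
	if (!peer)
		return -ENODEV;

	ret = pci_p2p_relax_acs_path(client, peer);
	pci_dev_put(peer);

	return ret ? ret : count;
}
static DEVICE_ATTR_WO(p2p_allow);

As the thread notes, any such knob still changes the IOMMU grouping underneath whoever owns the devices, which is exactly the problem discussed above for assigned devices.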

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 15:53             ` Don Dutile
  0 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-09 15:53 UTC (permalink / raw)
  To: Stephen Bates, Dan Williams, Logan Gunthorpe
  Cc: Linux Kernel Mailing List, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson, Christian König

On 05/08/2018 05:27 PM, Stephen  Bates wrote:
> Hi Don
> 
>> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>>     That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
>>     I recommend doing so via a sysfs method.
> 
> Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.
> 
>      
>>             So I don't understand the comments why VMs should need to know.
> 
> As I understand it VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if a IOMMU grouping changes the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact the OS running on the VM may not even support hot-plug of PCI devices.
>      
>> Is there a thread I need to read up to explain /clear-up the thoughts above?
> 
> If you search for p2pdma you should find the previous discussions. Thanks for the input!
> 
under linux-pci I'm assuming...
you cc'd a number of upstream lists; I picked this thread up via rdma-list.

> Stephen
>      
>      
> 

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 15:53             ` Don Dutile
  0 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-09 15:53 UTC (permalink / raw)


On 05/08/2018 05:27 PM, Stephen  Bates wrote:
> Hi Don
> 
>> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices.
>>     That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints.
>>     I recommend doing so via a sysfs method.
> 
> Yes we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.
> 
>      
>>             So I don't understand the comments why VMs should need to know.
> 
> As I understand it VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if a IOMMU grouping changes the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact the OS running on the VM may not even support hot-plug of PCI devices.
>      
>> Is there a thread I need to read up to explain /clear-up the thoughts above?
> 
> If you search for p2pdma you should find the previous discussions. Thanks for the input!
> 
under linux-pci I'm assuming...
you cc'd a number of upstream lists; I picked this thread up via rdma-list.

> Stephen
>      
>      
> 

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 12:44                                   ` Stephen  Bates
  (?)
  (?)
@ 2018-05-09 15:58                                     ` Don Dutile
  -1 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-09 15:58 UTC (permalink / raw)
  To: Stephen Bates, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

On 05/09/2018 08:44 AM, Stephen  Bates wrote:
> Hi Don
> 
>>     RDMA VFs lend themselves to NVMEoF w/device-assignment.... need a way to
>>     put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping',
>>     which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.
>      
> Ha, I like your term "DMA Security Domain" which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of hammer for what we want here in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter the TLPs by address or ID though PCI-SIG are having some discussions on extending ACS. That's a long term solution and won't be applicable to us for some time.
> 
> NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe SSDs with support SR-IOV. That will probably be a pretty high end-feature...
> 
> Stephen
>      
>      
> 
Sure, we could provide insecure enablement for development and kick-the-tires deployment ...
device-assignment started that way (no ACS, no intr-remapping, etc.), but for secure setups,
VFs for both p2p EPs are the best security model.
So, we should have a design goal for the secure configuration.
Workarounds/insecure modes to deal with the near-term what-we-have-to-work-with can be employed, but they shouldn't be
the only/de facto/final solution.



^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 15:58                                     ` Don Dutile
  0 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-09 15:58 UTC (permalink / raw)
  To: Stephen Bates, Alex Williamson
  Cc: Logan Gunthorpe, Christian König, Bjorn Helgaas,
	linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt

On 05/09/2018 08:44 AM, Stephen  Bates wrote:
> Hi Don
> 
>>     RDMA VFs lend themselves to NVMEoF w/device-assignment.... need a way to
>>     put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping',
>>     which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.
>      
> Ha, I like your term "DMA Security Domain" which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of hammer for what we want here in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter the TLPs by address or ID though PCI-SIG are having some discussions on extending ACS. That's a long term solution and won't be applicable to us for some time.
> 
> NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe SSDs with support SR-IOV. That will probably be a pretty high end-feature...
> 
> Stephen
>      
>      
> 
Sure, we could provide insecure enablement for development and kick-the-tires deployment ...
device-assignment started that way (no ACS, no intr-remapping, etc.), but for secure setups,
VFs for both p2p EPs are the best security model.
So, we should have a design goal for the secure configuration.
Workarounds/insecure modes to deal with the near-term what-we-have-to-work-with can be employed, but they shouldn't be
the only/de facto/final solution.

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 15:58                                     ` Don Dutile
  0 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-09 15:58 UTC (permalink / raw)
  To: Stephen Bates, Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

On 05/09/2018 08:44 AM, Stephen  Bates wrote:
> Hi Don
> 
>>     RDMA VFs lend themselves to NVMEoF w/device-assignment.... need a way to
>>     put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping',
>>     which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.
>      
> Ha, I like your term "DMA Security Domain" which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of hammer for what we want here in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter the TLPs by address or ID though PCI-SIG are having some discussions on extending ACS. That's a long term solution and won't be applicable to us for some time.
> 
> NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe SSDs with support SR-IOV. That will probably be a pretty high end-feature...
> 
> Stephen
>      
>      
> 
Sure, we could provide insecure enablement for development and kick-the-tires deployment ...
device-assignment started that way (no ACS, no intr-remapping, etc.), but for secure setups,
VFs for both p2p EPs are the best security model.
So, we should have a design goal for the secure configuration.
Workarounds/insecure modes to deal with the near-term what-we-have-to-work-with can be employed, but they shouldn't be
the only/de facto/final solution.

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 15:58                                     ` Don Dutile
  0 siblings, 0 replies; 460+ messages in thread
From: Don Dutile @ 2018-05-09 15:58 UTC (permalink / raw)


On 05/09/2018 08:44 AM, Stephen  Bates wrote:
> Hi Don
> 
>>     RDMA VFs lend themselves to NVMEoF w/device-assignment.... need a way to
>>     put NVME 'resources' into an assignable/manageable object for 'IOMMU-grouping',
>>     which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.
>      
> Ha, I like your term "DMA Security Domain" which sounds about right for what we are discussing with p2pdma and ACS disablement ;-). The problem is that ACS is, in some ways, too big of hammer for what we want here in the sense that it is either on or off for the bridge or MF EP we enable/disable it for. ACS can't filter the TLPs by address or ID though PCI-SIG are having some discussions on extending ACS. That's a long term solution and won't be applicable to us for some time.
> 
> NVMe SSDs that support SR-IOV are coming to market but we can't assume all NVMe SSDs with support SR-IOV. That will probably be a pretty high end-feature...
> 
> Stephen
>      
>      
> 
Sure, we could provide insecure enablement for development and kick-the-tires deployment ...
device-assignment started that way (no ACS, no intr-remapping, etc.), but for secure setups,
VFs for both p2p EPs are the best security model.
So, we should have a design goal for the secure configuration.
Workarounds/insecure modes to deal with the near-term what-we-have-to-work-with can be employed, but they shouldn't be
the only/de facto/final solution.

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 15:41                             ` Stephen  Bates
                                                 ` (2 preceding siblings ...)
  (?)
@ 2018-05-09 16:07                               ` Jerome Glisse
  -1 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 16:07 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

On Wed, May 09, 2018 at 03:41:44PM +0000, Stephen  Bates wrote:
> Christian
> 
> >    Interesting point, give me a moment to check that. That finally makes 
> >    all the hardware I have standing around here valuable :)
>     
> Yes. At the very least it provides an initial standards based path
> for P2P DMAs across RPs which is something we have discussed on this
> list in the past as being desirable.
> 
> BTW I am trying to understand how an ATS capable EP function determines
> when to perform an ATS Translation Request (ATS TR). Is there an
> upstream example of the driver for your APU that uses ATS? If so, can
> you provide a pointer to it. Do you provide some type of entry in the
> submission queues for commands going to the APU to indicate if the
> address associated with a specific command should be translated using
> ATS or not? Or do you simply enable ATS and then all addresses passed
> to your APU that miss the local cache result in a ATS TR?

On GPUs ATS is always tied to a PASID. You do not do the former without
the latter (AFAICT this is not doable, maybe through some JTAG but not
in normal operation).

GPUs are like CPUs, so you have GPU threads that run against an address
space. This address space uses a page table (very much like the CPU page
table). Now inside that page table you can point a GPU virtual address
at GPU memory or at system memory. Those system memory entries can
also be marked as ATS against a given PASID.

On some GPUs you define a window of GPU virtual addresses that goes through
PASID & ATS (so accesses in that window do not go through the page table
but directly through PASID & ATS).

Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread
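
As a purely illustrative aside (this does not reflect any real GPU's page-table format, and the struct below is invented for this note), the model described above can be pictured as each GPU virtual range being backed either by local VRAM or by system memory reached through ATS for a given PASID:

#include <linux/types.h>

/* Illustrative only -- not any real GPU's PTE layout. */
struct example_gpu_va_range {
	u64	gpu_va;		/* start of the GPU virtual range            */
	u64	size;
	bool	ats;		/* false: 'target' is an offset into VRAM    */
				/* true:  translate via ATS for 'pasid'      */
	u32	pasid;		/* process address space used by ATS requests */
	u64	target;		/* VRAM offset or untranslated system address */
};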

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:07                               ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 16:07 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Christian König, Logan Gunthorpe, Alex Williamson,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Wed, May 09, 2018 at 03:41:44PM +0000, Stephen  Bates wrote:
> Christian
> 
> >    Interesting point, give me a moment to check that. That finally makes 
> >    all the hardware I have standing around here valuable :)
>     
> Yes. At the very least it provides an initial standards based path
> for P2P DMAs across RPs which is something we have discussed on this
> list in the past as being desirable.
> 
> BTW I am trying to understand how an ATS capable EP function determines
> when to perform an ATS Translation Request (ATS TR). Is there an
> upstream example of the driver for your APU that uses ATS? If so, can
> you provide a pointer to it. Do you provide some type of entry in the
> submission queues for commands going to the APU to indicate if the
> address associated with a specific command should be translated using
> ATS or not? Or do you simply enable ATS and then all addresses passed
> to your APU that miss the local cache result in a ATS TR?

On GPUs ATS is always tied to a PASID. You do not do the former without
the latter (AFAICT this is not doable, maybe through some JTAG but not
in normal operation).

GPUs are like CPUs, so you have GPU threads that run against an address
space. This address space uses a page table (very much like the CPU page
table). Now inside that page table you can point a GPU virtual address
at GPU memory or at system memory. Those system memory entries can
also be marked as ATS against a given PASID.

On some GPUs you define a window of GPU virtual addresses that goes through
PASID & ATS (so accesses in that window do not go through the page table
but directly through PASID & ATS).

Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:07                               ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 16:07 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

On Wed, May 09, 2018 at 03:41:44PM +0000, Stephen  Bates wrote:
> Christian
> 
> >    Interesting point, give me a moment to check that. That finally makes 
> >    all the hardware I have standing around here valuable :)
>     
> Yes. At the very least it provides an initial standards based path
> for P2P DMAs across RPs which is something we have discussed on this
> list in the past as being desirable.
> 
> BTW I am trying to understand how an ATS capable EP function determines
> when to perform an ATS Translation Request (ATS TR). Is there an
> upstream example of the driver for your APU that uses ATS? If so, can
> you provide a pointer to it. Do you provide some type of entry in the
> submission queues for commands going to the APU to indicate if the
> address associated with a specific command should be translated using
> ATS or not? Or do you simply enable ATS and then all addresses passed
> to your APU that miss the local cache result in a ATS TR?

On GPUs ATS is always tied to a PASID. You do not do the former without
the latter (AFAICT this is not doable, maybe through some JTAG but not
in normal operation).

GPUs are like CPUs, so you have GPU threads that run against an address
space. This address space uses a page table (very much like the CPU page
table). Now inside that page table you can point a GPU virtual address
at GPU memory or at system memory. Those system memory entries can
also be marked as ATS against a given PASID.

On some GPUs you define a window of GPU virtual addresses that goes through
PASID & ATS (so accesses in that window do not go through the page table
but directly through PASID & ATS).

Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:07                               ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 16:07 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Christian König, Logan Gunthorpe, Alex Williamson,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Wed, May 09, 2018 at 03:41:44PM +0000, Stephen  Bates wrote:
> Christian
> 
> >    Interesting point, give me a moment to check that. That finally makes 
> >    all the hardware I have standing around here valuable :)
>     
> Yes. At the very least it provides an initial standards based path
> for P2P DMAs across RPs which is something we have discussed on this
> list in the past as being desirable.
> 
> BTW I am trying to understand how an ATS capable EP function determines
> when to perform an ATS Translation Request (ATS TR). Is there an
> upstream example of the driver for your APU that uses ATS? If so, can
> you provide a pointer to it. Do you provide some type of entry in the
> submission queues for commands going to the APU to indicate if the
> address associated with a specific command should be translated using
> ATS or not? Or do you simply enable ATS and then all addresses passed
> to your APU that miss the local cache result in a ATS TR?

On GPUs ATS is always tied to a PASID. You do not do the former without
the latter (AFAICT this is not doable, maybe through some JTAG but not
in normal operation).

GPUs are like CPUs, so you have GPU threads that run against an address
space. This address space uses a page table (very much like the CPU page
table). Now inside that page table you can point a GPU virtual address
at GPU memory or at system memory. Those system memory entries can
also be marked as ATS against a given PASID.

On some GPUs you define a window of GPU virtual addresses that goes through
PASID & ATS (so accesses in that window do not go through the page table
but directly through PASID & ATS).

Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:07                               ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 16:07 UTC (permalink / raw)


On Wed, May 09, 2018 at 03:41:44PM +0000, Stephen  Bates wrote:
> Christian
> 
> >    Interesting point, give me a moment to check that. That finally makes 
> >    all the hardware I have standing around here valuable :)
>     
> Yes. At the very least it provides an initial standards based path
> for P2P DMAs across RPs which is something we have discussed on this
> list in the past as being desirable.
> 
> BTW I am trying to understand how an ATS capable EP function determines
> when to perform an ATS Translation Request (ATS TR). Is there an
> upstream example of the driver for your APU that uses ATS? If so, can
> you provide a pointer to it. Do you provide some type of entry in the
> submission queues for commands going to the APU to indicate if the
> address associated with a specific command should be translated using
> ATS or not? Or do you simply enable ATS and then all addresses passed
> to your APU that miss the local cache result in a ATS TR?

On GPUs ATS is always tied to a PASID. You do not do the former without
the latter (AFAICT this is not doable, maybe through some JTAG but not
in normal operation).

GPUs are like CPUs, so you have GPU threads that run against an address
space. This address space uses a page table (very much like the CPU page
table). Now inside that page table you can point a GPU virtual address
at GPU memory or at system memory. Those system memory entries can
also be marked as ATS against a given PASID.

On some GPUs you define a window of GPU virtual addresses that goes through
PASID & ATS (so accesses in that window do not go through the page table
but directly through PASID & ATS).

Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 16:07                               ` Jerome Glisse
                                                   ` (2 preceding siblings ...)
  (?)
@ 2018-05-09 16:30                                 ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 16:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

Hi Jerome

> Now inside that page table you can point GPU virtual address
> to use GPU memory or use system memory. Those system memory entry can
> also be mark as ATS against a given PASID.
    
Thanks. This all makes sense. 

But do you have examples of this in a kernel driver (if so can you point me to it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...

Stephen
    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:30                                 ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 16:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Logan Gunthorpe, Alex Williamson,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

Hi Jerome

> Now inside that page table you can point GPU virtual address
> to use GPU memory or use system memory. Those system memory entry can
> also be mark as ATS against a given PASID.

Thanks. This all makes sense. 

But do you have examples of this in a kernel driver (if so can you point me to it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...

Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:30                                 ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 16:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

Hi Jerome

> Now inside that page table you can point GPU virtual address
> to use GPU memory or use system memory. Those system memory entry can
> also be mark as ATS against a given PASID.
    
Thanks. This all makes sense. 

But do you have examples of this in a kernel driver (if so can you point me to it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...

Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:30                                 ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 16:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Logan Gunthorpe, Alex Williamson,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

Hi Jerome

> Now inside that page table you can point GPU virtual address
> to use GPU memory or use system memory. Those system memory entry can
> also be mark as ATS against a given PASID.
    
Thanks. This all makes sense. 

But do you have examples of this in a kernel driver (if so can you point me to it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...

Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:30                                 ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-09 16:30 UTC (permalink / raw)


Hi Jerome

> Now inside that page table you can point GPU virtual address
> to use GPU memory or use system memory. Those system memory entry can
> also be mark as ATS against a given PASID.
    
Thanks. This all makes sense. 

But do you have examples of this in a kernel driver (if so can you point me to it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...

Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 13:40                           ` Christian König
  (?)
  (?)
@ 2018-05-09 16:45                             ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-09 16:45 UTC (permalink / raw)
  To: Christian König, Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 09/05/18 07:40 AM, Christian König wrote:
> The key takeaway is that when any device has ATS enabled you can't 
> disable ACS without breaking it (even if you unplug and replug it).

I don't follow how you came to this conclusion...

The ACS bits we'd be turning off are the ones that force TLPs addressed
at a peer to go to the RC. However, ATS translation packets will be
addressed to an untranslated address which a switch will not identify as
a peer address, so it should send them upstream regardless of the state of the
ACS Req/Comp redirect bits.

Once the translation comes back, the ATS endpoint should send the TLP to
the peer address with the AT packet type and it will be directed to the
peer provided the Direct Translated bit is set (or the redirect bits are
unset).

I can't see how turning off the Req/Comp redirect bits could break
anything except for the isolation they provide.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread
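
As a concrete reference for which bits are being discussed, here is a minimal sketch (not code from this series; the function name is invented for illustration) that clears the P2P Request/Completion Redirect bits and sets Direct Translated P2P on a downstream port, using the existing PCI_ACS_* definitions from include/uapi/linux/pci_regs.h:

#include <linux/pci.h>

/* Illustrative sketch only -- not part of the posted series. */
static int example_relax_acs(struct pci_dev *bridge)
{
	int pos;
	u16 ctrl;

	pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
	if (!pos)
		return -ENODEV;

	pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);

	/*
	 * P2P Request Redirect and P2P Completion Redirect are the bits
	 * that force peer-addressed TLPs up to the Root Complex.
	 */
	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);

	/*
	 * With ATS-capable endpoints, Direct Translated P2P lets only
	 * already-translated (AT=Translated) requests go peer-to-peer
	 * while untranslated requests are still redirected upstream.
	 */
	ctrl |= PCI_ACS_DT;

	pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);
	return 0;
}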

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:45                             ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-09 16:45 UTC (permalink / raw)
  To: Christian König, Stephen Bates, Jerome Glisse
  Cc: Alex Williamson, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Benjamin Herrenschmidt



On 09/05/18 07:40 AM, Christian König wrote:
> The key takeaway is that when any device has ATS enabled you can't 
> disable ACS without breaking it (even if you unplug and replug it).

I don't follow how you came to this conclusion...

The ACS bits we'd be turning off are the ones that force TLPs addressed
at a peer to go to the RC. However, ATS translation packets will be
addressed to an untranslated address which a switch will not identify as
a peer address, so it should send them upstream regardless of the state of the
ACS Req/Comp redirect bits.

Once the translation comes back, the ATS endpoint should send the TLP to
the peer address with the AT packet type and it will be directed to the
peer provided the Direct Translated bit is set (or the redirect bits are
unset).

I can't see how turning off the Req/Comp redirect bits could break
anything except for the isolation they provide.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:45                             ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-09 16:45 UTC (permalink / raw)
  To: Christian König, Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 09/05/18 07:40 AM, Christian König wrote:
> The key takeaway is that when any device has ATS enabled you can't 
> disable ACS without breaking it (even if you unplug and replug it).

I don't follow how you came to this conclusion...

The ACS bits we'd be turning off are the ones that force TLPs addressed
at a peer to go to the RC. However, ATS translation packets will be
addressed to an untranslated address which a switch will not identify as
a peer address, so it should send them upstream regardless of the state of the
ACS Req/Comp redirect bits.

Once the translation comes back, the ATS endpoint should send the TLP to
the peer address with the AT packet type and it will be directed to the
peer provided the Direct Translated bit is set (or the redirect bits are
unset).

I can't see how turning off the Req/Comp redirect bits could break
anything except for the isolation they provide.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 16:45                             ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-09 16:45 UTC (permalink / raw)




On 09/05/18 07:40 AM, Christian König wrote:
> The key takeaway is that when any device has ATS enabled you can't 
> disable ACS without breaking it (even if you unplug and replug it).

I don't follow how you came to this conclusion...

The ACS bits we'd be turning off are the ones that force TLPs addressed
at a peer to go to the RC. However, ATS translation packets will be
addressed to an untranslated address which a switch will not identify as
a peer address, so it should send them upstream regardless of the state of the
ACS Req/Comp redirect bits.

Once the translation comes back, the ATS endpoint should send the TLP to
the peer address with the AT packet type and it will be directed to the
peer provided the Direct Translated bit is set (or the redirect bits are
unset).

I can't see how turning off the Req/Comp redirect bits could break
anything except for the isolation they provide.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 16:30                                 ` Stephen  Bates
                                                     ` (2 preceding siblings ...)
  (?)
@ 2018-05-09 17:49                                   ` Jerome Glisse
  -1 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 17:49 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

On Wed, May 09, 2018 at 04:30:32PM +0000, Stephen  Bates wrote:
> Hi Jerome
> 
> > Now inside that page table you can point GPU virtual address
> > to use GPU memory or use system memory. Those system memory entry can
> > also be mark as ATS against a given PASID.
>     
> Thanks. This all makes sense. 
> 
> But do you have examples of this in a kernel driver (if so can you point me too it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...
> 

As it is tied to a PASID this is done using the IOMMU, so look for callers
of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing
user is the AMD GPU driver, see:

drivers/gpu/drm/amd/
drivers/gpu/drm/amd/amdkfd/
drivers/gpu/drm/amd/amdgpu/

Lots of code there. The GPU code details do not really matter for
this discussion though. You do not need to do much to use PASID.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread
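
For anyone looking for those entry points, a rough sketch of how a driver binds a device to a process address space via the amd_iommu_v2 interface is below. The signatures are recalled from the amd_iommu_v2 API of this kernel era and the PASID count is arbitrary, so treat the details as approximate and check include/linux/amd-iommu.h before relying on them:

#include <linux/amd-iommu.h>
#include <linux/pci.h>
#include <linux/sched.h>

/* Rough sketch of the flow the amdkfd driver follows (error paths trimmed). */
static int example_bind_process(struct pci_dev *pdev, int pasid)
{
	int ret;

	/* Tell the IOMMU driver how many PASIDs this device will use. */
	ret = amd_iommu_init_device(pdev, 16);
	if (ret)
		return ret;

	/*
	 * Bind the PASID to the current process; after this, ATS translation
	 * requests tagged with 'pasid' are resolved against the process's
	 * page tables.
	 */
	ret = amd_iommu_bind_pasid(pdev, pasid, current);
	if (ret)
		amd_iommu_free_device(pdev);

	return ret;
}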

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 17:49                                   ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 17:49 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Christian König, Logan Gunthorpe, Alex Williamson,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Wed, May 09, 2018 at 04:30:32PM +0000, Stephen  Bates wrote:
> Hi Jerome
> 
> > Now inside that page table you can point GPU virtual address
> > to use GPU memory or use system memory. Those system memory entry can
> > also be mark as ATS against a given PASID.
>     
> Thanks. This all makes sense. 
> 
> But do you have examples of this in a kernel driver (if so can you point me too it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...
> 

As it is tied to a PASID this is done using the IOMMU, so look for callers
of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing
user is the AMD GPU driver, see:

drivers/gpu/drm/amd/
drivers/gpu/drm/amd/amdkfd/
drivers/gpu/drm/amd/amdgpu/

Lots of code there. The GPU code details do not really matter for
this discussion though. You do not need to do much to use PASID.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 17:49                                   ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 17:49 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

On Wed, May 09, 2018 at 04:30:32PM +0000, Stephen  Bates wrote:
> Hi Jerome
> 
> > Now inside that page table you can point GPU virtual address
> > to use GPU memory or use system memory. Those system memory entry can
> > also be mark as ATS against a given PASID.
>     
> Thanks. This all makes sense. 
> 
> But do you have examples of this in a kernel driver (if so can you point me too it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...
> 

As it is tied to a PASID this is done using the IOMMU, so look for callers
of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing
user is the AMD GPU driver, see:

drivers/gpu/drm/amd/
drivers/gpu/drm/amd/amdkfd/
drivers/gpu/drm/amd/amdgpu/

Lots of code there. The GPU code details do not really matter for
this discussion though. You do not need to do much to use PASID.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 17:49                                   ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 17:49 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Christian König, Logan Gunthorpe, Alex Williamson,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Wed, May 09, 2018 at 04:30:32PM +0000, Stephen  Bates wrote:
> Hi Jerome
> 
> > Now inside that page table you can point GPU virtual address
> > to use GPU memory or use system memory. Those system memory entry can
> > also be mark as ATS against a given PASID.
>     
> Thanks. This all makes sense. 
> 
> But do you have examples of this in a kernel driver (if so can you point me too it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...
> 

As it is tied to a PASID this is done using the IOMMU, so look for callers
of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing
user is the AMD GPU driver, see:

drivers/gpu/drm/amd/
drivers/gpu/drm/amd/amdkfd/
drivers/gpu/drm/amd/amdgpu/

Lots of code there. The GPU code details do not really matter for
this discussion though. You do not need to do much to use PASID.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-09 17:49                                   ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-09 17:49 UTC (permalink / raw)


On Wed, May 09, 2018 at 04:30:32PM +0000, Stephen  Bates wrote:
> Hi Jerome
> 
> > Now inside that page table you can point GPU virtual address
> > to use GPU memory or use system memory. Those system memory entry can
> > also be mark as ATS against a given PASID.
>     
> Thanks. This all makes sense. 
> 
> But do you have examples of this in a kernel driver (if so can you point me too it) or is this all done via user-space? Based on my grepping of the kernel code I see zero EP drivers using in-kernel ATS functionality right now...
> 

As it is tied to a PASID this is done using the IOMMU, so look for callers
of amd_iommu_bind_pasid() or intel_svm_bind_mm(). In GPUs the existing
user is the AMD GPU driver, see:

drivers/gpu/drm/amd/
drivers/gpu/drm/amd/amdkfd/
drivers/gpu/drm/amd/amdgpu/

Lots of code there. The GPU code details do not really matter for
this discussion though. You do not need to do much to use PASID.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 16:45                             ` Logan Gunthorpe
                                                 ` (2 preceding siblings ...)
  (?)
@ 2018-05-10 12:52                               ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-10 12:52 UTC (permalink / raw)
  To: Logan Gunthorpe, Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On 09.05.2018 at 18:45, Logan Gunthorpe wrote:
>
> On 09/05/18 07:40 AM, Christian König wrote:
>> The key takeaway is that when any device has ATS enabled you can't
>> disable ACS without breaking it (even if you unplug and replug it).
> I don't follow how you came to this conclusion...
>   The ACS bits we'd be turning off are the ones that force TLPs addressed
> at a peer to go to the RC. However, ATS translation packets will be
> addressed to an untranslated address which a switch will not identify as
> a peer address so it should send upstream regardless the state of the
> ACS Req/Comp redirect bits.

Why would a switch not identify that as a peer address? We use the PASID 
together with ATS to identify the address space which a transaction 
should use.

If I'm not completely mistaken, when you disable ACS it is perfectly 
possible that a bridge identifies a transaction as belonging to a peer 
address, which isn't what we want here.

Christian.

>
> Once the translation comes back, the ATS endpoint should send the TLP to
> the peer address with the AT packet type and it will be directed to the
> peer provided the Direct Translated bit is set (or the redirect bits are
> unset).
>
> I can't see how turning off the Req/Comp redirect bits could break
> anything except for the isolation they provide.
>
> Logan


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 12:52                               ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-10 12:52 UTC (permalink / raw)
  To: Logan Gunthorpe, Stephen Bates, Jerome Glisse
  Cc: Alex Williamson, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Benjamin Herrenschmidt

On 09.05.2018 at 18:45, Logan Gunthorpe wrote:
>
> On 09/05/18 07:40 AM, Christian König wrote:
>> The key takeaway is that when any device has ATS enabled you can't
>> disable ACS without breaking it (even if you unplug and replug it).
> I don't follow how you came to this conclusion...
>   The ACS bits we'd be turning off are the ones that force TLPs addressed
> at a peer to go to the RC. However, ATS translation packets will be
> addressed to an untranslated address which a switch will not identify as
> a peer address so it should send upstream regardless the state of the
> ACS Req/Comp redirect bits.

Why would a switch not identify that as a peer address? We use the PASID 
together with ATS to identify the address space which a transaction 
should use.

If I'm not completely mistaken, when you disable ACS it is perfectly 
possible that a bridge identifies a transaction as belonging to a peer 
address, which isn't what we want here.

Christian.

>
> Once the translation comes back, the ATS endpoint should send the TLP to
> the peer address with the AT packet type and it will be directed to the
> peer provided the Direct Translated bit is set (or the redirect bits are
> unset).
>
> I can't see how turning off the Req/Comp redirect bits could break
> anything except for the isolation they provide.
>
> Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 12:52                               ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-10 12:52 UTC (permalink / raw)
  To: Logan Gunthorpe, Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On 09.05.2018 at 18:45, Logan Gunthorpe wrote:
>
> On 09/05/18 07:40 AM, Christian König wrote:
>> The key takeaway is that when any device has ATS enabled you can't
>> disable ACS without breaking it (even if you unplug and replug it).
> I don't follow how you came to this conclusion...
>   The ACS bits we'd be turning off are the ones that force TLPs addressed
> at a peer to go to the RC. However, ATS translation packets will be
> addressed to an untranslated address which a switch will not identify as
> a peer address so it should send upstream regardless the state of the
> ACS Req/Comp redirect bits.

Why would a switch not identify that as a peer address? We use the PASID 
together with ATS to identify the address space which a transaction 
should use.

If I'm not completely mistaken, when you disable ACS it is perfectly 
possible that a bridge identifies a transaction as belonging to a peer 
address, which isn't what we want here.

Christian.

>
> Once the translation comes back, the ATS endpoint should send the TLP to
> the peer address with the AT packet type and it will be directed to the
> peer provided the Direct Translated bit is set (or the redirect bits are
> unset).
>
> I can't see how turning off the Req/Comp redirect bits could break
> anything except for the isolation they provide.
>
> Logan


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 12:52                               ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-10 12:52 UTC (permalink / raw)
  To: Logan Gunthorpe, Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Dan Williams, Christoph Hellwig

On 09.05.2018 at 18:45, Logan Gunthorpe wrote:
>
> On 09/05/18 07:40 AM, Christian König wrote:
>> The key takeaway is that when any device has ATS enabled you can't
>> disable ACS without breaking it (even if you unplug and replug it).
> I don't follow how you came to this conclusion...
>   The ACS bits we'd be turning off are the ones that force TLPs addressed
> at a peer to go to the RC. However, ATS translation packets will be
> addressed to an untranslated address which a switch will not identify as
> a peer address so it should send upstream regardless the state of the
> ACS Req/Comp redirect bits.

Why would a switch not identify that as a peer address? We use the PASID
together with ATS to identify the address space which a transaction
should use.

If I'm not completely mistaken, when you disable ACS it is perfectly
possible that a bridge identifies a transaction as belonging to a peer
address, which isn't what we want here.

Christian.

>
> Once the translation comes back, the ATS endpoint should send the TLP to
> the peer address with the AT packet type and it will be directed to the
> peer provided the Direct Translated bit is set (or the redirect bits are
> unset).
>
> I can't see how turning off the Req/Comp redirect bits could break
> anything except for the isolation they provide.
>
> Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 12:52                               ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-10 12:52 UTC (permalink / raw)


On 09.05.2018 at 18:45, Logan Gunthorpe wrote:
>
> On 09/05/18 07:40 AM, Christian König wrote:
>> The key takeaway is that when any device has ATS enabled you can't
>> disable ACS without breaking it (even if you unplug and replug it).
> I don't follow how you came to this conclusion...
>   The ACS bits we'd be turning off are the ones that force TLPs addressed
> at a peer to go to the RC. However, ATS translation packets will be
> addressed to an untranslated address which a switch will not identify as
> a peer address so it should send upstream regardless the state of the
> ACS Req/Comp redirect bits.

Why would a switch not identify that as a peer address? We use the PASID 
together with ATS to identify the address space which a transaction 
should use.

If I'm not completely mistaken, when you disable ACS it is perfectly 
possible that a bridge identifies a transaction as belonging to a peer 
address, which isn't what we want here.

Christian.

>
> Once the translation comes back, the ATS endpoint should send the TLP to
> the peer address with the AT packet type and it will be directed to the
> peer provided the Direct Translated bit is set (or the redirect bits are
> unset).
>
> I can't see how turning off the Req/Comp redirect bits could break
> anything except for the isolation they provide.
>
> Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 12:52                               ` Christian König
                                                   ` (2 preceding siblings ...)
  (?)
@ 2018-05-10 14:16                                 ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 14:16 UTC (permalink / raw)
  To: Christian König
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Hi Christian

> Why would a switch not identify that as a peer address? We use the PASID 
>    together with ATS to identify the address space which a transaction 
>    should use.

I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP and so regardless of ACS it is going up to the Root Port. When it gets the response it gets the physical address and can use that with the TA bit set for the p2pdma. In the case of ATS support we also have more control over ACS as we can disable it just for TA addresses (as per 7.7.7.7.2 of the spec).
    
 >   If I'm not completely mistaken when you disable ACS it is perfectly 
 >   possible that a bridge identifies a transaction as belonging to a peer 
 >   address, which isn't what we want here.
   
You are right here and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:

If we want to do a P2PDMA and the DMA device does not support ATS then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO Memory address in the same PCI domain. So if we disable ACS we are in trouble as we might MemWr to the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping issues.

So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I know this is problematic for AMD's use case, so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA initiators).
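
If we did go that way, the policy check could be as small as something
like this (purely illustrative, the function name is invented and this
is not in the series):

#include <linux/iommu.h>
#include <linux/pci.h>

static bool p2pdma_may_initiate(struct pci_dev *pdev)
{
	/* No IOMMU: the EP is handed addresses it can use directly. */
	if (!iommu_present(&pci_bus_type))
		return true;

	/*
	 * IOMMU enabled: only endpoints that can obtain translated
	 * addresses through ATS are safe as P2PDMA initiators.
	 */
	return pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ATS) != 0;
}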

Make sense?

Stephen
 
    
    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-09 17:49                                   ` Jerome Glisse
@ 2018-05-10 14:20                                     ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 14:20 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

Hi Jerome

> As it is tie to PASID this is done using IOMMU so looks for caller
> of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
>  user is the AMD GPU driver see:
    
Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.
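
For anyone else following along, the AMD flow is roughly the following
(the wrapper name is invented and the amd_iommu_* signatures are from
memory, so please check include/linux/amd-iommu.h):

#include <linux/amd-iommu.h>
#include <linux/pci.h>
#include <linux/sched.h>

static int example_bind_gpu_to_current_mm(struct pci_dev *gpu, int pasid)
{
	/* Tell the IOMMUv2 layer how many PASIDs this device will use. */
	int ret = amd_iommu_init_device(gpu, 16);

	if (ret)
		return ret;

	/*
	 * Bind the current process' mm to the PASID; ATS/PRI requests
	 * tagged with this PASID are then translated against the CPU
	 * page tables.
	 */
	return amd_iommu_bind_pasid(gpu, pasid, current);
}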

One thing I could not grok from the code is how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements some way of indicating that and it's not just a global ON or OFF for all DMAs? The reason I ask is that I am looking at what would need to be added to the NVMe spec, above and beyond what we have in PCI ATS, if NVMe were to support ATS efficiently (for example, would we need a flag in the submission queue entries to indicate that a particular IO's SGL/PRP should undergo ATS).

Cheers

Stephen    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 14:20                                     ` Stephen  Bates
@ 2018-05-10 14:29                                       ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-10 14:29 UTC (permalink / raw)
  To: Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Am 10.05.2018 um 16:20 schrieb Stephen Bates:
> Hi Jerome
>
>> As it is tie to PASID this is done using IOMMU so looks for caller
>> of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
>>   user is the AMD GPU driver see:
>      
> Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.

Just FYI: there is also another effort ongoing to give the AMD, Intel 
and ARM IOMMUs a common interface so that drivers can use whatever the 
platform offers for SVM support.

> One thing I could not grok from the code how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements someway of indicating that and its not just a global ON or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to support ATS what would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example would we need a flag in the submission queue entries to indicate a particular IO's SGL/PRP should undergo ATS).

Oh, well that is complicated at best.

On very old hardware it wasn't a window; instead you had to use special 
commands in your shader to indicate that you wanted an ATS transaction 
instead of a normal PCIe transaction for your read/write/atomic.

As Jerome explained, on most hardware we have a window inside the 
internal GPU address space which, when accessed, issues an ATS 
transaction with a configurable PASID.

But on newer hardware that window became a bit in the GPUVM page 
tables, so in theory we can now control it at 4K granularity across the 
internal 48-bit GPU address space.

Christian.

>
> Cheers
>
> Stephen
>


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 14:16                                 ` Stephen  Bates
@ 2018-05-10 14:41                                   ` Jerome Glisse
  -1 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 14:41 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

On Thu, May 10, 2018 at 02:16:25PM +0000, Stephen  Bates wrote:
> Hi Christian
> 
> > Why would a switch not identify that as a peer address? We use the PASID 
> >    together with ATS to identify the address space which a transaction 
> >    should use.
> 
> I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP and so regardless of ACS it is going up to the Root Port. When it gets the response it gets the physical address and can use that with the TA bit set for the p2pdma. In the case of ATS support we also have more control over ACS as we can disable it just for TA addresses (as per 7.7.7.7.2 of the spec).
>     
>  >   If I'm not completely mistaken when you disable ACS it is perfectly 
>  >   possible that a bridge identifies a transaction as belonging to a peer 
>  >   address, which isn't what we want here.
>    
> You are right here and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:
> 
> If we want to do a P2PDMA and the DMA device does not support ATS then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO Memory address in the same PCI domain. So if we disable ACS we are in trouble as we might MemWr to the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping issues.
> 
> So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I know this is problematic for AMDs use case so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA DMA iniators).
> 
> Make sense?
> 

Note that on GPUs we would not rely on ATS for peer to peer. Some parts
of the GPU (the DMA engines) do not necessarily support ATS, yet those
are the parts most likely to be used for peer to peer.

However, there is a distinction in objectives here that I believe is
getting lost. We (the GPU people, aka the good guys ;)) do not want to
do peer to peer for performance reasons, i.e. we do not mind our
transactions going up to the root complex and back down to the
destination. At least in the use case I am working on this is fine.

The reason is that GPUs are giving up on PCIe for that (see all the
specialized links like NVLink that are popping up in the GPU space), so
for fast GPU interconnect we have these new links. Yet for legacy and
interoperability we would like to do peer to peer with other devices
like RDMA ... and going through the root complex would be fine from a
performance point of view. Worst case it is slower than the existing
design where system memory is used as a bounce buffer.

Also, the IOMMU isolation does matter a lot to us. Think of someone
using this peer to peer to gain control of a server in the cloud.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 14:29                                       ` Christian König
@ 2018-05-10 14:59                                         ` Jerome Glisse
  -1 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 14:59 UTC (permalink / raw)
  To: Christian König
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, May 10, 2018 at 04:29:44PM +0200, Christian König wrote:
> Am 10.05.2018 um 16:20 schrieb Stephen Bates:
> > Hi Jerome
> > 
> > > As it is tie to PASID this is done using IOMMU so looks for caller
> > > of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
> > >   user is the AMD GPU driver see:
> > Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.
> 
> Just FYI: There is also another effort ongoing to give both the AMD, Intel
> as well as ARM IOMMUs a common interface so that drivers can use whatever
> the platform offers fro SVM support.
> 
> > One thing I could not grok from the code how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements someway of indicating that and its not just a global ON or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to support ATS what would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example would we need a flag in the submission queue entries to indicate a particular IO's SGL/PRP should undergo ATS).
> 
> Oh, well that is complicated at best.
> 
> On very old hardware it wasn't a window, but instead you had to use special
> commands in your shader which indicated that you want to use an ATS
> transaction instead of a normal PCIe transaction for your read/write/atomic.
> 
> As Jerome explained on most hardware we have a window inside the internal
> GPU address space which when accessed issues a ATS transaction with a
> configurable PASID.
> 
> But on very newer hardware that window became a bit in the GPUVM page
> tables, so in theory we now can control it on a 4K granularity basis for the
> internal 48bit GPU address space.
> 

To complete this, a 50-line primer on GPUs:

GPUVA - GPU virtual address
GPUPA - GPU physical address

GPUs run programs very much like a CPU does, except a program will have
many thousands of threads running concurrently. There is a hierarchy of
groups for a given program, ie threads are grouped together; the lowest
hierarchy level has a group size of <= 64 threads on most GPUs.

Those programs (called shaders for graphics, think OpenGL or Vulkan, or
compute kernels for GPGPU, think OpenCL or CUDA) are submitted by
userspace against a given address space. In the "old" days (a couple of
years back when dinosaurs were still roaming the earth) this address
space was specific to the GPU, and each user space program could create
multiple GPU address spaces. All the memory operations done by the
program were against this address space. Hence every PCIe transaction
originates from a program + address space.

GPUs use page tables + a window aperture (the window aperture is going
away, so you can focus on the page tables) to translate a GPU virtual
address into a physical address. The physical address can point to GPU
local memory, to system memory, or to another PCIe device's memory (ie
some PCIe BAR).

So every PCIe transaction goes through this GPUVA to GPUPA translation,
and the GPUPA is then handled by the GPU MMU unit, which either spawns
a PCIe transaction for a non-local GPUPA or accesses local memory
otherwise.
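
In purely conceptual pseudo-C (all names invented, nothing like real
driver code), the routing decision is just:

enum gpupa_kind { GPUPA_LOCAL_VRAM, GPUPA_SYSTEM_RAM, GPUPA_PEER_BAR };

struct gpupa {
	enum gpupa_kind kind;
	unsigned long long addr;
};

/* Stand-in for the GPUVM page table walk (GPUVA -> GPUPA). */
static struct gpupa gpuvm_translate(unsigned long long gpuva)
{
	struct gpupa pa = { GPUPA_SYSTEM_RAM, gpuva };
	return pa;
}

static void gpu_mmu_access(unsigned long long gpuva)
{
	struct gpupa pa = gpuvm_translate(gpuva);

	if (pa.kind == GPUPA_LOCAL_VRAM) {
		/* stays inside the GPU: local memory access */
	} else {
		/* spawns a PCIe transaction: system RAM or a peer BAR */
	}
}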


So per se the kernel driver does not configure which transaction uses
ATS or peer to peer. Userspace programs create a GPU virtual address
space and bind objects into it. An object can be system memory or some
other PCIe device's memory, in which case we would do a peer to peer.


So you won't find any of that logic in the kernel. What you find is the
creation of virtual address spaces and the binding of objects.


Above I talked about the old days; nowadays we want the GPU virtual
address space to be exactly the same as the CPU virtual address space of
the process which initiated the GPU program. This is where we use PASID
and ATS. Here userspace creates a special "GPU context" that says the
GPU virtual address space will be the same as that of the program which
created the GPU context. A PASID is then allocated and the mm_struct is
bound to this PASID in the IOMMU driver. All programs executed on the
GPU then use the PASID to identify the address space against which they
are running.


In all of the above I did not talk about the DMA engines, which sit on
the "side" of the GPU to copy memory around. GPUs have multiple DMA
engines with different capabilities; some of those DMA engines use the
same GPU address space as described above, others use GPUPA directly.


Hope this helps with understanding the big picture. I oversimplified
things, and the devil is in the details.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 14:59                                         ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 14:59 UTC (permalink / raw)
  To: Christian König
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, May 10, 2018 at 04:29:44PM +0200, Christian König wrote:
> Am 10.05.2018 um 16:20 schrieb Stephen Bates:
> > Hi Jerome
> > 
> > > As it is tie to PASID this is done using IOMMU so looks for caller
> > > of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
> > >   user is the AMD GPU driver see:
> > Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.
> 
> Just FYI: There is also another effort ongoing to give both the AMD, Intel
> as well as ARM IOMMUs a common interface so that drivers can use whatever
> the platform offers fro SVM support.
> 
> > One thing I could not grok from the code how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements someway of indicating that and its not just a global ON or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to support ATS what would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example would we need a flag in the submission queue entries to indicate a particular IO's SGL/PRP should undergo ATS).
> 
> Oh, well that is complicated at best.
> 
> On very old hardware it wasn't a window, but instead you had to use special
> commands in your shader which indicated that you want to use an ATS
> transaction instead of a normal PCIe transaction for your read/write/atomic.
> 
> As Jerome explained on most hardware we have a window inside the internal
> GPU address space which when accessed issues a ATS transaction with a
> configurable PASID.
> 
> But on very newer hardware that window became a bit in the GPUVM page
> tables, so in theory we now can control it on a 4K granularity basis for the
> internal 48bit GPU address space.
> 

To complete this a 50 lines primer on GPU:

GPUVA - GPU virtual address
GPUPA - GPU physical address

GPU run programs very much like CPU program expect a program will have
many thousands of threads running concurrently. There is a hierarchy of
groups for a given program ie threads are grouped together, the lowest
hierarchy level have a group size in <= 64 threads on most GPUs.

Those programs (call shader for graphic program think OpenGL, Vulkan
or compute for GPGPU think OpenCL CUDA) are submited by the userspace
against a given address space. In the "old" days (couple years back
when dinausor were still roaming the earth) this address space was
specific to the GPU and each user space program could create multiple
GPU address space. All the memory operation done by the program was
against this address space. Hence all PCIE transactions are spawn from
a program + address space.

GPU use page table + window aperture (the window aperture is going away
so you can focus on page table). To translate GPU virtual address into
a physical address. The physical address can point to GPU local memory
or to system memory or to another PCIE device memory (ie some PCIE BAR).

So all PCIE transaction are spawn through this process of GPUVA to GPUPA
then GPUPA is handled by the GPU mmu unit that either spawn a PCIE
transaction for non local GPUPA or access local memory otherwise.


So per say the kernel driver does not configure which transaction is
using ATS or peer to peer. Userspace program create a GPU virtual address
space and bind object into it. This object can be system memory or some
other PCIE device memory in which case we would to do a peer to peer.


So you won't find any logic in the kernel. What you find is creating
virtual address space and binding object.


Above i talk about the old days, nowadays we want the GPU virtual address
space to be exactly the same as the CPU virtual address space as the
process which initiate the GPU program is using. This is where we use the
PASID and ATS. So here userspace create a special "GPU context" that says
that the GPU virtual address space will be the same as the program that
create the GPU context. A process ID is then allocated and the mm_struct
is bind to this process ID in the IOMMU driver. Then all program executed
on the GPU use the process ID to identify the address space against which
they are running.


All of the above i did not talk about DMA engine which are on the "side"
of the GPU to copy memory around. GPU have multiple DMA engines with
different capabilities, some of those DMA engine use the same GPU address
space as describe above, other use directly GPUPA.


Hopes this helps understanding the big picture. I over simplify thing and
devils is in the details.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 14:59                                         ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 14:59 UTC (permalink / raw)
  To: Christian König
  Cc: Stephen Bates, Logan Gunthorpe, Alex Williamson, Bjorn Helgaas,
	linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Benjamin Herrenschmidt

On Thu, May 10, 2018 at 04:29:44PM +0200, Christian König wrote:
> Am 10.05.2018 um 16:20 schrieb Stephen Bates:
> > Hi Jerome
> > 
> > > As it is tie to PASID this is done using IOMMU so looks for caller
> > > of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
> > >   user is the AMD GPU driver see:
> > Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.
> 
> Just FYI: There is also another effort ongoing to give both the AMD, Intel
> as well as ARM IOMMUs a common interface so that drivers can use whatever
> the platform offers fro SVM support.
> 
> > One thing I could not grok from the code how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements someway of indicating that and its not just a global ON or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to support ATS what would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example would we need a flag in the submission queue entries to indicate a particular IO's SGL/PRP should undergo ATS).
> 
> Oh, well that is complicated at best.
> 
> On very old hardware it wasn't a window, but instead you had to use special
> commands in your shader which indicated that you want to use an ATS
> transaction instead of a normal PCIe transaction for your read/write/atomic.
> 
> As Jerome explained on most hardware we have a window inside the internal
> GPU address space which when accessed issues a ATS transaction with a
> configurable PASID.
> 
> But on very newer hardware that window became a bit in the GPUVM page
> tables, so in theory we now can control it on a 4K granularity basis for the
> internal 48bit GPU address space.
> 

To complete this a 50 lines primer on GPU:

GPUVA - GPU virtual address
GPUPA - GPU physical address

GPU run programs very much like CPU program expect a program will have
many thousands of threads running concurrently. There is a hierarchy of
groups for a given program ie threads are grouped together, the lowest
hierarchy level have a group size in <= 64 threads on most GPUs.

Those programs (call shader for graphic program think OpenGL, Vulkan
or compute for GPGPU think OpenCL CUDA) are submited by the userspace
against a given address space. In the "old" days (couple years back
when dinausor were still roaming the earth) this address space was
specific to the GPU and each user space program could create multiple
GPU address space. All the memory operation done by the program was
against this address space. Hence all PCIE transactions are spawn from
a program + address space.

GPU use page table + window aperture (the window aperture is going away
so you can focus on page table). To translate GPU virtual address into
a physical address. The physical address can point to GPU local memory
or to system memory or to another PCIE device memory (ie some PCIE BAR).

So all PCIE transaction are spawn through this process of GPUVA to GPUPA
then GPUPA is handled by the GPU mmu unit that either spawn a PCIE
transaction for non local GPUPA or access local memory otherwise.


So per say the kernel driver does not configure which transaction is
using ATS or peer to peer. Userspace program create a GPU virtual address
space and bind object into it. This object can be system memory or some
other PCIE device memory in which case we would to do a peer to peer.


So you won't find any logic in the kernel. What you find is creating
virtual address space and binding object.


Above i talk about the old days, nowadays we want the GPU virtual address
space to be exactly the same as the CPU virtual address space as the
process which initiate the GPU program is using. This is where we use the
PASID and ATS. So here userspace create a special "GPU context" that says
that the GPU virtual address space will be the same as the program that
create the GPU context. A process ID is then allocated and the mm_struct
is bind to this process ID in the IOMMU driver. Then all program executed
on the GPU use the process ID to identify the address space against which
they are running.


All of the above i did not talk about DMA engine which are on the "side"
of the GPU to copy memory around. GPU have multiple DMA engines with
different capabilities, some of those DMA engine use the same GPU address
space as describe above, other use directly GPUPA.


Hopes this helps understanding the big picture. I over simplify thing and
devils is in the details.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 14:59                                         ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 14:59 UTC (permalink / raw)


On Thu, May 10, 2018@04:29:44PM +0200, Christian K?nig wrote:
> Am 10.05.2018 um 16:20 schrieb Stephen Bates:
> > Hi Jerome
> > 
> > > As it is tie to PASID this is done using IOMMU so looks for caller
> > > of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
> > >   user is the AMD GPU driver see:
> > Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.
> 
> Just FYI: There is also another effort ongoing to give both the AMD, Intel
> as well as ARM IOMMUs a common interface so that drivers can use whatever
> the platform offers fro SVM support.
> 
> > One thing I could not grok from the code how the GPU driver indicates which DMA events require ATS translations and which do not. I am assuming the driver implements someway of indicating that and its not just a global ON or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to support ATS what would need to be added in the NVMe spec above and beyond what we have in PCI ATS to support efficient use of ATS (for example would we need a flag in the submission queue entries to indicate a particular IO's SGL/PRP should undergo ATS).
> 
> Oh, well that is complicated at best.
> 
> On very old hardware it wasn't a window, but instead you had to use special
> commands in your shader which indicated that you want to use an ATS
> transaction instead of a normal PCIe transaction for your read/write/atomic.
> 
> As Jerome explained on most hardware we have a window inside the internal
> GPU address space which when accessed issues a ATS transaction with a
> configurable PASID.
> 
> But on very newer hardware that window became a bit in the GPUVM page
> tables, so in theory we now can control it on a 4K granularity basis for the
> internal 48bit GPU address space.
> 

To complete this a 50 lines primer on GPU:

GPUVA - GPU virtual address
GPUPA - GPU physical address

GPU run programs very much like CPU program expect a program will have
many thousands of threads running concurrently. There is a hierarchy of
groups for a given program ie threads are grouped together, the lowest
hierarchy level have a group size in <= 64 threads on most GPUs.

Those programs (call shader for graphic program think OpenGL, Vulkan
or compute for GPGPU think OpenCL CUDA) are submited by the userspace
against a given address space. In the "old" days (couple years back
when dinausor were still roaming the earth) this address space was
specific to the GPU and each user space program could create multiple
GPU address space. All the memory operation done by the program was
against this address space. Hence all PCIE transactions are spawn from
a program + address space.

GPU use page table + window aperture (the window aperture is going away
so you can focus on page table). To translate GPU virtual address into
a physical address. The physical address can point to GPU local memory
or to system memory or to another PCIE device memory (ie some PCIE BAR).

So all PCIE transaction are spawn through this process of GPUVA to GPUPA
then GPUPA is handled by the GPU mmu unit that either spawn a PCIE
transaction for non local GPUPA or access local memory otherwise.


So per say the kernel driver does not configure which transaction is
using ATS or peer to peer. Userspace program create a GPU virtual address
space and bind object into it. This object can be system memory or some
other PCIE device memory in which case we would to do a peer to peer.


So you won't find any logic in the kernel. What you find is creating
virtual address space and binding object.


Above i talk about the old days, nowadays we want the GPU virtual address
space to be exactly the same as the CPU virtual address space as the
process which initiate the GPU program is using. This is where we use the
PASID and ATS. So here userspace create a special "GPU context" that says
that the GPU virtual address space will be the same as the program that
create the GPU context. A process ID is then allocated and the mm_struct
is bind to this process ID in the IOMMU driver. Then all program executed
on the GPU use the process ID to identify the address space against which
they are running.


All of the above i did not talk about DMA engine which are on the "side"
of the GPU to copy memory around. GPU have multiple DMA engines with
different capabilities, some of those DMA engine use the same GPU address
space as describe above, other use directly GPUPA.


Hopes this helps understanding the big picture. I over simplify thing and
devils is in the details.

Cheers,
J?r?me

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 14:16                                 ` Stephen  Bates
  (?)
  (?)
@ 2018-05-10 16:32                                   ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-10 16:32 UTC (permalink / raw)
  To: Stephen Bates, Christian König, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 10/05/18 08:16 AM, Stephen  Bates wrote:
> Hi Christian
> 
>> Why would a switch not identify that as a peer address? We use the PASID 
>>    together with ATS to identify the address space which a transaction 
>>    should use.
> 
> I think you are conflating two types of TLPs here. If the device supports ATS then it will issue a TR TLP to obtain a translated address from the IOMMU. This TR TLP will be addressed to the RP and so regardless of ACS it is going up to the Root Port. When it gets the response it gets the physical address and can use that with the TA bit set for the p2pdma. In the case of ATS support we also have more control over ACS as we can disable it just for TA addresses (as per 7.7.7.7.2 of the spec).

Yes. Remember, if we are using the IOMMU the EP is being programmed
(regardless of whether it's a DMA engine, NTB window or GPUVA) with an
IOVA address, which is separate from the device's PCI bus address. Any
packet addressed to an IOVA is going to go back to the root
complex no matter what the ACS bits say. Only once ATS translates the
address back into the PCI bus address will the EP send packets to the
peer; the switch will then attempt to route them to the peer, and only then
do the ACS bits apply. And the ACS Direct Translated bit allows packets
that have purportedly been translated to pass through.
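
To illustrate the distinction, a minimal, hypothetical sketch of the two
kinds of addresses an EP could be handed for a peer's BAR.
pci_bus_address() and dma_map_resource() are real kernel helpers, but this
wrapper and its use here are just an example of the two address types, not
how the p2pdma series programs anything:

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Hypothetical example: the raw bus address bypasses the IOMMU entirely,
 * while the dma_map_resource() IOVA heads to the root complex until ATS
 * translates it back. */
static void hypothetical_p2p_addresses(struct pci_dev *initiator,
				       struct pci_dev *peer, int bar)
{
	size_t len = pci_resource_len(peer, bar);
	pci_bus_addr_t bus_addr = pci_bus_address(peer, bar);
	dma_addr_t iova;

	/* IOVA: what an ATS-capable EP would be programmed with. */
	iova = dma_map_resource(&initiator->dev, pci_resource_start(peer, bar),
				len, DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(&initiator->dev, iova))
		return;

	dev_info(&initiator->dev, "peer BAR%d: bus address %#llx, IOVA %#llx\n",
		 bar, (unsigned long long)bus_addr, (unsigned long long)iova);

	dma_unmap_resource(&initiator->dev, iova, len, DMA_BIDIRECTIONAL, 0);
}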

>  >   If I'm not completely mistaken when you disable ACS it is perfectly 
>  >   possible that a bridge identifies a transaction as belonging to a peer 
>  >   address, which isn't what we want here.
>    
> You are right here and I think this illustrates a problem for using the IOMMU at all when P2PDMA devices do not support ATS. Let me explain:
> 
> If we want to do a P2PDMA and the DMA device does not support ATS then I think we have to disable the IOMMU (something Mike suggested earlier). The reason is that since ATS is not an option the EP must initiate the DMA using the addresses passed down to it. If the IOMMU is on then this is an IOVA that could (with some non-zero probability) point to an IO Memory address in the same PCI domain. So if we disable ACS we are in trouble as we might MemWr to the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping issues.
> So I think if we want to support performant P2PDMA for devices that don't have ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I know this is problematic for AMD's use case, so perhaps we also need to consider a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU (but in this case EPs without ATS cannot participate as P2PDMA initiators).
> 
> Make sense?

Not to me. In the p2pdma code we specifically program DMA engines with
the PCI bus address. So regardless of whether we are using the IOMMU or
not, the packets will be forwarded directly to the peer. If the ACS
Redir bits are on they will be forced back to the RC by the switch and
the transaction will fail. If we clear the ACS bits, the TLPs will go
where we want and everything will work (but we lose the isolation of ACS).
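
For reference, a minimal sketch of what clearing those redirect bits on a
bridge port looks like with the standard config-space accessors (a
hypothetical helper, not the code from this series):

#include <linux/pci.h>

/* Hypothetical helper: stop this bridge port from redirecting P2P
 * requests/completions upstream, so TLPs aimed at a peer's bus address are
 * routed directly by the switch (at the cost of ACS isolation). */
static int hypothetical_acs_clear_redir(struct pci_dev *bridge)
{
	int pos;
	u16 ctrl;

	pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
	if (!pos)
		return -ENODEV;		/* no ACS capability, nothing to do */

	pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);	/* Request/Completion Redirect */
	pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);

	return 0;
}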

For EPs that support ATS, we should (but don't necessarily have to)
program them with the IOVA address so they can go through the
translation process which will allow P2P without disabling the ACS Redir
bits -- provided the ACS direct translation bit is set. (And btw, if it
is, then we lose the benefit of ACS protecting against malicious EPs).
But, per above, the ATS transaction should involve only the IOVA address
so the ACS bits not being set should not break ATS.
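
And the ATS-friendly variant of the same sketch (same includes): leave the
redirect bits alone and set only the Direct Translated P2P enable. Again a
hypothetical helper, shown only to make the bit names concrete:

/* Hypothetical helper: allow Translated (ATS) requests to be routed
 * directly to the peer while Untranslated requests are still redirected
 * upstream for IOMMU checking. */
static int hypothetical_acs_enable_direct_translated(struct pci_dev *bridge)
{
	int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return -ENODEV;

	pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
	ctrl |= PCI_ACS_DT;			/* Direct Translated P2P */
	pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);

	return 0;
}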

Logan




^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 16:32                                   ` Logan Gunthorpe
                                                       ` (2 preceding siblings ...)
  (?)
@ 2018-05-10 17:11                                     ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 17:11 UTC (permalink / raw)
  To: Logan Gunthorpe, Christian König
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

> Not to me. In the p2pdma code we specifically program DMA engines with
> the PCI bus address. 

Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...

> So regardless of whether we are using the IOMMU or
> not, the packets will be forwarded directly to the peer. If the ACS
>  Redir bits are on they will be forced back to the RC by the switch and
>  the transaction will fail. If we clear the ACS bits, the TLPs will go
>  where we want and everything will work (but we lose the isolation of ACS).

Agreed.
    
>    For EPs that support ATS, we should (but don't necessarily have to)
>    program them with the IOVA address so they can go through the
>    translation process which will allow P2P without disabling the ACS Redir
>    bits -- provided the ACS direct translation bit is set. (And btw, if it
>    is, then we lose the benefit of ACS protecting against malicious EPs).
>    But, per above, the ATS transaction should involve only the IOVA address
>    so the ACS bits not being set should not break ATS.
    
Well we would still have to clear some ACS bits but now we can clear only for translated addresses.

Stephen
    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 17:11                                     ` Stephen  Bates
  (?)
  (?)
@ 2018-05-10 17:15                                       ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-10 17:15 UTC (permalink / raw)
  To: Stephen Bates, Christian König, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 10/05/18 11:11 AM, Stephen  Bates wrote:
>> Not to me. In the p2pdma code we specifically program DMA engines with
>> the PCI bus address. 
> 
> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...
> 
>> So regardless of whether we are using the IOMMU or
>> not, the packets will be forwarded directly to the peer. If the ACS
>>  Redir bits are on they will be forced back to the RC by the switch and
>>  the transaction will fail. If we clear the ACS bits, the TLPs will go
>>  where we want and everything will work (but we lose the isolation of ACS).
> 
> Agreed.
>     
>>    For EPs that support ATS, we should (but don't necessarily have to)
>>    program them with the IOVA address so they can go through the
>>    translation process which will allow P2P without disabling the ACS Redir
>>    bits -- provided the ACS direct translation bit is set. (And btw, if it
>>    is, then we lose the benefit of ACS protecting against malicious EPs).
>>    But, per above, the ATS transaction should involve only the IOVA address
>>    so the ACS bits not being set should not break ATS.
>     
> Well we would still have to clear some ACS bits but now we can clear only for translated addresses.

We don't have to clear the ACS Redir bits as we did in the first case.
We just have to make sure the ACS Direct Translated bit is set.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 14:41                                   ` Jerome Glisse
                                                       ` (3 preceding siblings ...)
  (?)
@ 2018-05-10 18:41                                     ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

Hi Jerome

>    Note that on GPU we would not rely on ATS for peer to peer. Some parts
>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>    are the parts likely to be used in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
    
>    We (the GPU people, aka the good guys ;)) do not want to do peer to peer
>    for performance reasons, ie we do not care about our transactions going
>    to the root complex and back down to the destination. At least in the
>    use case I am working on this is fine.

If the GPU people are the good guys, does that make the NVMe people the bad guys ;-)? If so, what are the RDMA people??? Again, good to know.
    
>    The reason is that GPUs are giving up on PCIe (see all the specialized
>    links like NVLink that are popping up in the GPU space). So for fast GPU
>    interconnect we have these new links.

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.
    
>    Also the IOMMU isolation does matter a lot to us. Think of someone using
>    this peer to peer to gain control of a server in the cloud.
    
I agree that IOMMU isolation is very desirable. Hence the desire to ensure we can keep the IOMMU on while doing p2pdma if at all possible whilst still delivering the desired performance to the user.

Stephen    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:41                                     ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Logan Gunthorpe, Alex Williamson,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

Hi Jerome

>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>    are the part likely to be use in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
    
>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>    for performance reasons ie we do not care having our transaction going
>    to the root complex and back down the destination. At least in use case
>    i am working on this is fine.

If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.
    
>    Reasons is that GPU are giving up on PCIe (see all specialize link like
>    NVlink that are popping up in GPU space). So for fast GPU inter-connect
>    we have this new links. 

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.
    
>    Also the IOMMU isolation do matter a lot to us. Think someone using this
>    peer to peer to gain control of a server in the cloud.
    
I agree that IOMMU isolation is very desirable. Hence the desire to ensure we can keep the IOMMU on while doing p2pdma if at all possible whilst still delivering the desired performance to the user.

Stephen    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:41                                     ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

Hi Jerome

>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>    are the part likely to be use in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
    
>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>    for performance reasons ie we do not care having our transaction going
>    to the root complex and back down the destination. At least in use case
>    i am working on this is fine.

If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.
    
>    Reasons is that GPU are giving up on PCIe (see all specialize link like
>    NVlink that are popping up in GPU space). So for fast GPU inter-connect
>    we have this new links. 

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.
    
>    Also the IOMMU isolation do matter a lot to us. Think someone using this
>    peer to peer to gain control of a server in the cloud.
    
I agree that IOMMU isolation is very desirable. Hence the desire to ensure we can keep the IOMMU on while doing p2pdma if at all possible whilst still delivering the desired performance to the user.

Stephen    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:41                                     ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Logan Gunthorpe, Alex Williamson,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

Hi Jerome

>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>    are the part likely to be use in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
    
>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>    for performance reasons ie we do not care having our transaction going
>    to the root complex and back down the destination. At least in use case
>    i am working on this is fine.

If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.
    
>    Reasons is that GPU are giving up on PCIe (see all specialize link like
>    NVlink that are popping up in GPU space). So for fast GPU inter-connect
>    we have this new links. 

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.
    
>    Also the IOMMU isolation do matter a lot to us. Think someone using this
>    peer to peer to gain control of a server in the cloud.
    
I agree that IOMMU isolation is very desirable. Hence the desire to ensure we can keep the IOMMU on while doing p2pdma if at all possible whilst still delivering the desired performance to the user.

Stephen    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:41                                     ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, Christoph Hellwig,
	linux-block, Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Dan Williams, Logan Gunthorpe, Christian König

Hi Jerome

>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>    are the part likely to be use in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
    
>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>    for performance reasons ie we do not care having our transaction going
>    to the root complex and back down the destination. At least in use case
>    i am working on this is fine.

If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.
    
>    Reasons is that GPU are giving up on PCIe (see all specialize link like
>    NVlink that are popping up in GPU space). So for fast GPU inter-connect
>    we have this new links. 

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.
    
>    Also the IOMMU isolation do matter a lot to us. Think someone using this
>    peer to peer to gain control of a server in the cloud.
    
I agree that IOMMU isolation is very desirable. Hence the desire to ensure we can keep the IOMMU on while doing p2pdma if at all possible whilst still delivering the desired performance to the user.

Stephen    

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:41                                     ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:41 UTC (permalink / raw)


Hi Jerome

>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>    are the part likely to be use in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
    
>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>    for performance reasons ie we do not care having our transaction going
>    to the root complex and back down the destination. At least in use case
>    i am working on this is fine.

If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.
    
>    Reasons is that GPU are giving up on PCIe (see all specialize link like
>    NVlink that are popping up in GPU space). So for fast GPU inter-connect
>    we have this new links. 

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.
    
>    Also the IOMMU isolation do matter a lot to us. Think someone using this
>    peer to peer to gain control of a server in the cloud.
    
I agree that IOMMU isolation is very desirable. Hence the desire to ensure we can keep the IOMMU on while doing p2pdma if at all possible whilst still delivering the desired performance to the user.

Stephen    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 14:59                                         ` Jerome Glisse
                                                             ` (3 preceding siblings ...)
  (?)
@ 2018-05-10 18:44                                           ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Christian König
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Hi Jerome

>    Hopes this helps understanding the big picture. I over simplify thing and
>    devils is in the details.
    
This was a great primer thanks for putting it together. An LWN.net article perhaps ;-)??

Stephen
    

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:44                                           ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Christian König
  Cc: Logan Gunthorpe, Alex Williamson, Bjorn Helgaas, linux-kernel,
	linux-pci, linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Benjamin Herrenschmidt

Hi Jerome

>    Hopes this helps understanding the big picture. I over simplify thing and
>    devils is in the details.
    
This was a great primer thanks for putting it together. An LWN.net article perhaps ;-)??

Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:44                                           ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Christian König
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Hi Jerome

>    Hopes this helps understanding the big picture. I over simplify thing and
>    devils is in the details.
    
This was a great primer thanks for putting it together. An LWN.net article perhaps ;-)??

Stephen

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:44                                           ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Christian König
  Cc: Logan Gunthorpe, Alex Williamson, Bjorn Helgaas, linux-kernel,
	linux-pci, linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Benjamin Herrenschmidt

Hi Jerome

>    Hopes this helps understanding the big picture. I over simplify thing and
>    devils is in the details.
    
This was a great primer thanks for putting it together. An LWN.net article perhaps ;-)??

Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:44                                           ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Christian König
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Dan Williams, Logan Gunthorpe, Christoph Hellwig

Hi Jerome

>    Hopes this helps understanding the big picture. I over simplify thing and
>    devils is in the details.
    
This was a great primer thanks for putting it together. An LWN.net article perhaps ;-)??

Stephen
    

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:44                                           ` Stephen  Bates
  0 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-10 18:44 UTC (permalink / raw)


Hi Jerome

>    Hopes this helps understanding the big picture. I over simplify thing and
>    devils is in the details.
    
This was a great primer thanks for putting it together. An LWN.net article perhaps ;-)??

Stephen
    

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 18:41                                     ` Stephen  Bates
  (?)
  (?)
@ 2018-05-10 18:59                                       ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-10 18:59 UTC (permalink / raw)
  To: Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König



On 10/05/18 12:41 PM, Stephen  Bates wrote:
> Hi Jerome
> 
>>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>>    are the part likely to be use in peer to peer.
> 
> OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
>     
>>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>>    for performance reasons ie we do not care having our transaction going
>>    to the root complex and back down the destination. At least in use case
>>    i am working on this is fine.
> 
> If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.

The NVMe people are the Nice Neighbors, the RDMA people are the
Righteous Romantics and the PCI people are the Pleasant Protagonists...

Obviously.

Logan
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:59                                       ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-10 18:59 UTC (permalink / raw)
  To: Stephen Bates, Jerome Glisse
  Cc: Christian König, Alex Williamson, Bjorn Helgaas,
	linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Benjamin Herrenschmidt



On 10/05/18 12:41 PM, Stephen  Bates wrote:
> Hi Jerome
> 
>>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>>    are the part likely to be use in peer to peer.
> 
> OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
>     
>>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>>    for performance reasons ie we do not care having our transaction going
>>    to the root complex and back down the destination. At least in use case
>>    i am working on this is fine.
> 
> If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.

The NVMe people are the Nice Neighbors, the RDMA people are the
Righteous Romantics and the PCI people are the Pleasant Protagonists...

Obviously.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:59                                       ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-10 18:59 UTC (permalink / raw)
  To: Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König



On 10/05/18 12:41 PM, Stephen  Bates wrote:
> Hi Jerome
> 
>>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>>    are the part likely to be use in peer to peer.
> 
> OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
>     
>>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>>    for performance reasons ie we do not care having our transaction going
>>    to the root complex and back down the destination. At least in use case
>>    i am working on this is fine.
> 
> If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.

The NVMe people are the Nice Neighbors, the RDMA people are the
Righteous Romantics and the PCI people are the Pleasant Protagonists...

Obviously.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 18:59                                       ` Logan Gunthorpe
  0 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-10 18:59 UTC (permalink / raw)




On 10/05/18 12:41 PM, Stephen  Bates wrote:
> Hi Jerome
> 
>>    Note on GPU we do would not rely on ATS for peer to peer. Some part
>>    of the GPU (DMA engines) do not necessarily support ATS. Yet those
>>    are the part likely to be use in peer to peer.
> 
> OK this is good to know. I agree the DMA engine is probably one of the GPU components most applicable to p2pdma.
>     
>>    We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>>    for performance reasons ie we do not care having our transaction going
>>    to the root complex and back down the destination. At least in use case
>>    i am working on this is fine.
> 
> If the GPU people are the good guys does that make the NVMe people the bad guys ;-). If so, what are the RDMA people??? Again good to know.

The NVMe people are the Nice Neighbors, the RDMA people are the
Righteous Romantics and the PCI people are the Pleasant Protagonists...

Obviously.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 18:41                                     ` Stephen  Bates
                                                         ` (2 preceding siblings ...)
  (?)
@ 2018-05-10 19:10                                       ` Alex Williamson
  -1 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-10 19:10 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Christoph Hellwig, linux-block,
	Jerome Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christian König

On Thu, 10 May 2018 18:41:09 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:    
> >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> >    we have this new links.   
> 
> I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).

No doubt, the marketing for it is quick to point out the mesh topology
of NVLink, but I haven't seen any technical documents that describe the
isolation capabilities or IOMMU interaction.  Whether this is included
or an afterthought, I have no idea.

> >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> >    peer to peer to gain control of a server in the cloud.  

From that perspective, do we have any idea what NVLink means for
topology and IOMMU provided isolation and translation?  I've seen a
device assignment user report that seems to suggest it might pretend to
be PCIe compatible, but the assigned GPU ultimately doesn't work
correctly in a VM, so perhaps the software compatibility is only so
deep. Thanks,

Alex
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 19:10                                       ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-10 19:10 UTC (permalink / raw)
  To: Stephen  Bates
  Cc: Jerome Glisse, Christian König, Logan Gunthorpe,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Thu, 10 May 2018 18:41:09 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:   =20
> >    Reasons is that GPU are giving up on PCIe (see all specialize link l=
ike
> >    NVlink that are popping up in GPU space). So for fast GPU inter-conn=
ect
> >    we have this new links.  =20
>=20
> I look forward to Nvidia open-licensing NVLink to anyone who wants to use=
 it ;-).

No doubt, the marketing for it is quick to point out the mesh topology
of NVLink, but I haven't seen any technical documents that describe the
isolation capabilities or IOMMU interaction.  Whether this is included
or an afterthought, I have no idea.

> >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> >    peer to peer to gain control of a server in the cloud.  

From that perspective, do we have any idea what NVLink means for
topology and IOMMU provided isolation and translation?  I've seen a
device assignment user report that seems to suggest it might pretend to
be PCIe compatible, but the assigned GPU ultimately doesn't work
correctly in a VM, so perhaps the software compatibility is only so
deep. Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 19:10                                       ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-10 19:10 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jerome Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

On Thu, 10 May 2018 18:41:09 +0000
"Stephen  Bates" <sbates-pv7U853sEMVWk0Htik3J/w@public.gmane.org> wrote:    
> >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> >    we have this new links.   
> 
> I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).

No doubt, the marketing for it is quick to point out the mesh topology
of NVLink, but I haven't seen any technical documents that describe the
isolation capabilities or IOMMU interaction.  Whether this is included
or an afterthought, I have no idea.

> >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> >    peer to peer to gain control of a server in the cloud.  

From that perspective, do we have any idea what NVLink means for
topology and IOMMU provided isolation and translation?  I've seen a
device assignment user report that seems to suggest it might pretend to
be PCIe compatible, but the assigned GPU ultimately doesn't work
correctly in a VM, so perhaps the software compatibility is only so
deep. Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 19:10                                       ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-10 19:10 UTC (permalink / raw)
  To: Stephen  Bates
  Cc: Jerome Glisse, Christian König, Logan Gunthorpe,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Thu, 10 May 2018 18:41:09 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:    
> >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> >    we have this new links.   
> 
> I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).

No doubt, the marketing for it is quick to point out the mesh topology
of NVLink, but I haven't seen any technical documents that describe the
isolation capabilities or IOMMU interaction.  Whether this is included
or an afterthought, I have no idea.

> >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> >    peer to peer to gain control of a server in the cloud.  

From that perspective, do we have any idea what NVLink means for
topology and IOMMU provided isolation and translation?  I've seen a
device assignment user report that seems to suggest it might pretend to
be PCIe compatible, but the assigned GPU ultimately doesn't work
correctly in a VM, so perhaps the software compatibility is only so
deep. Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 19:10                                       ` Alex Williamson
  0 siblings, 0 replies; 460+ messages in thread
From: Alex Williamson @ 2018-05-10 19:10 UTC (permalink / raw)


On Thu, 10 May 2018 18:41:09 +0000
"Stephen  Bates" <sbates at raithlin.com> wrote:    
> >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> >    we have this new links.   
> 
> I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).

No doubt, the marketing for it is quick to point out the mesh topology
of NVLink, but I haven't seen any technical documents that describe the
isolation capabilities or IOMMU interaction.  Whether this is included
or an afterthought, I have no idea.

> >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> >    peer to peer to gain control of a server in the cloud.  


From that perspective, do we have any idea what NVLink means for
topology and IOMMU provided isolation and translation?  I've seen a
device assignment user report that seems to suggest it might pretend to
be PCIe compatible, but the assigned GPU ultimately doesn't work
correctly in a VM, so perhaps the software compatibility is only so
deep. Thanks,

Alex

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 19:10                                       ` Alex Williamson
                                                           ` (2 preceding siblings ...)
  (?)
@ 2018-05-10 19:24                                         ` Jerome Glisse
  -1 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 19:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	Christoph Hellwig, linux-kernel, linux-nvme, linux-block,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christian König

On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:    
> > >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> > >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> > >    we have this new links.   
> > 
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).
> 
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction.  Whether this is included
> or an afterthought, I have no idea.

AFAIK there is no IOMMU on NVLink between devices; walking a page table
while sustaining 80GB/s or 160GB/s is hard to achieve :) I think the idea
behind those interconnects is that devices in the mesh are inherently
secure, i.e. each device is supposed to make sure that no one can abuse it.

GPUs, with their virtual address spaces and contextualized program
execution units, are supposed to be secure (a Spectre-like bug might be
lurking in there, but I doubt it).

So for those interconnects you program physical addresses directly into
the device page tables, and those physical addresses are untranslated from
the hardware's perspective.

Note that the kernel driver that does the actual GPU page table programming
can sanity-check the values it is setting, so checks can also happen at
setup time. But after that the assumption is that the hardware is secure
and no one can abuse it, AFAICT.
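
As a purely illustrative sketch (not code from any real GPU driver), such a
setup-time check could look roughly like the following; every name here is
invented for the example:

#include <linux/types.h>

/* Hypothetical allow-list of peer BAR windows a context may map. */
struct example_peer_window {
        phys_addr_t start;
        resource_size_t size;
};

/* Refuse to program any device page table entry whose physical address
 * range falls outside the allowed peer windows. */
static bool example_peer_addr_ok(const struct example_peer_window *win,
                                 int nr, phys_addr_t addr, size_t len)
{
        int i;

        for (i = 0; i < nr; i++) {
                if (addr >= win[i].start &&
                    addr + len <= win[i].start + win[i].size)
                        return true;
        }
        return false;
}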

> 
> > >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> > >    peer to peer to gain control of a server in the cloud.  
> 
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation?  I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,

Note that each GPU (in the configurations I am aware of) also has a PCIe
link to the CPU/main memory. So from that point of view they behave very
much like regular PCIe devices. It is just that each GPU in the mesh can
access the other GPUs' memory through the high-bandwidth interconnect.

I am not sure how much is public beyond that; I will ask NVIDIA to try to
have someone chime in on this thread and shed light on this, if possible.

Cheers,
Jérôme
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 19:24                                         ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 19:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Stephen Bates, Christian König, Logan Gunthorpe,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:    
> > >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> > >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> > >    we have this new links.   
> > 
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).
> 
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction.  Whether this is included
> or an afterthought, I have no idea.

AFAIK there is no IOMMU on NVLink between devices; walking a page table
while sustaining 80GB/s or 160GB/s is hard to achieve :) I think the idea
behind those interconnects is that devices in the mesh are inherently
secure, i.e. each device is supposed to make sure that no one can abuse it.

GPUs, with their virtual address spaces and contextualized program
execution units, are supposed to be secure (a Spectre-like bug might be
lurking in there, but I doubt it).

So for those interconnects you program physical addresses directly into
the device page tables, and those physical addresses are untranslated from
the hardware's perspective.

Note that the kernel driver that does the actual GPU page table programming
can sanity-check the values it is setting, so checks can also happen at
setup time. But after that the assumption is that the hardware is secure
and no one can abuse it, AFAICT.

> 
> > >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> > >    peer to peer to gain control of a server in the cloud.  
> 
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation?  I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,

Note that each GPU (in the configurations I am aware of) also has a PCIe
link to the CPU/main memory. So from that point of view they behave very
much like regular PCIe devices. It is just that each GPU in the mesh can
access the other GPUs' memory through the high-bandwidth interconnect.

I am not sure how much is public beyond that; I will ask NVIDIA to try to
have someone chime in on this thread and shed light on this, if possible.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 19:24                                         ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 19:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jason Gunthorpe,
	Bjorn Helgaas, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christian König

On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +0000
> "Stephen  Bates" <sbates-pv7U853sEMVWk0Htik3J/w@public.gmane.org> wrote:    
> > >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> > >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> > >    we have this new links.   
> > 
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).
> 
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction.  Whether this is included
> or an afterthought, I have no idea.

AFAIK there is no IOMMU on NVLink between devices; walking a page table
while sustaining 80GB/s or 160GB/s is hard to achieve :) I think the idea
behind those interconnects is that devices in the mesh are inherently
secure, i.e. each device is supposed to make sure that no one can abuse it.

GPUs, with their virtual address spaces and contextualized program
execution units, are supposed to be secure (a Spectre-like bug might be
lurking in there, but I doubt it).

So for those interconnects you program physical addresses directly into
the device page tables, and those physical addresses are untranslated from
the hardware's perspective.

Note that the kernel driver that does the actual GPU page table programming
can sanity-check the values it is setting, so checks can also happen at
setup time. But after that the assumption is that the hardware is secure
and no one can abuse it, AFAICT.

> 
> > >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> > >    peer to peer to gain control of a server in the cloud.  
> 
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation?  I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,

Note that each GPU (in the configurations I am aware of) also has a PCIe
link to the CPU/main memory. So from that point of view they behave very
much like regular PCIe devices. It is just that each GPU in the mesh can
access the other GPUs' memory through the high-bandwidth interconnect.

I am not sure how much is public beyond that; I will ask NVIDIA to try to
have someone chime in on this thread and shed light on this, if possible.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 19:24                                         ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 19:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Stephen Bates, Christian König, Logan Gunthorpe,
	Bjorn Helgaas, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Benjamin Herrenschmidt

On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +0000
> "Stephen  Bates" <sbates@raithlin.com> wrote:    
> > >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> > >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> > >    we have this new links.   
> > 
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).
> 
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction.  Whether this is included
> or an afterthought, I have no idea.

AFAIK there is no IOMMU on NVLink between devices; walking a page table
while sustaining 80GB/s or 160GB/s is hard to achieve :) I think the idea
behind those interconnects is that devices in the mesh are inherently
secure, i.e. each device is supposed to make sure that no one can abuse it.

GPUs, with their virtual address spaces and contextualized program
execution units, are supposed to be secure (a Spectre-like bug might be
lurking in there, but I doubt it).

So for those interconnects you program physical addresses directly into
the device page tables, and those physical addresses are untranslated from
the hardware's perspective.

Note that the kernel driver that does the actual GPU page table programming
can sanity-check the values it is setting, so checks can also happen at
setup time. But after that the assumption is that the hardware is secure
and no one can abuse it, AFAICT.

> 
> > >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> > >    peer to peer to gain control of a server in the cloud.  
> 
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation?  I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,

Note that each GPU (in the configurations I am aware of) also has a PCIe
link to the CPU/main memory. So from that point of view they behave very
much like regular PCIe devices. It is just that each GPU in the mesh can
access the other GPUs' memory through the high-bandwidth interconnect.

I am not sure how much is public beyond that; I will ask NVIDIA to try to
have someone chime in on this thread and shed light on this, if possible.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-10 19:24                                         ` Jerome Glisse
  0 siblings, 0 replies; 460+ messages in thread
From: Jerome Glisse @ 2018-05-10 19:24 UTC (permalink / raw)


On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +0000
> "Stephen  Bates" <sbates at raithlin.com> wrote:    
> > >    Reasons is that GPU are giving up on PCIe (see all specialize link like
> > >    NVlink that are popping up in GPU space). So for fast GPU inter-connect
> > >    we have this new links.   
> > 
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use it ;-).
> 
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction.  Whether this is included
> or an afterthought, I have no idea.

AFAIK there is no IOMMU on NVLink between devices; walking a page table
while sustaining 80GB/s or 160GB/s is hard to achieve :) I think the idea
behind those interconnects is that devices in the mesh are inherently
secure, i.e. each device is supposed to make sure that no one can abuse it.

GPUs, with their virtual address spaces and contextualized program
execution units, are supposed to be secure (a Spectre-like bug might be
lurking in there, but I doubt it).

So for those interconnects you program physical addresses directly into
the device page tables, and those physical addresses are untranslated from
the hardware's perspective.

Note that the kernel driver that does the actual GPU page table programming
can sanity-check the values it is setting, so checks can also happen at
setup time. But after that the assumption is that the hardware is secure
and no one can abuse it, AFAICT.

> 
> > >    Also the IOMMU isolation do matter a lot to us. Think someone using this
> > >    peer to peer to gain control of a server in the cloud.  
> 
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation?  I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,

Note that each GPU (in the configurations I am aware of) also has a PCIe
link to the CPU/main memory. So from that point of view they behave very
much like regular PCIe devices. It is just that each GPU in the mesh can
access the other GPUs' memory through the high-bandwidth interconnect.

I am not sure how much is public beyond that; I will ask NVIDIA to try to
have someone chime in on this thread and shed light on this, if possible.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-10 17:15                                       ` Logan Gunthorpe
  (?)
  (?)
@ 2018-05-11  8:52                                         ` Christian König
  -1 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-11  8:52 UTC (permalink / raw)
  To: Logan Gunthorpe, Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On 10.05.2018 at 19:15, Logan Gunthorpe wrote:
>
> On 10/05/18 11:11 AM, Stephen  Bates wrote:
>>> Not to me. In the p2pdma code we specifically program DMA engines with
>>> the PCI bus address.
>> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...

By disabling the ACS bits on the intermediate bridges you turn their 
address routing from IOVA addresses (which are to be resolved by the 
root complex) back to PCI bus addresses (which are resolved locally in 
the bridge).

This only works when the IOVA and the PCI bus addresses never overlap. 
I'm not sure how the IOVA allocation works but I don't think we 
guarantee that on Linux.
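
For illustration, a rough sketch of what carving the peer BARs out of the
IOVA space could look like if the IOVA allocator were driven directly. The
wrapper below is hypothetical; reserve_iova() and the pci_resource_*()
helpers are the existing kernel interfaces:

#include <linux/pci.h>
#include <linux/iova.h>

/* Hypothetical helper: mark every memory BAR of the peer device as
 * unavailable for IOVA allocation in this domain, so a DMA address can
 * never alias one of the peer's PCI bus addresses. */
static void example_reserve_peer_bars(struct iova_domain *iovad,
                                      struct pci_dev *peer)
{
        int bar;

        for (bar = 0; bar <= PCI_STD_RESOURCE_END; bar++) {
                resource_size_t start = pci_resource_start(peer, bar);
                resource_size_t end = pci_resource_end(peer, bar);

                if (!start || !(pci_resource_flags(peer, bar) & IORESOURCE_MEM))
                        continue;

                reserve_iova(iovad, iova_pfn(iovad, start),
                             iova_pfn(iovad, end));
        }
}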

>>
>>> So regardless of whether we are using the IOMMU or
>>> not, the packets will be forwarded directly to the peer. If the ACS
>>>   Redir bits are on they will be forced back to the RC by the switch and
>>>   the transaction will fail. If we clear the ACS bits, the TLPs will go
>>>   where we want and everything will work (but we lose the isolation of ACS).
>> Agreed.

If we really want to enable P2P without ATS while the IOMMU is enabled, I
think we should probably approach it like this:

a) Make double sure that IOVA in an IOMMU group never overlap with PCI 
BARs in that group.

b) Add configuration options to put a whole PCI branch of devices (e.g. 
a bridge) into a single IOMMU group.

c) Add a configuration option to disable the ACS bit on bridges in the 
same IOMMU group.


I agree that we have a rather special case here, but I still find that 
approach rather brave and would vote for disabling P2P without ATS when 
IOMMU is enabled.


BTW: I can't say anything about other implementations, but at least for
the AMD-IOMMU the transaction won't fail when it is sent to the root
complex.

Instead the root complex would send it to the correct device. I already
tested that on an AMD Ryzen with the IOMMU enabled and P2P between two GPUs
(but it could be that this only works because of ATS).

Regards,
Christian.

>>>     For EPs that support ATS, we should (but don't necessarily have to)
>>>     program them with the IOVA address so they can go through the
>>>     translation process which will allow P2P without disabling the ACS Redir
>>>     bits -- provided the ACS direct translation bit is set. (And btw, if it
>>>     is, then we lose the benefit of ACS protecting against malicious EPs).
>>>     But, per above, the ATS transaction should involve only the IOVA address
>>>     so the ACS bits not being set should not break ATS.
>>      
>> Well we would still have to clear some ACS bits but now we can clear only for translated addresses.
> We don't have to clear the ACS Redir bits as we did in the first case.
> We just have to make sure the ACS Direct Translated bit is set.
>
> Logan

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-11  8:52                                         ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-11  8:52 UTC (permalink / raw)
  To: Logan Gunthorpe, Stephen Bates, Jerome Glisse
  Cc: Alex Williamson, Bjorn Helgaas, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Benjamin Herrenschmidt

On 10.05.2018 at 19:15, Logan Gunthorpe wrote:
>
> On 10/05/18 11:11 AM, Stephen  Bates wrote:
>>> Not to me. In the p2pdma code we specifically program DMA engines with
>>> the PCI bus address.
>> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...

By disabling the ACS bits on the intermediate bridges you turn their 
address routing from IOVA addresses (which are to be resolved by the 
root complex) back to PCI bus addresses (which are resolved locally in 
the bridge).

This only works when the IOVA and the PCI bus addresses never overlap. 
I'm not sure how the IOVA allocation works but I don't think we 
guarantee that on Linux.

>>
>>> So regardless of whether we are using the IOMMU or
>>> not, the packets will be forwarded directly to the peer. If the ACS
>>>   Redir bits are on they will be forced back to the RC by the switch and
>>>   the transaction will fail. If we clear the ACS bits, the TLPs will go
>>>   where we want and everything will work (but we lose the isolation of ACS).
>> Agreed.

If we really want to enable P2P without ATS while the IOMMU is enabled, I
think we should probably approach it like this:

a) Make double sure that IOVA in an IOMMU group never overlap with PCI 
BARs in that group.

b) Add configuration options to put a whole PCI branch of devices (e.g. 
a bridge) into a single IOMMU group.

c) Add a configuration option to disable the ACS bit on bridges in the 
same IOMMU group.


I agree that we have a rather special case here, but I still find that 
approach rather brave and would vote for disabling P2P without ATS when 
IOMMU is enabled.


BTW: I can't say anything about other implementations, but at least for
the AMD-IOMMU the transaction won't fail when it is sent to the root
complex.

Instead the root complex would send it to the correct device. I already
tested that on an AMD Ryzen with the IOMMU enabled and P2P between two GPUs
(but it could be that this only works because of ATS).

Regards,
Christian.

>>>     For EPs that support ATS, we should (but don't necessarily have to)
>>>     program them with the IOVA address so they can go through the
>>>     translation process which will allow P2P without disabling the ACS Redir
>>>     bits -- provided the ACS direct translation bit is set. (And btw, if it
>>>     is, then we lose the benefit of ACS protecting against malicious EPs).
>>>     But, per above, the ATS transaction should involve only the IOVA address
>>>     so the ACS bits not being set should not break ATS.
>>      
>> Well we would still have to clear some ACS bits but now we can clear only for translated addresses.
> We don't have to clear the ACS Redir bits as we did in the first case.
> We just have to make sure the ACS Direct Translated bit is set.
>
> Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-11  8:52                                         ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-11  8:52 UTC (permalink / raw)
  To: Logan Gunthorpe, Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On 10.05.2018 at 19:15, Logan Gunthorpe wrote:
>
> On 10/05/18 11:11 AM, Stephen  Bates wrote:
>>> Not to me. In the p2pdma code we specifically program DMA engines with
>>> the PCI bus address.
>> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...

By disabling the ACS bits on the intermediate bridges you turn their 
address routing from IOVA addresses (which are to be resolved by the 
root complex) back to PCI bus addresses (which are resolved locally in 
the bridge).

This only works when the IOVA and the PCI bus addresses never overlap. 
I'm not sure how the IOVA allocation works but I don't think we 
guarantee that on Linux.

>>
>>> So regardless of whether we are using the IOMMU or
>>> not, the packets will be forwarded directly to the peer. If the ACS
>>>   Redir bits are on they will be forced back to the RC by the switch and
>>>   the transaction will fail. If we clear the ACS bits, the TLPs will go
>>>   where we want and everything will work (but we lose the isolation of ACS).
>> Agreed.

If we really want to enable P2P without ATS while the IOMMU is enabled, I
think we should probably approach it like this:

a) Make double sure that IOVA in an IOMMU group never overlap with PCI 
BARs in that group.

b) Add configuration options to put a whole PCI branch of devices (e.g. 
a bridge) into a single IOMMU group.

c) Add a configuration option to disable the ACS bit on bridges in the 
same IOMMU group.


I agree that we have a rather special case here, but I still find that 
approach rather brave and would vote for disabling P2P without ATS when 
IOMMU is enabled.


BTW: I can't say anything about other implementations, but at least for 
the AMD-IOMMU the transaction won't fail when it is send to the root 
complex.

Instead the root complex would send it to the correct device. I already 
tested that on an AMD Ryzen with IOMMU enabled and P2P between two GPUs 
(but could be that this only works because of ATS).

Regards,
Christian.

>>>     For EPs that support ATS, we should (but don't necessarily have to)
>>>     program them with the IOVA address so they can go through the
>>>     translation process which will allow P2P without disabling the ACS Redir
>>>     bits -- provided the ACS direct translation bit is set. (And btw, if it
>>>     is, then we lose the benefit of ACS protecting against malicious EPs).
>>>     But, per above, the ATS transaction should involve only the IOVA address
>>>     so the ACS bits not being set should not break ATS.
>>      
>> Well we would still have to clear some ACS bits but now we can clear only for translated addresses.
> We don't have to clear the ACS Redir bits as we did in the first case.
> We just have to make sure the ACS Direct Translated bit is set.
>
> Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-05-11  8:52                                         ` Christian König
  0 siblings, 0 replies; 460+ messages in thread
From: Christian König @ 2018-05-11  8:52 UTC (permalink / raw)


Am 10.05.2018 um 19:15 schrieb Logan Gunthorpe:
>
> On 10/05/18 11:11 AM, Stephen  Bates wrote:
>>> Not to me. In the p2pdma code we specifically program DMA engines with
>>> the PCI bus address.
>> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA initiator with an IOVA but with the PCI bus address...

By disabling the ACS bits on the intermediate bridges you turn their 
address routing from IOVA addresses (which are to be resolved by the 
root complex) back to PCI bus addresses (which are resolved locally in 
the bridge).

This only works when the IOVA and the PCI bus addresses never overlap. 
I'm not sure how the IOVA allocation works but I don't think we 
guarantee that on Linux.

>>
>>> So regardless of whether we are using the IOMMU or
>>> not, the packets will be forwarded directly to the peer. If the ACS
>>>   Redir bits are on they will be forced back to the RC by the switch and
>>>   the transaction will fail. If we clear the ACS bits, the TLPs will go
>>>   where we want and everything will work (but we lose the isolation of ACS).
>> Agreed.

If we really want to enable P2P without ATS and IOMMU enabled I think we 
should probably approach it like this:

a) Make double sure that IOVA in an IOMMU group never overlap with PCI 
BARs in that group.

b) Add configuration options to put a whole PCI branch of devices (e.g. 
a bridge) into a single IOMMU group.

c) Add a configuration option to disable the ACS bit on bridges in the 
same IOMMU group.


I agree that we have a rather special case here, but I still find that 
approach rather brave and would vote for disabling P2P without ATS when 
IOMMU is enabled.


BTW: I can't say anything about other implementations, but at least for 
the AMD-IOMMU the transaction won't fail when it is send to the root 
complex.

Instead the root complex would send it to the correct device. I already 
tested that on an AMD Ryzen with IOMMU enabled and P2P between two GPUs 
(but could be that this only works because of ATS).

Regards,
Christian.

>>>     For EPs that support ATS, we should (but don't necessarily have to)
>>>     program them with the IOVA address so they can go through the
>>>     translation process which will allow P2P without disabling the ACS Redir
>>>     bits -- provided the ACS direct translation bit is set. (And btw, if it
>>>     is, then we lose the benefit of ACS protecting against malicious EPs).
>>>     But, per above, the ATS transaction should involve only the IOVA address
>>>     so the ACS bits not being set should not break ATS.
>>      
>> Well we would still have to clear some ACS bits but now we can clear only for translated addresses.
> We don't have to clear the ACS Redir bits as we did in the first case.
> We just have to make sure the ACS Direct Translated bit is set.
>
> Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-11  8:52                                         ` Christian König
  (?)
  (?)
@ 2018-05-11 15:48                                           ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-11 15:48 UTC (permalink / raw)
  To: Christian König, Stephen Bates, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On 5/11/2018 2:52 AM, Christian König wrote:
> This only works when the IOVA and the PCI bus addresses never overlap. 
> I'm not sure how the IOVA allocation works but I don't think we 
> guarantee that on Linux.

I find this hard to believe. There's always the possibility that some
part of the system doesn't support ACS, so if the PCI bus addresses and
IOVA overlap, there's a good chance that P2P and ATS won't work at all on
some hardware.


> If we really want to enable P2P without ATS and IOMMU enabled I think we 
> should probably approach it like this:
> 
> a) Make double sure that IOVA in an IOMMU group never overlap with PCI 
> BARs in that group.
> 
> b) Add configuration options to put a whole PCI branch of devices (e.g. 
> a bridge) into a single IOMMU group.
> 
> c) Add a configuration option to disable the ACS bit on bridges in the 
> same IOMMU group.

I think a configuration option to manage IOMMU groups as you suggest 
would be a very complex interface and difficult to implement. I prefer 
the option to disable the ACS bit on boot and let the existing code put 
the devices into their own IOMMU group (as it should already do to 
support hardware that doesn't have ACS support).

Logan
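
(A minimal sketch of the "existing code" route referred to here: once
ACS is disabled on the intermediate bridges, the devices fall into a
shared IOMMU group, and an orchestrator can sanity-check that with the
stock group API instead of any new configuration interface. Illustrative
only.)

#include <linux/device.h>
#include <linux/iommu.h>

static bool sketch_same_iommu_group(struct device *a, struct device *b)
{
        struct iommu_group *ga = iommu_group_get(a);
        struct iommu_group *gb = iommu_group_get(b);
        bool same = false;

        if (ga && gb)
                same = iommu_group_id(ga) == iommu_group_id(gb);

        if (ga)
                iommu_group_put(ga);
        if (gb)
                iommu_group_put(gb);
        return same;
}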

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-11 15:48                                           ` Logan Gunthorpe
                                                               ` (2 preceding siblings ...)
  (?)
@ 2018-05-11 21:50                                             ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-11 21:50 UTC (permalink / raw)
  To: Logan Gunthorpe, Christian König
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

>    I find this hard to believe. There's always the possibility that some 
>    part of the system doesn't support ACS so if the PCI bus addresses and 
>    IOVA overlap there's a good chance that P2P and ATS won't work at all on 
>    some hardware.

I tend to agree, but this comes down to how IOVA addresses are generated in the kernel. Alex (or anyone else) can you point to where IOVA addresses are generated? As Logan stated earlier, p2pdma bypasses this and programs the PCI bus address directly, but other I/O going to the same PCI EP may flow through the IOMMU and be programmed with an IOVA rather than a PCI bus address.
    
> I prefer 
>    the option to disable the ACS bit on boot and let the existing code put 
>    the devices into their own IOMMU group (as it should already do to 
>    support hardware that doesn't have ACS support).
    
+1

Stephen
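
(For reference, a rough sketch of the client-side split described
above. The helper names come from this patch series; the exact
signatures here are paraphrased and may not match the posted patches.
P2P pages keep the PCI bus address and bypass the IOMMU, while
everything else goes through the usual dma_map path and may be handed
an IOVA.)

#include <linux/dma-mapping.h>
#include <linux/pci-p2pdma.h>
#include <linux/scatterlist.h>

static int sketch_map_request(struct device *dev, struct scatterlist *sgl,
                              int nents, enum dma_data_direction dir)
{
        /* P2P memory: program the peer's PCI bus address directly */
        if (is_pci_p2pdma_page(sg_page(sgl)))
                return pci_p2pdma_map_sg(dev, sgl, nents, dir);

        /* Regular memory: normal DMA mapping, possibly an IOVA */
        return dma_map_sg(dev, sgl, nents, dir);
}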
    


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-11 21:50                                             ` Stephen  Bates
                                                                 ` (3 preceding siblings ...)
  (?)
@ 2018-05-11 22:24                                               ` Stephen  Bates
  -1 siblings, 0 replies; 460+ messages in thread
From: Stephen  Bates @ 2018-05-11 22:24 UTC (permalink / raw)
  To: Logan Gunthorpe, Christian König, Jerome Glisse
  Cc: Jens Axboe, linux-block, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Keith Busch, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

All

> Alex (or anyone else) can you point to where IOVA addresses are generated?

A case of RTFM perhaps (though a pointer to the code would still be appreciated).

https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt

Some exceptions to IOVA
-----------------------
Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
The same is true for peer to peer transactions. Hence we reserve the
address from PCI MMIO ranges so they are not allocated for IOVA addresses.

Cheers

Stephen
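
(A hedged sketch, assuming the reserved-region API from linux/iommu.h:
it walks the ranges the IOMMU driver has carved out of a device's IOVA
space. The MSI window mentioned above is reported this way; whether the
PCI MMIO windows also show up here depends on the IOMMU driver.)

#include <linux/device.h>
#include <linux/iommu.h>
#include <linux/list.h>

static void sketch_dump_resv_regions(struct device *dev)
{
        struct iommu_resv_region *region;
        LIST_HEAD(resv);

        iommu_get_resv_regions(dev, &resv);
        list_for_each_entry(region, &resv, list)
                dev_info(dev, "reserved IOVA range: %pa + %zx (type %d)\n",
                         &region->start, region->length, region->type);
        iommu_put_resv_regions(dev, &resv);
}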

^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-05-11 22:24                                               ` Stephen  Bates
  (?)
@ 2018-05-11 22:55                                                 ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-11 22:55 UTC (permalink / raw)
  To: Stephen Bates, Christian König, Jerome Glisse
  Cc: Jens Axboe, linux-block, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Keith Busch, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On 5/11/2018 4:24 PM, Stephen  Bates wrote:
> All
> 
>>   Alex (or anyone else) can you point to where IOVA addresses are generated?
> 
> A case of RTFM perhaps (though a pointer to the code would still be appreciated).
> 
> https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
> 
> Some exceptions to IOVA
> -----------------------
> Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
> The same is true for peer to peer transactions. Hence we reserve the
> address from PCI MMIO ranges so they are not allocated for IOVA addresses.

Hmm, except I'm not sure how to interpret that. It sounds like there
can't be an IOVA address that overlaps with the PCI MMIO range, which is
good and what I'd expect.

But for peer-to-peer they say they don't translate the address, which
implies to me that the intention is for a peer-to-peer address not to be
mapped in the same way using the dma_map interface (though of course if
you were using ATS you'd want this for sure). Unless the existing
dma_map calls notice a PCI MMIO address and handle it differently, but I
don't see how.

Logan
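
(Minor side note, as a minimal sketch with stock helpers from
linux/pci.h: what a peer has to be programmed with is the bus address
of the BAR, which need not equal the CPU physical address and which is
exactly what an IOVA mapping would otherwise stand in for.)

#include <linux/pci.h>

static void sketch_show_bar_addresses(struct pci_dev *pdev, int bar)
{
        resource_size_t cpu_addr = pci_resource_start(pdev, bar);
        dma_addr_t bus_addr = pci_bus_address(pdev, bar);

        /* On many x86 systems these match; on other platforms they may not */
        dev_info(&pdev->dev, "BAR%d: cpu %pa, bus %pad\n",
                 bar, &cpu_addr, &bus_addr);
}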


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation
@ 2018-05-22 21:24     ` Randy Dunlap
  0 siblings, 0 replies; 460+ messages in thread
From: Randy Dunlap @ 2018-05-22 21:24 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Sagi Grimberg, Christian König,
	Benjamin Herrenschmidt, Jonathan Corbet, Alex Williamson,
	Stephen Bates, Keith Busch, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Dan Williams,
	Christoph Hellwig

On 04/23/2018 04:30 PM, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> ---
>  Documentation/PCI/index.rst             |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 ++++++++++++++++++++++++++++++++
>  Documentation/index.rst                 |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst


> diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index 000000000000..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +============================
> +PCI Peer-to-Peer DMA Support
> +============================
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is
> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required
> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.
> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.
> +
> +
> +Driver Writer's Guide
> +=====================
> +
> +In a given P2P implementation there may be three or more different
> +types of kernel drivers in play:
> +
> +* Providers - A driver which provides or publishes P2P resources like

   * Provider -

> +  memory or doorbell registers to other drivers.
> +* Clients - A driver which makes use of a resource by setting up a

   * Client -

> +  DMA transaction to or from it.
> +* Orchestrators - A driver which orchestrates the flow of data between

   * Orchestrator -

> +  clients and providers
> +
> +In many cases there could be overlap between these three types (ie.

                                                                  (i.e.,

> +it may be typical for a driver to be both a provider and a client).
> +
> +For example, in the NVMe Target Copy Offload implementation:
> +
> +* The NVMe PCI driver is a client, provider and orchestrator
> +  in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
> +  resource (provider), it accepts P2P memory pages as buffers in requests
> +  to be used directly (client) and it can also make use of the CMB as
> +  submission queue entries.
> +* The RDMA driver is a client in this arrangement so that an RNIC
> +  can DMA directly to the memory exposed by the NVMe device.
> +* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
> +  to the P2P memory (CMB) and then to the NVMe device (and vice versa).
> +
> +This is currently the only arrangement supported by the kernel but
> +one could imagine slight tweaks to this that would allow for the same
> +functionality. For example, if a specific RNIC added a BAR with some
> +memory behind it, its driver could add support as a P2P provider and
> +then the NVMe Target could use the RNIC's memory instead of the CMB
> +in cases where the NVMe cards in use do not have CMB support.
> +
> +
> +Provider Drivers
> +----------------
> +
> +A provider simply needs to register a BAR (or a portion of a BAR)
> +as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
> +This will register struct pages for all the specified memory.
> +
> +After that it may optionally publish all of its resources as
> +P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
> +any orchestrator drivers to find and use the memory. When marked in
> +this way, the resource must be regular memory with no side effects.
> +
> +For the time being this is fairly rudimentary in that all resources
> +are typically going to be P2P memory. Future work will likely expand
> +this to include other types of resources like doorbells.
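[ As a rough illustration of the provider side (a sketch only -- the
  signatures follow the API described above, and MY_P2P_BAR plus the
  probe function are made-up placeholders): ]

    static int my_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
            int rc;

            /* Register (part of) a BAR as P2P DMA memory; a size of 0 is
             * assumed here to mean "the whole BAR". */
            rc = pci_p2pdma_add_resource(pdev, MY_P2P_BAR, 0, 0);
            if (rc)
                    return rc;

            /* Optionally publish it so orchestrators can discover it via
             * pci_p2pmem_find(). */
            pci_p2pmem_publish(pdev, true);

            return 0;
    }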
> +
> +
> +Client Drivers
> +--------------
> +
> +A client driver typically only has to conditionally change its DMA map
> +routine to use the mapping functions :c:func:`pci_p2pdma_map_sg()` and
> +:c:func:`pci_p2pdma_unmap_sg()` instead of the usual :c:func:`dma_map_sg()`
> +functions.
> +
> +The client may also, optionally, make use of
> +:c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping
> +functions and when to use the regular mapping functions. In some
> +situations, it may be more appropriate to use a flag to indicate a
> +given request is P2P memory and map appropriately (for example the
> +block layer uses a flag to keep P2P memory out of queues that do not
> +have P2P client support). It is important to ensure that struct pages that
> +back P2P memory stay out of code that does not have support for them.
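[ The conditional mapping described above might look roughly like this
  in a client's DMA-map path (a sketch; dev, sg and nents are
  placeholders and error handling is abbreviated): ]

    if (is_pci_p2pdma_page(sg_page(sg)))
            nents = pci_p2pdma_map_sg(dev, sg, nents, DMA_TO_DEVICE);
    else
            nents = dma_map_sg(dev, sg, nents, DMA_TO_DEVICE);
    if (!nents)
            return -EIO;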
> +
> +
> +Orchestrator Drivers
> +--------------------
> +
> +The first task an orchestrator driver must do is compile a list of
> +all client drivers that will be involved in a given transaction. For
> +example, the NVMe Target driver creates a list including all NVMe drives
                                                                     ^^^^^^
                                                                   or drivers ?
Could be either, I guess, but the previous sentence says "compile a list of drivers."

> +and the RNIC in use. The list is stored as an anonymous struct
> +list_head which must be initialized with the usual INIT_LIST_HEAD.
> +The functions :c:func:`pci_p2pdma_add_client()`,
> +:c:func:`pci_p2pdma_remove_client()` and
> +:c:func:`pci_p2pdma_client_list_free()` may then be used to add to,
> +remove from and free the list of clients.
> +
> +With the client list in hand, the orchestrator may then call
> +:c:func:`pci_p2pmem_find()` to obtain a published P2P memory provider
> +that is compatible with (i.e. behind the same root port as) all the clients. If more
> +than one provider is supported, the one nearest to all the clients will
> +be chosen first. If there are more than one provider is an equal distance

                       there is

> +away, the one returned will be chosen at random. This function returns the PCI
> +device to use for the provider with a reference taken; when the provider
> +is no longer needed, that reference should be dropped with pci_dev_put().
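[ Putting the client list and the provider lookup together, an
  orchestrator might do something roughly like this (a sketch against
  the API proposed in this series; nvme_dev and rnic_dev are
  placeholders): ]

    struct list_head clients;
    struct pci_dev *p2p_dev;

    INIT_LIST_HEAD(&clients);

    if (pci_p2pdma_add_client(&clients, &nvme_dev->dev) ||
        pci_p2pdma_add_client(&clients, &rnic_dev->dev))
            goto out_free;

    p2p_dev = pci_p2pmem_find(&clients);
    if (!p2p_dev)
            goto out_free;  /* no compatible provider behind this root port */

    /* ... set up transfers using p2p_dev ... */

    pci_dev_put(p2p_dev);   /* drop the reference taken by pci_p2pmem_find() */
out_free:
    pci_p2pdma_client_list_free(&clients);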
> +
> +Alternatively, if the orchestrator knows (via some other means)
> +which provider it wants to use, it may use :c:func:`pci_has_p2pmem()`
> +to determine if it has P2P memory and :c:func:`pci_p2pdma_distance()`
> +to determine the cumulative distance between it and a potential
> +list of clients.
> +
> +With a supported provider in hand, the driver can then call
> +:c:func:`pci_p2pdma_assign_provider()` to assign the provider
> +to the client list. This function returns false if any of the
> +clients are unsupported by the provider.
[I would say:]
           is unsupported

> +
> +Once a provider is assigned to a client list via either
> +:c:func:`pci_p2pmem_find()` or :c:func:`pci_p2pdma_assign_provider()`,
> +the list is permanently bound to the provider such that any new clients
> +added to the list must be supported by the already selected provider.
> +If they are not supported, :c:func:`pci_p2pdma_add_client()` will return
> +an error. In this way, orchestrators are free to add and remove devices
> +without having to recheck support or tear down existing transfers to
> +change P2P providers.
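[ For the alternative selection path mentioned a couple of paragraphs
  up, a very rough sketch -- the exact signatures and return-value
  conventions of pci_p2pdma_distance() and pci_p2pdma_assign_provider()
  here are assumptions based on the prose, not confirmed: ]

    /* 'provider' was chosen by some driver-specific means; assuming here
     * that a negative distance means the provider cannot reach all the
     * clients in the list. */
    if (!pci_has_p2pmem(provider) ||
        pci_p2pdma_distance(provider, &clients) < 0)
            return -ENODEV;

    if (!pci_p2pdma_assign_provider(provider, &clients))
            return -EINVAL;  /* at least one client is not supported */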
> +
> +Once a provider is selected, the orchestrator can then use
> +:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
> +allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
> +and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
> +allocating scatter-gather lists with P2P memory.
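[ A minimal allocation/free sequence against the selected provider
  might look like this (sketch only; the buffer sizes are arbitrary and
  error paths are abbreviated): ]

    void *buf;
    struct scatterlist *sgl;
    unsigned int nents;

    buf = pci_alloc_p2pmem(p2p_dev, PAGE_SIZE);
    if (!buf)
            return -ENOMEM;
    /* ... DMA to/from buf ... */
    pci_free_p2pmem(p2p_dev, buf, PAGE_SIZE);

    /* Or, for a scatter-gather list backed by P2P memory: */
    sgl = pci_p2pmem_alloc_sgl(p2p_dev, &nents, SZ_64K);
    if (!sgl)
            return -ENOMEM;
    /* ... */
    pci_p2pmem_free_sgl(p2p_dev, sgl);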
> +
> +Struct Page Caveats
> +-------------------
> +
> +Driver writers should be very careful about not passing these special
> +struct pages to code that isn't prepared for them. At this time, the kernel
> +interfaces do not have any checks for ensuring this. This obviously
> +precludes passing these pages to userspace.
> +
> +P2P memory is also technically IO memory but should never have any side
> +effects behind it. Thus, the order of loads and stores should not be important
> +and ioreadX(), iowriteX() and friends should not be necessary.
> +However, as the memory is not cache coherent, if access ever needs to
> +be protected by a spinlock then :c:func:`mmiowb()` must be used before
> +unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
> +Documentation/memory-barriers.txt)
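[ In code, the locking rule above amounts to something like the
  following (sketch; p2p_buf, value and my_lock are placeholders): ]

    spin_lock(&my_lock);
    *p2p_buf = value;   /* plain store into the P2P BAR memory */
    mmiowb();           /* order the store before the unlock is visible */
    spin_unlock(&my_lock);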
> +
> +
> +P2P DMA Support Library
> +=======================
> +
> +.. kernel-doc:: drivers/pci/p2pdma.c
> +   :export:


-- 
~Randy


^ permalink raw reply	[flat|nested] 460+ messages in thread

* Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation
  2018-05-22 21:24     ` Randy Dunlap
@ 2018-05-22 21:28       ` Logan Gunthorpe
  -1 siblings, 0 replies; 460+ messages in thread
From: Logan Gunthorpe @ 2018-05-22 21:28 UTC (permalink / raw)
  To: Randy Dunlap, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Christian König, Benjamin Herrenschmidt,
	Jonathan Corbet, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Thanks for the review Randy! I'll make the changes for the next time we
post the series.

On 22/05/18 03:24 PM, Randy Dunlap wrote:
>> +The first task an orchestrator driver must do is compile a list of
>> +all client drivers that will be involved in a given transaction. For
>> +example, the NVMe Target driver creates a list including all NVMe drives
>                                                                      ^^^^^^
>                                                                    or drivers ?
> Could be either, I guess, but the previous sentence says "compile a list of drivers."

I did mean "drives". But perhaps "devices" would be more clear. A list
of all NVMe drivers doesn't make much sense as I'm pretty sure there is
only one NVMe driver.

Logan

^ permalink raw reply	[flat|nested] 460+ messages in thread
