Subject: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
From: Logan Gunthorpe
Date: 2018-02-28 23:39 UTC
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Hi Everyone,

Here's v2 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc3, which already
includes Christoph's devpagemap work that the previous version was based
on, as well as a couple of the cleanup patches that were in v1.

Additionally, we've made the following changes based on feedback:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn, and
  applied a number of cleanups and spelling fixes he pointed out in the
  last series.

* To address Alex's ACS concerns, we changed to a simpler method that
  just disables ACS behind switches for any kernel that has
  CONFIG_PCI_P2PDMA enabled.

* We also reject devices that employ 'dma_virt_ops', which
  straightforwardly addresses Jason's concern that this work might break
  the HFI, QIB and rxe drivers, all of which use the virtual ops to
  implement their own special DMA operations.

Thanks,

Logan

--

This is a continuation of our work to enable using Peer-to-Peer PCI
memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
provided valuable feedback to get these patches to where they are today.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device, avoiding system memory entirely. The upsides of this are better
QoS for applications running on the CPU and using system memory, and
lower PCI bandwidth required to the CPU (such that systems could be
designed with fewer lanes connected to the CPU). At present, however,
the trade-off is a reduction in overall throughput, largely due to
hardware issues that should improve in the future.

Given these trade-offs, we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This means many setups that would likely
work well will not be supported, but it lets us be more confident the
feature will work without placing any responsibility on the user to
understand their topology. (We chose this route based on feedback we
received at the last LSF.) Future work may enable these transfers behind
a fabric of PCI switches or perhaps using a whitelist of known-good
root complexes.

In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
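
As a rough illustration (a sketch only; the function name and BAR number
are placeholders, but the two calls are the ones introduced in patch 1),
a driver that wants to donate one of its BARs would do something like:

static int donate_bar_as_p2pmem(struct pci_dev *pdev, int bar)
{
	int rc;

	/* create ZONE_DEVICE pages for the whole BAR (size == 0) */
	rc = pci_p2pdma_add_resource(pdev, bar, 0, 0);
	if (rc)
		return rc;

	/* make the memory discoverable via pci_p2pmem_find() */
	pci_p2pmem_publish(pdev, true);
	return 0;
}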

Another set of functions allows a client driver to create a list of
client devices that will be used in a given P2P transaction, and then
use that list to find any P2P memory that is supported by all of the
client devices. This list is then also used to selectively disable the
ACS bits for the downstream ports behind these devices.
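
A client such as the nvmet code would then build up its list and query
it, roughly like this (hedged sketch; 'nic_dev' and 'ns_dev' stand in
for the struct device pointers of the real clients):

LIST_HEAD(clients);
struct pci_dev *p2p_dev;

/* every device that will DMA to/from the buffers must be a client */
if (pci_p2pdma_add_client(&clients, nic_dev) ||
    pci_p2pdma_add_client(&clients, ns_dev))
	goto fallback;

/* look for published p2pmem behind the same switch as all clients */
p2p_dev = pci_p2pmem_find(&clients);
if (!p2p_dev)
	goto fallback;	/* use regular system memory instead */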

In the block layer, we also introduce a P2P request flag to indicate
that a given request targets P2P memory, as well as a flag for a request
queue to indicate that a given queue supports targeting P2P memory. P2P
requests will only be accepted by queues that support them. P2P requests
are also marked not to be merged, since a non-homogeneous request would
complicate the DMA mapping requirements.
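
In pseudo-form, the contract looks roughly like the following (the exact
identifiers live in patch 5; the flag names below are assumptions for
illustration only):

/* a driver whose queue can handle P2P requests advertises it */
queue_flag_set_unlocked(QUEUE_FLAG_PCI_P2PDMA, q);

/* submitters tag requests whose data resides in P2P memory */
req->cmd_flags |= REQ_PCI_P2PDMA;	/* never merged with others */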

In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received, the driver
uses the pci_p2pmem_map_sg() function, which applies the necessary
transformation to get the correct pci_bus_addr_t for the DMA
transactions.

In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not.

Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used for all I/O transfers. Once
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.
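
(With the usual nvmet configfs layout, this would be enabled with
something like 'echo 1 > /sys/kernel/config/nvmet/ports/<port>/allow_p2pmem';
the exact path depends on how the port was named, and is shown here only
as an illustration.)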

Logan Gunthorpe (10):
  PCI/P2PDMA: Support peer to peer memory
  PCI/P2PDMA: Add sysfs group to display p2pmem stats
  PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
  PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  block: Introduce PCI P2P flags for request and request queue
  IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
  nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  nvme-pci: Add support for P2P memory in requests
  nvme-pci: Add a quirk for a pseudo CMB
  nvmet: Optionally use PCI P2P memory

 Documentation/ABI/testing/sysfs-bus-pci |  25 ++
 block/blk-core.c                        |   3 +
 drivers/infiniband/core/rw.c            |  21 +-
 drivers/infiniband/ulp/isert/ib_isert.c |   5 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c   |   7 +-
 drivers/nvme/host/core.c                |   4 +
 drivers/nvme/host/nvme.h                |   8 +
 drivers/nvme/host/pci.c                 | 118 ++++--
 drivers/nvme/target/configfs.c          |  29 ++
 drivers/nvme/target/core.c              |  95 ++++-
 drivers/nvme/target/io-cmd.c            |   3 +
 drivers/nvme/target/nvmet.h             |  10 +
 drivers/nvme/target/rdma.c              |  43 +-
 drivers/pci/Kconfig                     |  20 +
 drivers/pci/Makefile                    |   1 +
 drivers/pci/p2pdma.c                    | 713 ++++++++++++++++++++++++++++++++
 drivers/pci/pci.c                       |   4 +
 include/linux/blk_types.h               |  18 +-
 include/linux/blkdev.h                  |   3 +
 include/linux/memremap.h                |  19 +
 include/linux/pci-p2pdma.h              | 105 +++++
 include/linux/pci.h                     |   4 +
 include/rdma/rw.h                       |   7 +-
 net/sunrpc/xprtrdma/svc_rdma_rw.c       |   6 +-
 24 files changed, 1204 insertions(+), 67 deletions(-)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

--
2.11.0

Subject: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
From: Logan Gunthorpe
Date: 2018-02-28 23:39 UTC
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Some PCI devices may have memory mapped in a BAR space that's
intended for use in Peer-to-Peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.

A kernel interface is provided so that other subsystems can find and
allocate chunks of P2P memory as necessary to facilitate transfers
between two PCI peers. Depending on hardware, this may reduce the
bandwidth of the transfer, but it can significantly reduce pressure
on system memory. This may be desirable in many cases: for example, a
system could be designed with a small CPU connected to a PCI switch by a
small number of lanes, which would maximize the number of lanes
available to connect NVMe devices.

The interface requires a user driver to collect a list of client devices
involved in the transaction with the pci_p2pdma_add_client*() functions
and then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done, the list is bound to the memory and the calling driver is
free to add and remove clients as necessary. The ACS bits on the
downstream switch port will be managed for all the registered clients.
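
Once a device has been found, the allocation helpers added below follow
the usual pattern (a minimal sketch; error handling mostly elided):

	void *buf;
	pci_bus_addr_t bus_addr;

	buf = pci_alloc_p2pmem(p2p_dev, SZ_4K);
	if (!buf)
		return -ENOMEM;

	/* peers must be programmed with the bus address, not a CPU address */
	bus_addr = pci_p2pmem_virt_to_bus(p2p_dev, buf);

	/* ... perform the peer-to-peer DMA using bus_addr ... */

	pci_free_p2pmem(p2p_dev, buf, SZ_4K);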

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI switch. This is because
using P2P transactions through the PCI root complex can have performance
limitations or, worse, might not work at all. Finding out how well a
particular RC supports P2P transfers is non-trivial. Additionally, the
benefit of P2P transfers that go through the RC is limited to
reducing DRAM usage.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig        |  16 ++
 drivers/pci/Makefile       |   1 +
 drivers/pci/p2pdma.c       | 568 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h   |  18 ++
 include/linux/pci-p2pdma.h |  87 +++++++
 include/linux/pci.h        |   4 +
 6 files changed, 694 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..840831418cbd 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,22 @@ config PCI_PASID
 
 	  If unsure, say N.
 
+config PCI_P2PDMA
+	bool "PCI Peer to Peer transfer support"
+	depends on ZONE_DEVICE
+	select GENERIC_ALLOCATOR
+	help
+	  Enables drivers to do PCI peer-to-peer transactions to and from
+	  BARs that are exposed in other devices that are part of
+	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+	  specification to work (i.e. anything below a single PCI bridge).
+
+	  Many PCIe root complexes do not support P2P transactions and
+	  it's hard to tell which support it with good performance, so
+	  at this time you will need a PCIe switch.
+
+	  If unsure, say N.
+
 config PCI_LABEL
 	def_bool y if (DMI || ACPI)
 	depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 941970936840..45e0ff6f3213 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_MSI) += msi.o
 
 obj-$(CONFIG_PCI_ATS) += ats.o
 obj-$(CONFIG_PCI_IOV) += iov.o
+obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
 
 #
 # ACPI Related PCI FW Functions
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..ec0a6cb9e500
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,568 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+
+struct pci_p2pdma {
+	struct percpu_ref devmap_ref;
+	struct completion devmap_ref_done;
+	struct gen_pool *pool;
+	bool published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+	struct pci_p2pdma *p2p =
+		container_of(ref, struct pci_p2pdma, devmap_ref);
+
+	complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+
+	if (percpu_ref_is_dying(ref))
+		return;
+
+	percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+	struct pci_dev *pdev = data;
+
+	wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+	gen_pool_destroy(pdev->p2pdma->pool);
+	pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+	int error = -ENOMEM;
+	struct pci_p2pdma *p2p;
+
+	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+	if (!p2p)
+		return -ENOMEM;
+
+	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2p->pool)
+		goto out;
+
+	init_completion(&p2p->devmap_ref_done);
+	error = percpu_ref_init(&p2p->devmap_ref,
+			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+	if (error)
+		goto out_pool_destroy;
+
+	percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (error)
+		goto out_pool_destroy;
+
+	pdev->p2pdma = p2p;
+
+	return 0;
+
+out_pool_destroy:
+	gen_pool_destroy(p2p->pool);
+out:
+	devm_kfree(&pdev->dev, p2p);
+	return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any dma request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+			    u64 offset)
+{
+	struct dev_pagemap *pgmap;
+	void *addr;
+	int error;
+
+	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	if (offset >= pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!size)
+		size = pci_resource_len(pdev, bar) - offset;
+
+	if (size + offset > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!pdev->p2pdma) {
+		error = pci_p2pdma_setup(pdev);
+		if (error)
+			return error;
+	}
+
+	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+	if (!pgmap)
+		return -ENOMEM;
+
+	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+	pgmap->res.end = pgmap->res.start + size - 1;
+	pgmap->res.flags = pci_resource_flags(pdev, bar);
+	pgmap->ref = &pdev->p2pdma->devmap_ref;
+	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+	addr = devm_memremap_pages(&pdev->dev, pgmap);
+	if (IS_ERR(addr))
+		return PTR_ERR(addr);
+
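+	/* the pool stores the PCI bus address as its "phys" address */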
+	error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
+			pci_bus_address(pdev, bar) + offset,
+			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+	if (error)
+		return error;
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+					  &pdev->p2pdma->devmap_ref);
+	if (error)
+		return error;
+
+	dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
+		 &pgmap->res);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+	struct device *parent;
+
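+	/* walk up the device hierarchy until we reach a PCI device */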
+	dev = get_device(dev);
+
+	while (dev) {
+		if (dev_is_pci(dev))
+			return to_pci_dev(dev);
+
+		parent = get_device(dev->parent);
+		put_device(dev);
+		dev = parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * one to get the upstream port of the device itself and a second to get
+ * the upstream port of the next level up in the hierarchy. Because of
+ * this, devices connected directly to a root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+	struct pci_dev *up1, *up2;
+
+	if (!pdev)
+		return NULL;
+
+	up1 = pci_dev_get(pci_upstream_bridge(pdev));
+	if (!up1)
+		return NULL;
+
+	up2 = pci_dev_get(pci_upstream_bridge(up1));
+	pci_dev_put(up1);
+
+	return up2;
+}
+
+static bool __upstream_bridges_match(struct pci_dev *upstream,
+				     struct pci_dev *client)
+{
+	struct pci_dev *dma_up;
+	bool ret = true;
+
+	dma_up = get_upstream_bridge_port(client);
+
+	if (!dma_up) {
+		dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
+		ret = false;
+		goto out;
+	}
+
+	if (upstream != dma_up) {
+		dev_dbg(&client->dev,
+			"does not reside on the same upstream bridge\n");
+		ret = false;
+		goto out;
+	}
+
+out:
+	pci_dev_put(dma_up);
+	return ret;
+}
+
+static bool upstream_bridges_match(struct pci_dev *pdev,
+				   struct pci_dev *client)
+{
+	struct pci_dev *upstream;
+	bool ret;
+
+	upstream = get_upstream_bridge_port(pdev);
+	if (!upstream) {
+		dev_warn(&pdev->dev, "not behind a PCI bridge\n");
+		return false;
+	}
+
+	ret = __upstream_bridges_match(upstream, client);
+
+	pci_dev_put(upstream);
+
+	return ret;
+}
+
+struct pci_p2pdma_client {
+	struct list_head list;
+	struct pci_dev *client;
+	struct pci_dev *p2pdma;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *item, *new_item;
+	struct pci_dev *p2pdma = NULL;
+	struct pci_dev *client;
+	int ret;
+
+	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+		dev_warn(dev,
+			 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+		return -ENODEV;
+	}
+
+	client = find_parent_pci_dev(dev);
+	if (!client) {
+		dev_warn(dev,
+			 "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+		return -ENODEV;
+	}
+
+	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+	if (item && item->p2pdma) {
+		p2pdma = item->p2pdma;
+
+		if (!upstream_bridges_match(p2pdma, client)) {
+			ret = -EXDEV;
+			goto put_client;
+		}
+	}
+
+	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+	if (!new_item) {
+		ret = -ENOMEM;
+		goto put_client;
+	}
+
+	new_item->client = client;
+	new_item->p2pdma = pci_dev_get(p2pdma);
+
+	list_add_tail(&new_item->list, head);
+
+	return 0;
+
+put_client:
+	pci_dev_put(client);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+	list_del(&item->list);
+	pci_dev_put(item->client);
+	pci_dev_put(item->p2pdma);
+	kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+	struct pci_dev *pdev;
+
+	pdev = find_parent_pci_dev(dev);
+	if (!pdev)
+		return;
+
+	list_for_each_entry_safe(pos, tmp, head, list) {
+		if (pos->client != pdev)
+			continue;
+
+		pci_p2pdma_client_free(pos);
+	}
+
+	pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, head, list)
+		pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+static bool upstream_bridges_match_list(struct pci_dev *pdev,
+					struct list_head *head)
+{
+	struct pci_p2pdma_client *pos;
+	struct pci_dev *upstream;
+	bool ret = true;
+
+	upstream = get_upstream_bridge_port(pdev);
+	if (!upstream) {
+		dev_warn(&pdev->dev, "not behind a PCI bridge\n");
+		return false;
+	}
+
+	list_for_each_entry(pos, head, list) {
+		ret = __upstream_bridges_match(upstream, pos->client);
+		if (!ret)
+			break;
+	}
+
+	pci_dev_put(upstream);
+	return ret;
+}
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ *	the specified list of clients
+ * @clients: list head of the client devices to check
+ *
+ * For now, we only support cases where the devices that will transfer to the
+ * p2pmem device are behind the same bridge. This cuts out cases that may work
+ * but is safest for the user.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	struct pci_dev *pdev = NULL;
+	struct pci_p2pdma_client *pos;
+
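+	/* consider every PCI device that has published p2pmem */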
+	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+		if (!pdev->p2pdma || !pdev->p2pdma->published)
+			continue;
+
+		if (!upstream_bridges_match_list(pdev, clients))
+			continue;
+
+		list_for_each_entry(pos, clients, list)
+			pos->p2pdma = pdev;
+
+		return pdev;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev:	the device to allocate memory from
+ * @size:	number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	void *ret;
+
+	if (unlikely(!pdev->p2pdma))
+		return NULL;
+
+	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+		return NULL;
+
+	ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+	if (unlikely(!ret))
+		percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - free peer-to-peer DMA memory
+ * @pdev:	the device the memory was allocated from
+ * @addr:	address of the memory that was allocated
+ * @size:	number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+	percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
+ *	address obtained with pci_alloc_p2pmem
+ * @pdev:	the device the memory was allocated from
+ * @addr:	address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+	if (!addr)
+		return 0;
+	if (!pdev->p2pdma)
+		return 0;
+
+	/*
+	 * Note: when we added the memory to the pool we used the PCI
+	 * bus address as the physical address. So gen_pool_virt_to_phys()
+	 * actually returns the bus address despite the misleading name.
+	 */
+	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev:	the device to allocate memory from
+ * @sgl:	the allocated scatterlist
+ * @nents:      the number of SG entries in the list
+ * @length:     number of bytes to allocate
+ *
+ * Returns 0 on success
+ */
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+			 unsigned int *nents, u32 length)
+{
+	struct scatterlist *sg;
+	void *addr;
+
+	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+	if (!sg)
+		return -ENOMEM;
+
+	sg_init_table(sg, 1);
+
+	addr = pci_alloc_p2pmem(pdev, length);
+	if (!addr)
+		goto out_free_sg;
+
+	sg_set_buf(sg, addr, length);
+	*sgl = sg;
+	*nents = 1;
+	return 0;
+
+out_free_sg:
+	kfree(sg);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
+ * @pdev:	the device the memory was allocated from
+ * @sgl:	the allocated scatterlist
+ * @nents:      the number of SG entries in the list
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+			 unsigned int nents)
+{
+	struct scatterlist *sg;
+	int count;
+
+	if (!sgl || !nents)
+		return;
+
+	for_each_sg(sgl, sg, nents, count)
+		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+	kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ *	other devices with pci_p2pmem_find
+ * @pdev:	the device with peer-to-peer DMA memory to publish
+ * @publish:	set to true to publish the memory, false to unpublish it
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+	if (!pdev->p2pdma)
+		return;
+
+	pdev->p2pdma->published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_PCI_P2PDMA,
 };
 
 /*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
 static inline bool is_device_private_page(const struct page *page)
 {
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..c0dde3d3aac4
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+		u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+		unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+		unsigned int nents);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+		size_t size, u64 offset)
+{
+	return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+		struct device *dev)
+{
+	return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+		struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+		size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+						    void *addr)
+{
+	return 0;
+}
+static inline int pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+		struct scatterlist **sgl, unsigned int *nents, u32 length)
+{
+	return -ENODEV;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+		struct scatterlist *sgl, unsigned int nents)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2PDMA_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 024a1beda008..437e42615896 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -276,6 +276,7 @@ struct pcie_link_state;
 struct pci_vpd;
 struct pci_sriov;
 struct pci_ats;
+struct pci_p2pdma;
 
 /* The pci_dev structure describes PCI devices */
 struct pci_dev {
@@ -429,6 +430,9 @@ struct pci_dev {
 #ifdef CONFIG_PCI_PASID
 	u16		pasid_features;
 #endif
+#ifdef CONFIG_PCI_P2PDMA
+	struct pci_p2pdma *p2pdma;
+#endif
 	phys_addr_t	rom;		/* Physical address if not from BAR */
 	size_t		romlen;		/* Length if not from BAR */
 	char		*driver_override; /* Driver name to force a match */
-- 
2.11.0

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 504+ messages in thread

* [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
@ 2018-02-28 23:39     ` Logan Gunthorpe
  0 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:39 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson, Logan Gunthorpe

Some PCI devices may have memory mapped in a BAR space that's
intended for use in Peer-to-Peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.

A kernel interface is provided so that other subsystems can find and
allocate chunks of P2P memory as necessary to facilitate transfers
between two PCI peers. Depending on hardware, this may reduce the
bandwidth of the transfer but would significantly reduce pressure
on system memory. This may be desirable in many cases: for example a
system could be designed with a small CPU connected to a PCI switch by a
small number of lanes which would maximize the number of lanes available
to connect to NVME devices.

The interface requires a user driver to collect a list of client devices
involved in the transaction with the pci_p2pmem_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary. The ACS bits on the
downstream switch port will be managed for all the registered clients.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI switch. This is because
using P2P transactions through the PCI root complex can have performance
limitations or, worse, might not work at all. Finding out how well a
particular RC supports P2P transfers is non-trivial. Additionally, the
benefits of P2P transfers that go through the RC is limited to only
reducing DRAM usage.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig        |  16 ++
 drivers/pci/Makefile       |   1 +
 drivers/pci/p2pdma.c       | 568 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h   |  18 ++
 include/linux/pci-p2pdma.h |  87 +++++++
 include/linux/pci.h        |   4 +
 6 files changed, 694 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..840831418cbd 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,22 @@ config PCI_PASID
 
 	  If unsure, say N.
 
+config PCI_P2PDMA
+	bool "PCI Peer to Peer transfer support"
+	depends on ZONE_DEVICE
+	select GENERIC_ALLOCATOR
+	help
+	  Enableѕ drivers to do PCI peer to peer transactions to and from
+	  BARs that are exposed in other devices that are the part of
+	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+	  specification to work (ie. anything below a single PCI bridge).
+
+	  Many PCIe root complexes do not support P2P transactions and
+	  it's hard to tell which support it with good performance, so
+	  at this time you will need a PCIe switch.
+
+	  If unsure, say N.
+
 config PCI_LABEL
 	def_bool y if (DMI || ACPI)
 	depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 941970936840..45e0ff6f3213 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_MSI) += msi.o
 
 obj-$(CONFIG_PCI_ATS) += ats.o
 obj-$(CONFIG_PCI_IOV) += iov.o
+obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
 
 #
 # ACPI Related PCI FW Functions
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..ec0a6cb9e500
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,568 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+
+struct pci_p2pdma {
+	struct percpu_ref devmap_ref;
+	struct completion devmap_ref_done;
+	struct gen_pool *pool;
+	bool published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+	struct pci_p2pdma *p2p =
+		container_of(ref, struct pci_p2pdma, devmap_ref);
+
+	complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+
+	if (percpu_ref_is_dying(ref))
+		return;
+
+	percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+	struct pci_dev *pdev = data;
+
+	wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+	gen_pool_destroy(pdev->p2pdma->pool);
+	pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+	int error = -ENOMEM;
+	struct pci_p2pdma *p2p;
+
+	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+	if (!p2p)
+		return -ENOMEM;
+
+	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2p->pool)
+		goto out;
+
+	init_completion(&p2p->devmap_ref_done);
+	error = percpu_ref_init(&p2p->devmap_ref,
+			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+	if (error)
+		goto out_pool_destroy;
+
+	percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (error)
+		goto out_pool_destroy;
+
+	pdev->p2pdma = p2p;
+
+	return 0;
+
+out_pool_destroy:
+	gen_pool_destroy(p2p->pool);
+out:
+	devm_kfree(&pdev->dev, p2p);
+	return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any DMA request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+			    u64 offset)
+{
+	struct dev_pagemap *pgmap;
+	void *addr;
+	int error;
+
+	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	if (offset >= pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!size)
+		size = pci_resource_len(pdev, bar) - offset;
+
+	if (size + offset > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!pdev->p2pdma) {
+		error = pci_p2pdma_setup(pdev);
+		if (error)
+			return error;
+	}
+
+	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+	if (!pgmap)
+		return -ENOMEM;
+
+	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+	pgmap->res.end = pgmap->res.start + size - 1;
+	pgmap->res.flags = pci_resource_flags(pdev, bar);
+	pgmap->ref = &pdev->p2pdma->devmap_ref;
+	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+	addr = devm_memremap_pages(&pdev->dev, pgmap);
+	if (IS_ERR(addr))
+		return PTR_ERR(addr);
+
+	error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
+			pci_bus_address(pdev, bar) + offset,
+			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+	if (error)
+		return error;
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+					  &pdev->p2pdma->devmap_ref);
+	if (error)
+		return error;
+
+	dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
+		 &pgmap->res);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+	struct device *parent;
+
+	dev = get_device(dev);
+
+	while (dev) {
+		if (dev_is_pci(dev))
+			return to_pci_dev(dev);
+
+		parent = get_device(dev->parent);
+		put_device(dev);
+		dev = parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * the first returns the downstream switch port above the device and the
+ * second returns the switch's own upstream port. Because of this, devices
+ * connected directly to a root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+	struct pci_dev *up1, *up2;
+
+	if (!pdev)
+		return NULL;
+
+	up1 = pci_dev_get(pci_upstream_bridge(pdev));
+	if (!up1)
+		return NULL;
+
+	up2 = pci_dev_get(pci_upstream_bridge(up1));
+	pci_dev_put(up1);
+
+	return up2;
+}
+
+static bool __upstream_bridges_match(struct pci_dev *upstream,
+				     struct pci_dev *client)
+{
+	struct pci_dev *dma_up;
+	bool ret = true;
+
+	dma_up = get_upstream_bridge_port(client);
+
+	if (!dma_up) {
+		dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
+		ret = false;
+		goto out;
+	}
+
+	if (upstream != dma_up) {
+		dev_dbg(&client->dev,
+			"does not reside on the same upstream bridge\n");
+		ret = false;
+		goto out;
+	}
+
+out:
+	pci_dev_put(dma_up);
+	return ret;
+}
+
+static bool upstream_bridges_match(struct pci_dev *pdev,
+				   struct pci_dev *client)
+{
+	struct pci_dev *upstream;
+	bool ret;
+
+	upstream = get_upstream_bridge_port(pdev);
+	if (!upstream) {
+		dev_warn(&pdev->dev, "not behind a PCI bridge\n");
+		return false;
+	}
+
+	ret = __upstream_bridges_match(upstream, client);
+
+	pci_dev_put(upstream);
+
+	return ret;
+}
+
+struct pci_p2pdma_client {
+	struct list_head list;
+	struct pci_dev *client;
+	struct pci_dev *p2pdma;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find()
+ * has been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *item, *new_item;
+	struct pci_dev *p2pdma = NULL;
+	struct pci_dev *client;
+	int ret;
+
+	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+		dev_warn(dev,
+			 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+		return -ENODEV;
+	}
+
+
+	client = find_parent_pci_dev(dev);
+	if (!client) {
+		dev_warn(dev,
+			 "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+		return -ENODEV;
+	}
+
+	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+	if (item && item->p2pdma) {
+		p2pdma = item->p2pdma;
+
+		if (!upstream_bridges_match(p2pdma, client)) {
+			ret = -EXDEV;
+			goto put_client;
+		}
+	}
+
+	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+	if (!new_item) {
+		ret = -ENOMEM;
+		goto put_client;
+	}
+
+	new_item->client = client;
+	new_item->p2pdma = pci_dev_get(p2pdma);
+
+	list_add_tail(&new_item->list, head);
+
+	return 0;
+
+put_client:
+	pci_dev_put(client);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+	list_del(&item->list);
+	pci_dev_put(item->client);
+	pci_dev_put(item->p2pdma);
+	kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+	struct pci_dev *pdev;
+
+	pdev = find_parent_pci_dev(dev);
+	if (!pdev)
+		return;
+
+	list_for_each_entry_safe(pos, tmp, head, list) {
+		if (pos->client != pdev)
+			continue;
+
+		pci_p2pdma_client_free(pos);
+	}
+
+	pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, head, list)
+		pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+static bool upstream_bridges_match_list(struct pci_dev *pdev,
+					struct list_head *head)
+{
+	struct pci_p2pdma_client *pos;
+	struct pci_dev *upstream;
+	bool ret = true;
+
+	upstream = get_upstream_bridge_port(pdev);
+	if (!upstream) {
+		dev_warn(&pdev->dev, "not behind a PCI bridge\n");
+		return false;
+	}
+
+	list_for_each_entry(pos, head, list) {
+		ret = __upstream_bridges_match(upstream, pos->client);
+		if (!ret)
+			break;
+	}
+
+	pci_dev_put(upstream);
+	return ret;
+}
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ *	the specified list of clients
+ * @clients: list of clients to check (built with pci_p2pdma_add_client())
+ *
+ * For now, we only support cases where all the devices that will transfer
+ * to the p2pmem device are behind the same bridge. This cuts out cases that
+ * may work but is the safest approach for the user.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	struct pci_dev *pdev = NULL;
+	struct pci_p2pdma_client *pos;
+
+	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+		if (!pdev->p2pdma || !pdev->p2pdma->published)
+			continue;
+
+		if (!upstream_bridges_match_list(pdev, clients))
+			continue;
+
+		list_for_each_entry(pos, clients, list)
+			pos->p2pdma = pdev;
+
+		return pdev;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev:	the device to allocate memory from
+ * @size:	number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	void *ret;
+
+	if (unlikely(!pdev->p2pdma))
+		return NULL;
+
+	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+		return NULL;
+
+	ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+	if (unlikely(!ret))
+		percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - free peer-to-peer DMA memory
+ * @pdev:	the device the memory was allocated from
+ * @addr:	address of the memory that was allocated
+ * @size:	number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+	percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
+ *	address obtained with pci_alloc_p2pmem
+ * @pdev:	the device the memory was allocated from
+ * @addr:	address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+	if (!addr)
+		return 0;
+	if (!pdev->p2pdma)
+		return 0;
+
+	/*
+	 * Note: when we added the memory to the pool we used the PCI
+	 * bus address as the physical address. So gen_pool_virt_to_phys()
+	 * actually returns the bus address despite the misleading name.
+	 */
+	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev:	the device to allocate memory from
+ * @sgl:	the allocated scatterlist
+ * @nents:      the number of SG entries in the list
+ * @length:     number of bytes to allocate
+ *
+ * Returns 0 on success
+ */
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+			 unsigned int *nents, u32 length)
+{
+	struct scatterlist *sg;
+	void *addr;
+
+	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+	if (!sg)
+		return -ENOMEM;
+
+	sg_init_table(sg, 1);
+
+	addr = pci_alloc_p2pmem(pdev, length);
+	if (!addr)
+		goto out_free_sg;
+
+	sg_set_buf(sg, addr, length);
+	*sgl = sg;
+	*nents = 1;
+	return 0;
+
+out_free_sg:
+	kfree(sg);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl
+ * @pdev:	the device the memory was allocated from
+ * @sgl:	the allocated scatterlist
+ * @nents:      the number of SG entries in the list
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+			 unsigned int nents)
+{
+	struct scatterlist *sg;
+	int count;
+
+	if (!sgl || !nents)
+		return;
+
+	for_each_sg(sgl, sg, nents, count)
+		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+	kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ *	other devices with pci_p2pmem_find
+ * @pdev:	the device with peer-to-peer DMA memory to publish
+ * @publish:	set to true to publish the memory, false to unpublish it
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+	if (!pdev->p2pdma)
+		return;
+
+	pdev->p2pdma->published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allowed to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_PCI_P2PDMA,
 };
 
 /*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
 static inline bool is_device_private_page(const struct page *page)
 {
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..c0dde3d3aac4
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+		u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+		unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+		unsigned int nents);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+		size_t size, u64 offset)
+{
+	return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+		struct device *dev)
+{
+	return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+		struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+		size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+						    void *addr)
+{
+	return 0;
+}
+static inline int pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+		struct scatterlist **sgl, unsigned int *nents, u32 length)
+{
+	return -ENODEV;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+		struct scatterlist *sgl, unsigned int nents)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2PDMA_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 024a1beda008..437e42615896 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -276,6 +276,7 @@ struct pcie_link_state;
 struct pci_vpd;
 struct pci_sriov;
 struct pci_ats;
+struct pci_p2pdma;
 
 /* The pci_dev structure describes PCI devices */
 struct pci_dev {
@@ -429,6 +430,9 @@ struct pci_dev {
 #ifdef CONFIG_PCI_PASID
 	u16		pasid_features;
 #endif
+#ifdef CONFIG_PCI_P2PDMA
+	struct pci_p2pdma *p2pdma;
+#endif
 	phys_addr_t	rom;		/* Physical address if not from BAR */
 	size_t		romlen;		/* Length if not from BAR */
 	char		*driver_override; /* Driver name to force a match */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 504+ messages in thread
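
To make the intended call flow concrete, here is a minimal sketch of how
the interface in this patch fits together. It is not part of the series;
the example_* functions, the choice of BAR 4 and the error codes are
illustrative assumptions only:

#include <linux/pci-p2pdma.h>

/* Provider side: register a BAR as P2P memory and publish it. */
static int example_provider_setup(struct pci_dev *pdev)
{
	int rc;

	/* Register the whole of BAR 4 (size == 0 means the full BAR) */
	rc = pci_p2pdma_add_resource(pdev, 4, 0, 0);
	if (rc)
		return rc;

	/* Make the memory discoverable through pci_p2pmem_find() */
	pci_p2pmem_publish(pdev, true);
	return 0;
}

/* Orchestrator side: find memory usable by both client devices. */
static int example_transfer(struct device *rdma_dev, struct device *nvme_dev,
			    size_t len)
{
	LIST_HEAD(clients);
	struct pci_dev *p2p_dev;
	pci_bus_addr_t bus_addr;
	void *buf;
	int rc = -ENODEV;

	if (pci_p2pdma_add_client(&clients, rdma_dev) ||
	    pci_p2pdma_add_client(&clients, nvme_dev))
		goto out;

	/* Succeeds only if every client shares the same upstream port */
	p2p_dev = pci_p2pmem_find(&clients);
	if (!p2p_dev)
		goto out;

	buf = pci_alloc_p2pmem(p2p_dev, len);
	if (buf) {
		/* DMA engines must be programmed with the bus address */
		bus_addr = pci_p2pmem_virt_to_bus(p2p_dev, buf);

		/* ... perform the transfer using bus_addr ... */

		pci_free_p2pmem(p2p_dev, buf, len);
		rc = 0;
	}

	pci_dev_put(p2p_dev);
out:
	pci_p2pdma_client_list_free(&clients);
	return rc;
}

pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl() wrap the same allocator
for callers that want a single-entry scatterlist rather than a raw buffer.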

* [PATCH v2 02/10] PCI/P2PDMA: Add sysfs group to display p2pmem stats
@ 2018-02-28 23:39   ` Logan Gunthorpe
  0 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:39 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Attributes display the total amount of P2P memory, the amount available
and whether it is published or not.
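
For example, on a device that has registered a 16 MiB BAR and not yet
allocated anything from it, one would expect to see something like the
following (the device address and values here are hypothetical):

  $ cat /sys/bus/pci/devices/0000:04:00.0/p2pmem/size
  16777216
  $ cat /sys/bus/pci/devices/0000:04:00.0/p2pmem/available
  16777216
  $ cat /sys/bus/pci/devices/0000:04:00.0/p2pmem/published
  1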

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 Documentation/ABI/testing/sysfs-bus-pci | 25 +++++++++++++++++
 drivers/pci/p2pdma.c                    | 50 +++++++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 44d4b2be92fd..f5656dae21be 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -323,3 +323,28 @@ Description:
 
 		This is similar to /sys/bus/pci/drivers_autoprobe, but
 		affects only the VFs associated with a specific PF.
+
+What:		/sys/bus/pci/devices/.../p2pmem/available
+Date:		November 2017
+Contact:	Logan Gunthorpe <logang@deltatee.com>
+Description:
+		If the device has any Peer-to-Peer memory registered, this
+		file contains the amount of memory that has not been
+		allocated (in decimal).
+
+What:		/sys/bus/pci/devices/.../p2pmem/size
+Date:		November 2017
+Contact:	Logan Gunthorpe <logang@deltatee.com>
+Description:
+		If the device has any Peer-to-Peer memory registered, this
+		file contains the total amount of memory that the device
+		provides (in decimal).
+
+What:		/sys/bus/pci/devices/.../p2pmem/published
+Date:		November 2017
+Contact:	Logan Gunthorpe <logang@deltatee.com>
+Description:
+		If the device has any Peer-to-Peer memory registered, this
+		file contains a '1' if the memory has been published for
+		use inside the kernel or a '0' if it is only intended
+		for use within the driver that published it.
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ec0a6cb9e500..a57df78f6a32 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -29,6 +29,53 @@ struct pci_p2pdma {
 	bool published;
 };
 
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	size_t size = 0;
+
+	if (pdev->p2pdma->pool)
+		size = gen_pool_size(pdev->p2pdma->pool);
+
+	return snprintf(buf, PAGE_SIZE, "%zu\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t available_show(struct device *dev, struct device_attribute *attr,
+			      char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	size_t avail = 0;
+
+	if (pdev->p2pdma->pool)
+		avail = gen_pool_avail(pdev->p2pdma->pool);
+
+	return snprintf(buf, PAGE_SIZE, "%zu\n", avail);
+}
+static DEVICE_ATTR_RO(available);
+
+static ssize_t published_show(struct device *dev, struct device_attribute *attr,
+			      char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", pdev->p2pdma->published);
+}
+static DEVICE_ATTR_RO(published);
+
+static struct attribute *p2pmem_attrs[] = {
+	&dev_attr_size.attr,
+	&dev_attr_available.attr,
+	&dev_attr_published.attr,
+	NULL,
+};
+
+static const struct attribute_group p2pmem_group = {
+	.attrs = p2pmem_attrs,
+	.name = "p2pmem",
+};
+
 static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
 {
 	struct pci_p2pdma *p2p =
@@ -55,6 +102,7 @@ static void pci_p2pdma_release(void *data)
 	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
 
 	gen_pool_destroy(pdev->p2pdma->pool);
+	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
 	pdev->p2pdma = NULL;
 }
 
@@ -83,6 +131,10 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 	if (error)
 		goto out_pool_destroy;
 
+	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+	if (error)
+		dev_warn(&pdev->dev, "failed to create p2pmem sysfs group\n");
+
 	pdev->p2pdma = p2p;
 
 	return 0;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 504+ messages in thread

* [PATCH v2 03/10] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
@ 2018-02-28 23:39     ` Logan Gunthorpe
  0 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:39 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson, Logan Gunthorpe

The DMA address used when mapping PCI P2P memory must be the PCI bus
address. Thus, introduce pci_p2pdma_[un]map_sg() to map the correct
addresses when using P2P memory.

For this, we assume that an SGL passed to these functions contains
either all P2P memory or no P2P memory.
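
For orientation: the bus offset stashed in the pagemap is just the
difference between pci_bus_address() and pci_resource_start() for the
BAR, which is zero on typical x86 systems where CPU physical and PCI
bus addresses coincide. A hedged sketch of how a client might drive
these helpers (dma_dev, sgl and nents are illustrative names, not part
of this series):

	int count;

	/* sgl is assumed to contain only p2pmem pages, per the rule above */
	count = pci_p2pdma_map_sg(dma_dev, sgl, nents, DMA_TO_DEVICE);
	if (!count)
		return -EIO;	/* e.g. the device uses dma_virt_ops */

	/* ... program the transfer via sg_dma_address()/sg_dma_len() ... */

	pci_p2pdma_unmap_sg(dma_dev, sgl, nents, DMA_TO_DEVICE);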

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c       | 51 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/memremap.h   |  1 +
 include/linux/pci-p2pdma.h | 13 ++++++++++++
 3 files changed, 65 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index a57df78f6a32..4e1c81f64b29 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -188,6 +188,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->res.flags = pci_resource_flags(pdev, bar);
 	pgmap->ref = &pdev->p2pdma->devmap_ref;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+	pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
+		pci_resource_start(pdev, bar);
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr))
@@ -616,3 +618,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 	pdev->p2pdma->published = publish;
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
+
+/**
+ * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
+ * @dev:        device doing the DMA request
+ * @sg:		scatter list to map
+ * @nents:	elements in the scatterlist
+ * @dir:        DMA direction
+ *
+ * Returns the number of SG entries mapped (zero for dma_virt_ops devices)
+ */
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+		      enum dma_data_direction dir)
+{
+	struct dev_pagemap *pgmap;
+	struct scatterlist *s;
+	phys_addr_t paddr;
+	int i;
+
+	/*
+	 * p2pdma mappings are not compatible with devices that use
+	 * dma_virt_ops.
+	 */
+	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops)
+		return 0;
+
+	for_each_sg(sg, s, nents, i) {
+		pgmap = sg_page(s)->pgmap;
+		paddr = sg_phys(s);
+
+		s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
+		sg_dma_len(s) = s->length;
+	}
+
+	return nents;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
+
+/**
+ * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA
+ * @dev:        device doing the DMA request
+ * @sg:		scatter list to unmap
+ * @nents:	elements in the scatterlist
+ * @dir:        DMA direction
+ */
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+			 enum dma_data_direction dir)
+{
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9e907c338a44..1660f64ce96f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -125,6 +125,7 @@ struct dev_pagemap {
 	struct device *dev;
 	void *data;
 	enum memory_type type;
+	u64 pci_p2pdma_bus_offset;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index c0dde3d3aac4..126eca697ab3 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -36,6 +36,10 @@ int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
 		unsigned int nents);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+		      enum dma_data_direction dir);
+void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
+			 enum dma_data_direction dir);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
@@ -83,5 +87,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
+static inline int pci_p2pdma_map_sg(struct device *dev,
+	struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+	return 0;
+}
+static inline void pci_p2pdma_unmap_sg(struct device *dev,
+	struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+}
 #endif /* CONFIG_PCI_P2PDMA */
 #endif /* _LINUX_PCI_P2P_H */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 504+ messages in thread

* [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-02-28 23:39 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-02-28 23:40   ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

For peer-to-peer transactions to work, the downstream ports in each
switch must not have the ACS P2P flags set. At this time there is no
way to dynamically change the flags and update the corresponding IOMMU
groups, so this is done at enumeration time, before the groups are
assigned.

This effectively means that if CONFIG_PCI_P2PDMA is selected then
all devices behind any switch will be in the same IOMMU group.
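
For a concrete sense of what the clearing does (the starting value
below is invented for the example): PCI_ACS_RR is bit 2 (0x0004) and
PCI_ACS_CR is bit 3 (0x0008) of the ACS control register, so:

	u16 ctrl = PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF; /* 0x001d */

	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);	/* clear 0x0004 | 0x0008 */

	/* ctrl is now 0x0011: Source Validation and Upstream Forwarding
	 * remain enabled; only the two P2P redirect bits are dropped.
	 */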

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig        |  4 ++++
 drivers/pci/p2pdma.c       | 44 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/pci.c          |  4 ++++
 include/linux/pci-p2pdma.h |  5 +++++
 4 files changed, 57 insertions(+)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 840831418cbd..a430672f0ad4 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -138,6 +138,10 @@ config PCI_P2PDMA
 	  it's hard to tell which support it with good performance, so
 	  at this time you will need a PCIe switch.
 
+	  Enabling this option will also disable ACS on all ports behind
+	  any PCIe switch. This effectively puts all devices behind any
+	  switch into the same IOMMU group.
+
 	  If unsure, say N.
 
 config PCI_LABEL
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4e1c81f64b29..61af07acd21a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -255,6 +255,50 @@ static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
 	return up2;
 }
 
+/**
+ * pci_p2pdma_disable_acs - disable ACS flags for ports in PCI
+ *	bridges/switches
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any downstream port in any switch in order for
+ * the TLPs to not be forwarded up to the RC which is not what we want
+ * for P2P.
+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any switch to be in the same IOMMU
+ * group. At this time there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
+ */
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+	struct pci_dev *up;
+	int pos;
+	u16 ctrl;
+
+	up = get_upstream_bridge_port(pdev);
+	if (!up)
+		return 0;
+	pci_dev_put(up);
+
+	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+	if (!pos)
+		return 0;
+
+	dev_info(&pdev->dev, "disabling ACS flags for peer-to-peer DMA\n");
+
+	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+
+	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
+
+	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
+
+	return 1;
+}
+
 static bool __upstream_bridges_match(struct pci_dev *upstream,
 				     struct pci_dev *client)
 {
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f6a4dd10d9b0..95ad3cf288c8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -16,6 +16,7 @@
 #include <linux/of.h>
 #include <linux/of_pci.h>
 #include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/pm.h>
 #include <linux/slab.h>
 #include <linux/module.h>
@@ -2826,6 +2827,9 @@ static void pci_std_enable_acs(struct pci_dev *dev)
  */
 void pci_enable_acs(struct pci_dev *dev)
 {
+	if (pci_p2pdma_disable_acs(dev))
+		return;
+
 	if (!pci_acs_enable)
 		return;
 
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 126eca697ab3..f537f521f60c 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -22,6 +22,7 @@ struct block_device;
 struct scatterlist;
 
 #ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_disable_acs(struct pci_dev *pdev);
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
 int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
@@ -41,6 +42,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
 			 enum dma_data_direction dir);
 #else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+	return 0;
+}
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
 {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 504+ messages in thread

* [PATCH v2 05/10] block: Introduce PCI P2P flags for request and request queue
  2018-02-28 23:39 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-02-28 23:40   ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

QUEUE_FLAG_PCI_P2PDMA is introduced to indicate that a driver's request
queue supports targeting P2P memory.

REQ_PCI_P2PDMA is introduced to indicate that a particular bio request
is directed to/from PCI P2P memory. A request with this flag is not
accepted unless the corresponding queue has the QUEUE_FLAG_PCI_P2PDMA
flag set.
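
A hedged sketch of how a driver would use the two flags (q and bio are
placeholder names; queue_flag_set_unlocked() is assumed to be the
variant appropriate to the driver's locking context):

	/* at queue setup time, advertise support for P2P requests: */
	queue_flag_set_unlocked(QUEUE_FLAG_PCI_P2PDMA, q);

	/* when submitting a bio whose pages come from p2pmem: */
	bio->bi_opf |= REQ_PCI_P2PDMA;
	submit_bio(bio);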

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-core.c          |  3 +++
 include/linux/blk_types.h | 18 +++++++++++++++++-
 include/linux/blkdev.h    |  3 +++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2d1a7bbe0634..36a6a304a559 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2183,6 +2183,9 @@ generic_make_request_checks(struct bio *bio)
 	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
 		goto not_supported;
 
+	if ((bio->bi_opf & REQ_PCI_P2PDMA) && !blk_queue_pci_p2pdma(q))
+		goto not_supported;
+
 	if (should_fail_bio(bio))
 		goto end_io;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index bf18b95ed92d..490122c85b3f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -274,6 +274,10 @@ enum req_flag_bits {
 	__REQ_BACKGROUND,	/* background IO */
 	__REQ_NOWAIT,           /* Don't wait if request will block */
 
+#ifdef CONFIG_PCI_P2PDMA
+	__REQ_PCI_P2PDMA,	/* request is to/from P2P memory */
+#endif
+
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
@@ -298,6 +302,18 @@ enum req_flag_bits {
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 
+#ifdef CONFIG_PCI_P2PDMA
+/*
+ * Currently SGLs do not support mixed P2P and regular memory so
+ * requests with P2P memory must not be merged.
+ */
+#define REQ_PCI_P2PDMA		(1ULL << __REQ_PCI_P2PDMA)
+#define REQ_IS_PCI_P2PDMA(req)	((req)->cmd_flags & REQ_PCI_P2PDMA)
+#else
+#define REQ_PCI_P2PDMA		0
+#define REQ_IS_PCI_P2PDMA(req)	0
+#endif /* CONFIG_PCI_P2PDMA */
+
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
@@ -306,7 +322,7 @@ enum req_flag_bits {
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
 #define REQ_NOMERGE_FLAGS \
-	(REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA)
+	(REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA | REQ_PCI_P2PDMA)
 
 #define bio_op(bio) \
 	((bio)->bi_opf & REQ_OP_MASK)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed63f3b69c12..0b4a386c73ea 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -698,6 +698,7 @@ struct request_queue {
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED    28	/* queue has been quiesced */
 #define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_PCI_P2PDMA  30	/* device supports pci p2p requests */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
@@ -793,6 +794,8 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_dax(q)	test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
 #define blk_queue_scsi_passthrough(q)	\
 	test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_pci_p2pdma(q)	\
+	test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 504+ messages in thread

* [PATCH v2 06/10] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
  2018-02-28 23:39 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-02-28 23:40   ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

In order to use PCI P2P memory, the pci_p2pdma_[un]map_sg() functions
must be called to map the correct DMA addresses. To do this, we add a
flags argument and the RDMA_RW_CTX_FLAG_PCI_P2PDMA flag. When the flag
is specified, the appropriate mapping functions are used.
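
A hedged sketch of a target-side caller using the new flag (the rsp/req
context names are illustrative only):

	ret = rdma_rw_ctx_init(&rsp->rw, qp, port_num, req->sg, req->sg_cnt,
			       0, remote_addr, rkey, DMA_FROM_DEVICE,
			       RDMA_RW_CTX_FLAG_PCI_P2PDMA);
	if (ret < 0)
		return ret;

	/* ... post the RDMA READ/WRITE and wait for completion ... */

	rdma_rw_ctx_destroy(&rsp->rw, qp, port_num, req->sg, req->sg_cnt,
			    DMA_FROM_DEVICE, RDMA_RW_CTX_FLAG_PCI_P2PDMA);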

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/infiniband/core/rw.c            | 21 +++++++++++++++++----
 drivers/infiniband/ulp/isert/ib_isert.c |  5 +++--
 drivers/infiniband/ulp/srpt/ib_srpt.c   |  7 ++++---
 drivers/nvme/target/rdma.c              |  6 +++---
 include/rdma/rw.h                       |  7 +++++--
 net/sunrpc/xprtrdma/svc_rdma_rw.c       |  6 +++---
 6 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..775a9f8b15a6 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
  */
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
 #include <rdma/mr_pool.h>
 #include <rdma/rw.h>
 
@@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
  * @remote_addr:remote address to read/write (relative to @rkey)
  * @rkey:	remote key to operate on
  * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags:      any of the RDMA_RW_CTX_FLAG_* flags
  *
  * Returns the number of WQEs that will be needed on the workqueue if
  * successful, or a negative error code.
  */
 int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
-		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+		unsigned int flags)
 {
 	struct ib_device *dev = qp->pd->device;
 	int ret;
 
-	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+	if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
+		ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
+	else
+		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
 	if (!ret)
 		return -ENOMEM;
 	sg_cnt = ret;
@@ -579,9 +586,11 @@ EXPORT_SYMBOL(rdma_rw_ctx_post);
  * @sg:		scatterlist that was used for the READ/WRITE
  * @sg_cnt:	number of entries in @sg
  * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags:      the same flags used to init the context
  */
 void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
-		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir)
+		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir,
+		unsigned int flags)
 {
 	int i;
 
@@ -602,7 +611,11 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		break;
 	}
 
-	ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+	if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
+		pci_p2pdma_unmap_sg(qp->pd->device->dma_device, sg,
+				    sg_cnt, dir);
+	else
+		ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index fff40b097947..933200645262 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1497,7 +1497,8 @@ isert_rdma_rw_ctx_destroy(struct isert_cmd *cmd, struct isert_conn *conn)
 				se_cmd->t_prot_nents, dir);
 	} else {
 		rdma_rw_ctx_destroy(&cmd->rw, conn->qp, conn->cm_id->port_num,
-				se_cmd->t_data_sg, se_cmd->t_data_nents, dir);
+				se_cmd->t_data_sg, se_cmd->t_data_nents,
+				dir, 0);
 	}
 
 	cmd->rw.nr_ops = 0;
@@ -2152,7 +2153,7 @@ isert_rdma_rw_ctx_post(struct isert_cmd *cmd, struct isert_conn *conn,
 	} else {
 		ret = rdma_rw_ctx_init(&cmd->rw, conn->qp, port_num,
 				se_cmd->t_data_sg, se_cmd->t_data_nents,
-				offset, addr, rkey, dir);
+				offset, addr, rkey, dir, 0);
 	}
 
 	if (ret < 0) {
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 0373b7c40902..5dcbe0a18db8 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -899,7 +899,8 @@ static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
 			goto unwind;
 
 		ret = rdma_rw_ctx_init(&ctx->rw, ch->qp, ch->sport->port,
-				ctx->sg, ctx->nents, 0, remote_addr, rkey, dir);
+				ctx->sg, ctx->nents, 0, remote_addr, rkey,
+				dir, 0);
 		if (ret < 0) {
 			target_free_sgl(ctx->sg, ctx->nents);
 			goto unwind;
@@ -928,7 +929,7 @@ static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
 		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
 
 		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
-				ctx->sg, ctx->nents, dir);
+				ctx->sg, ctx->nents, dir, 0);
 		target_free_sgl(ctx->sg, ctx->nents);
 	}
 	if (ioctx->rw_ctxs != &ioctx->s_rw_ctx)
@@ -946,7 +947,7 @@ static void srpt_free_rw_ctxs(struct srpt_rdma_ch *ch,
 		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
 
 		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
-				ctx->sg, ctx->nents, dir);
+				ctx->sg, ctx->nents, dir, 0);
 		target_free_sgl(ctx->sg, ctx->nents);
 	}
 
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 978e169c11bf..020354e11351 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -427,7 +427,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 	if (rsp->n_rdma) {
 		rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
 				queue->cm_id->port_num, rsp->req.sg,
-				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
 	}
 
 	if (rsp->req.sg != &rsp->cmd->inline_sg)
@@ -510,7 +510,7 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 	atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
 	rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
 			queue->cm_id->port_num, rsp->req.sg,
-			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
 	rsp->n_rdma = 0;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
@@ -579,7 +579,7 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 
 	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
 			rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
-			nvmet_data_dir(&rsp->req));
+			nvmet_data_dir(&rsp->req), 0);
 	if (ret < 0)
 		return NVME_SC_INTERNAL;
 	rsp->req.transfer_len += len;
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index a3cbbc7b6417..dbc342b53901 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -59,12 +59,15 @@ struct rdma_rw_ctx {
 	};
 };
 
+#define RDMA_RW_CTX_FLAG_PCI_P2PDMA  (1 << 0)
+
 int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
-		u64 remote_addr, u32 rkey, enum dma_data_direction dir);
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+		unsigned int flags);
 void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt,
-		enum dma_data_direction dir);
+		enum dma_data_direction dir, unsigned int flags);
 
 int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		u8 port_num, struct scatterlist *sg, u32 sg_cnt,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 12b9a7e0b6d2..4364a8e98470 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -140,7 +140,7 @@ static void svc_rdma_cc_release(struct svc_rdma_chunk_ctxt *cc,
 
 		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
 				    rdma->sc_port_num, ctxt->rw_sg_table.sgl,
-				    ctxt->rw_nents, dir);
+				    ctxt->rw_nents, dir, 0);
 		svc_rdma_put_rw_ctxt(rdma, ctxt);
 	}
 	svc_xprt_put(&rdma->sc_xprt);
@@ -433,7 +433,7 @@ svc_rdma_build_writes(struct svc_rdma_write_info *info,
 		ret = rdma_rw_ctx_init(&ctxt->rw_ctx, rdma->sc_qp,
 				       rdma->sc_port_num, ctxt->rw_sg_table.sgl,
 				       ctxt->rw_nents, 0, seg_offset,
-				       seg_handle, DMA_TO_DEVICE);
+				       seg_handle, DMA_TO_DEVICE, 0);
 		if (ret < 0)
 			goto out_initerr;
 
@@ -639,7 +639,7 @@ static int svc_rdma_build_read_segment(struct svc_rdma_read_info *info,
 	ret = rdma_rw_ctx_init(&ctxt->rw_ctx, cc->cc_rdma->sc_qp,
 			       cc->cc_rdma->sc_port_num,
 			       ctxt->rw_sg_table.sgl, ctxt->rw_nents,
-			       0, offset, rkey, DMA_FROM_DEVICE);
+			       0, offset, rkey, DMA_FROM_DEVICE, 0);
 	if (ret < 0)
 		goto out_initerr;
 
-- 
2.11.0
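
A minimal usage sketch for the new flags argument added by this patch;
the wrapper below and its sg_is_p2p parameter are illustrative, not
part of the patch. Only rdma_rw_ctx_init(), rdma_rw_ctx_destroy() and
RDMA_RW_CTX_FLAG_PCI_P2PDMA come from the change above.

#include <linux/dma-direction.h>
#include <linux/scatterlist.h>
#include <rdma/rw.h>

/*
 * Illustrative only: a ULP that knows its scatterlist targets PCI P2P
 * memory passes the new flag symmetrically to init and destroy, which
 * routes the mapping through pci_p2pdma_[un]map_sg() instead of
 * ib_dma_[un]map_sg().
 */
static int example_rdma_write(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
			      u8 port_num, struct scatterlist *sg,
			      u32 sg_cnt, u64 remote_addr, u32 rkey,
			      bool sg_is_p2p)
{
	unsigned int flags = sg_is_p2p ? RDMA_RW_CTX_FLAG_PCI_P2PDMA : 0;
	int ret;

	ret = rdma_rw_ctx_init(ctx, qp, port_num, sg, sg_cnt, 0,
			       remote_addr, rkey, DMA_TO_DEVICE, flags);
	if (ret < 0)
		return ret;

	/* ... post the work requests and wait for completion ... */

	/* Teardown must see the same flags so the right unmap runs. */
	rdma_rw_ctx_destroy(ctx, qp, port_num, sg, sg_cnt,
			    DMA_TO_DEVICE, flags);
	return 0;
}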

^ permalink raw reply related	[flat|nested] 504+ messages in thread

* [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-02-28 23:39 ` Logan Gunthorpe
@ 2018-02-28 23:40   ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Register the CMB buffer as p2pmem and use the appropriate allocation
functions to create and destroy the I/O submission queues.

If the CMB supports WDS and RDS, publish it for use as P2P memory
by other devices.
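
A condensed sketch of the allocation pattern the patch introduces,
mirroring nvme_alloc_sq_cmds() below: try the CMB's p2pmem first and
fall back to ordinary coherent memory. The wrapper itself is
illustrative; pci_alloc_p2pmem() and pci_p2pmem_virt_to_bus() come
from earlier patches in this series.

#include <linux/dma-mapping.h>
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/*
 * Illustrative helper: allocate a submission queue from the device's
 * registered p2pmem (the CMB), falling back to coherent DMA memory
 * when no p2pmem is available.
 */
static void *example_alloc_sq(struct pci_dev *pdev, size_t size,
			      dma_addr_t *dma_addr, bool *is_cmb)
{
	void *sq = pci_alloc_p2pmem(pdev, size);

	if (sq) {
		/*
		 * For CMB memory the queue's "DMA address" is the PCI
		 * bus address of the p2pmem allocation.
		 */
		*dma_addr = pci_p2pmem_virt_to_bus(pdev, sq);
		*is_cmb = true;
		return sq;
	}

	*is_cmb = false;
	return dma_alloc_coherent(&pdev->dev, size, dma_addr, GFP_KERNEL);
}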

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 73036d2fbbd5..56ca79be8476 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/sed-opal.h>
+#include <linux/pci-p2pdma.h>
 
 #include "nvme.h"
 
@@ -91,9 +92,8 @@ struct nvme_dev {
 	struct work_struct remove_work;
 	struct mutex shutdown_lock;
 	bool subsystem;
-	void __iomem *cmb;
-	pci_bus_addr_t cmb_bus_addr;
 	u64 cmb_size;
+	bool cmb_use_sqes;
 	u32 cmbsz;
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
@@ -148,7 +148,7 @@ struct nvme_queue {
 	struct nvme_dev *dev;
 	spinlock_t q_lock;
 	struct nvme_command *sq_cmds;
-	struct nvme_command __iomem *sq_cmds_io;
+	bool sq_cmds_is_io;
 	volatile struct nvme_completion *cqes;
 	struct blk_mq_tags **tags;
 	dma_addr_t sq_dma_addr;
@@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 {
 	u16 tail = nvmeq->sq_tail;
 
-	if (nvmeq->sq_cmds_io)
-		memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
-	else
-		memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
+	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
 
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
@@ -1286,9 +1283,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
 				(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
-	if (nvmeq->sq_cmds)
-		dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
-					nvmeq->sq_cmds, nvmeq->sq_dma_addr);
+
+	if (nvmeq->sq_cmds) {
+		if (nvmeq->sq_cmds_is_io)
+			pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
+					nvmeq->sq_cmds,
+					SQ_SIZE(nvmeq->q_depth));
+		else
+			dma_free_coherent(nvmeq->q_dmadev,
+					  SQ_SIZE(nvmeq->q_depth),
+					  nvmeq->sq_cmds,
+					  nvmeq->sq_dma_addr);
+	}
 }
 
 static void nvme_free_queues(struct nvme_dev *dev, int lowest)
@@ -1368,12 +1374,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
 static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
 				int qid, int depth)
 {
-	/* CMB SQEs will be mapped before creation */
-	if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
-		return 0;
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+
+	if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+		nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+		nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
+						nvmeq->sq_cmds);
+		nvmeq->sq_cmds_is_io = true;
+	}
+
+	if (!nvmeq->sq_cmds) {
+		nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+					&nvmeq->sq_dma_addr, GFP_KERNEL);
+		nvmeq->sq_cmds_is_io = false;
+	}
 
-	nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
-					    &nvmeq->sq_dma_addr, GFP_KERNEL);
 	if (!nvmeq->sq_cmds)
 		return -ENOMEM;
 	return 0;
@@ -1449,13 +1464,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
 
-	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
-		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
-						      dev->ctrl.page_size);
-		nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
-		nvmeq->sq_cmds_io = dev->cmb + offset;
-	}
-
 	nvmeq->cq_vector = qid - 1;
 	result = adapter_alloc_cq(dev, qid, nvmeq);
 	if (result < 0)
@@ -1685,9 +1693,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
 		return;
 	dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
 
-	if (!use_cmb_sqes)
-		return;
-
 	size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
 	offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
 	bar = NVME_CMB_BIR(dev->cmbloc);
@@ -1704,11 +1709,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
 	if (size > bar_size - offset)
 		size = bar_size - offset;
 
-	dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
-	if (!dev->cmb)
+	if (pci_p2pdma_add_resource(pdev, bar, size, offset))
 		return;
-	dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
+
 	dev->cmb_size = size;
+	dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
+
+	if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
+			(NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
+		pci_p2pmem_publish(pdev, true);
 
 	if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
 				    &dev_attr_cmb.attr, NULL))
@@ -1718,12 +1727,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
 
 static inline void nvme_release_cmb(struct nvme_dev *dev)
 {
-	if (dev->cmb) {
-		iounmap(dev->cmb);
-		dev->cmb = NULL;
+	if (dev->cmb_size) {
 		sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
 					     &dev_attr_cmb.attr, NULL);
-		dev->cmbsz = 0;
+		dev->cmb_size = 0;
 	}
 }
 
@@ -1918,13 +1925,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (nr_io_queues == 0)
 		return 0;
 
-	if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+	if (dev->cmb_use_sqes) {
 		result = nvme_cmb_qdepth(dev, nr_io_queues,
 				sizeof(struct nvme_command));
 		if (result > 0)
 			dev->q_depth = result;
 		else
-			nvme_release_cmb(dev);
+			dev->cmb_use_sqes = false;
 	}
 
 	do {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 504+ messages in thread

* [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests
  2018-02-28 23:39 ` Logan Gunthorpe
  (?)
  (?)
@ 2018-02-28 23:40   ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

For P2P requests we must use the pci_p2pmem_[un]map_sg() functions
instead of the dma_map_sg functions.

With that, we can then indicate PCI_P2P support in the request queue.
For this, we create an NVME_F_PCI_P2P flag which tells the core to
set QUEUE_FLAG_PCI_P2P in the request queue.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c |  4 ++++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 19 +++++++++++++++----
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0fe7ea35c221..4dd564f175fa 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2954,7 +2954,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	ns->queue = blk_mq_init_queue(ctrl->tagset);
 	if (IS_ERR(ns->queue))
 		goto out_free_ns;
+
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
+	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+		queue_flag_set_unlocked(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
+
 	ns->queue->queuedata = ns;
 	ns->ctrl = ctrl;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0521e4707d1c..e38beb1917e2 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -290,6 +290,7 @@ struct nvme_ctrl_ops {
 	unsigned int flags;
 #define NVME_F_FABRICS			(1 << 0)
 #define NVME_F_METADATA_SUPPORTED	(1 << 1)
+#define NVME_F_PCI_P2PDMA		(1 << 2)
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 56ca79be8476..b896a78225cb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -796,8 +796,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		goto out;
 
 	ret = BLK_STS_RESOURCE;
-	nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
-			DMA_ATTR_NO_WARN);
+
+	if (REQ_IS_PCI_P2PDMA(req))
+		nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
+					  dma_dir);
+	else
+		nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+					     dma_dir, DMA_ATTR_NO_WARN);
 	if (!nr_mapped)
 		goto out;
 
@@ -842,7 +847,12 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 			DMA_TO_DEVICE : DMA_FROM_DEVICE;
 
 	if (iod->nents) {
-		dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+		if (REQ_IS_PCI_P2PDMA(req))
+			pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
+					    dma_dir);
+		else
+			dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+
 		if (blk_integrity_rq(req)) {
 			if (req_op(req) == REQ_OP_READ)
 				nvme_dif_remap(req, nvme_dif_complete);
@@ -2422,7 +2432,8 @@ static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
-	.flags			= NVME_F_METADATA_SUPPORTED,
+	.flags			= NVME_F_METADATA_SUPPORTED |
+				  NVME_F_PCI_P2PDMA,
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_write32		= nvme_pci_reg_write32,
 	.reg_read64		= nvme_pci_reg_read64,
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 504+ messages in thread
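
A minimal sketch of the pattern this patch applies (illustrative only,
not part of the posted series): a driver picks the mapping routine per
request, since pages backed by a PCI BAR must be mapped to PCI bus
addresses rather than through the regular DMA path. The helper name is
hypothetical; it assumes the p2pdma API introduced earlier in this
series.

	/* Hypothetical helper; assumes pci_p2pdma_map_sg() and
	 * REQ_IS_PCI_P2PDMA() from this series. */
	static int example_map_rq(struct device *dev, struct request *req,
				  struct scatterlist *sg, int nents,
				  enum dma_data_direction dir)
	{
		if (REQ_IS_PCI_P2PDMA(req))
			/* P2P memory: map against the peer device's BAR */
			return pci_p2pdma_map_sg(dev, sg, nents, dir);

		/* ordinary system memory goes through the normal DMA API */
		return dma_map_sg_attrs(dev, sg, nents, dir,
					DMA_ATTR_NO_WARN);
	}

The unmap path is symmetric, as the nvme_unmap_data() hunk above shows:
the same flag selects pci_p2pdma_unmap_sg() or dma_unmap_sg().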

* [PATCH v2 09/10] nvme-pci: Add a quirk for a pseudo CMB
@ 2018-02-28 23:40   ` Logan Gunthorpe
  0 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Introduce a quirk to use CMB-like memory on older devices that have
an exposed BAR but do not advertise support for it via the CMBLOC and
CMBSZ registers.

We'd like to use some of these older cards to test P2P memory.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/nvme.h |  7 +++++++
 drivers/nvme/host/pci.c  | 24 ++++++++++++++++++++----
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e38beb1917e2..c84a1b9580c8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -83,6 +83,13 @@ enum nvme_quirks {
 	 * Supports the LighNVM command set if indicated in vs[1].
 	 */
 	NVME_QUIRK_LIGHTNVM			= (1 << 6),
+
+	/*
+	 * Pseudo CMB Support on BAR 4. For adapters like the Microsemi
+	 * NVRAM that have CMB-like memory on a BAR but do not set
+	 * CMBLOC or CMBSZ.
+	 */
+	NVME_QUIRK_PSEUDO_CMB_BAR4		= (1 << 7),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b896a78225cb..2ad20fb9f05d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1679,6 +1679,13 @@ static ssize_t nvme_cmb_show(struct device *dev,
 }
 static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);
 
+static u32 nvme_pseudo_cmbsz(struct pci_dev *pdev, int bar)
+{
+	return NVME_CMBSZ_WDS | NVME_CMBSZ_RDS |
+		(((ilog2(SZ_16M) - 12) / 4) << NVME_CMBSZ_SZU_SHIFT) |
+		((pci_resource_len(pdev, bar) / SZ_16M) << NVME_CMBSZ_SZ_SHIFT);
+}
+
 static u64 nvme_cmb_size_unit(struct nvme_dev *dev)
 {
 	u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK;
@@ -1698,10 +1705,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int bar;
 
-	dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
-	if (!dev->cmbsz)
-		return;
-	dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+	if (dev->ctrl.quirks & NVME_QUIRK_PSEUDO_CMB_BAR4) {
+		dev->cmbsz = nvme_pseudo_cmbsz(pdev, 4);
+		dev->cmbloc = 4;
+	} else {
+		dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
+		if (!dev->cmbsz)
+			return;
+		dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+	}
 
 	size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
 	offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
@@ -2715,6 +2727,10 @@ static const struct pci_device_id nvme_id_table[] = {
 		.driver_data = NVME_QUIRK_LIGHTNVM, },
 	{ PCI_DEVICE(0x1d1d, 0x2807),	/* CNEX WL */
 		.driver_data = NVME_QUIRK_LIGHTNVM, },
+	{ PCI_DEVICE(0x11f8, 0xf117),	/* Microsemi NVRAM adaptor */
+		.driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
+	{ PCI_DEVICE(0x1db1, 0x0002),	/* Everspin nvNitro adaptor */
+		.driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4,  },
 	{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 504+ messages in thread
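
As a quick sanity check of the synthetic register value built by
nvme_pseudo_cmbsz() above (a worked example, not part of the patch):
the SZU field encodes the size unit as 4KB * 16^SZU, so
(ilog2(SZ_16M) - 12) / 4 = (24 - 12) / 4 = 3 selects 16MB units. For a
hypothetical 512MB BAR, the SZ field becomes 512MB / 16MB = 32, and
nvme_cmb_size_unit() * nvme_cmb_size() in nvme_map_cmb() then recovers
32 * 16MB = 512MB. Setting WDS and RDS marks the region as supporting
both write and read data, which is also the condition under which the
CMB gets published as P2P memory.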

* [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
@ 2018-02-28 23:40   ` Logan Gunthorpe
  0 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-02-28 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Steve Wise, Alex Williamson,
	Keith Busch, Jérôme Glisse, Jason Gunthorpe,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only use p2p
memory if a p2p memory device can be found that is behind the same
switch as the RDMA port and all the block devices in use. If the user
enables it and no such device is found, the system silently falls
back to using regular memory.

If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device, falling back to system memory
should anything fail.

Ideally, we'd want to use an NVMe CMB buffer as p2p memory. This would
save an extra PCI transfer, as the NVMe card could take the data
directly out of its own memory. However, at this time, cards with CMB
buffers don't seem to be available.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
[hch: partial rewrite of the initial code]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/target/configfs.c | 29 +++++++++++++
 drivers/nvme/target/core.c     | 95 +++++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/target/io-cmd.c   |  3 ++
 drivers/nvme/target/nvmet.h    | 10 +++++
 drivers/nvme/target/rdma.c     | 41 ++++++++++++++----
 5 files changed, 169 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e6b2d2af81b6..f1ad9d32344d 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -867,12 +867,41 @@ static void nvmet_port_release(struct config_item *item)
 	kfree(port);
 }
 
+#ifdef CONFIG_PCI_P2PDMA
+static ssize_t nvmet_allow_p2pmem_show(struct config_item *item, char *page)
+{
+	return sprintf(page, "%d\n", to_nvmet_port(item)->allow_p2pmem);
+}
+
+static ssize_t nvmet_allow_p2pmem_store(struct config_item *item,
+					const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	bool allow;
+	int ret;
+
+	ret = strtobool(page, &allow);
+	if (ret)
+		return ret;
+
+	down_write(&nvmet_config_sem);
+	port->allow_p2pmem = allow;
+	up_write(&nvmet_config_sem);
+
+	return count;
+}
+CONFIGFS_ATTR(nvmet_, allow_p2pmem);
+#endif /* CONFIG_PCI_P2PDMA */
+
 static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_adrfam,
 	&nvmet_attr_addr_treq,
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
 	&nvmet_attr_addr_trtype,
+#ifdef CONFIG_PCI_P2PDMA
+	&nvmet_attr_allow_p2pmem,
+#endif
 	NULL,
 };
 
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 0bd737117a80..59847aec27db 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/random.h>
 #include <linux/rculist.h>
+#include <linux/pci-p2pdma.h>
 
 #include "nvmet.h"
 
@@ -271,6 +272,25 @@ void nvmet_put_namespace(struct nvmet_ns *ns)
 	percpu_ref_put(&ns->ref);
 }
 
+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+				   struct nvmet_ns *ns)
+{
+	int ret;
+
+	if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+		pr_err("peer-to-peer DMA is not supported by %s\n",
+		       ns->device_path);
+		return -EINVAL;
+	}
+
+	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+	if (ret)
+		pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+		       ns->device_path, ret);
+
+	return ret;
+}
+
 int nvmet_ns_enable(struct nvmet_ns *ns)
 {
 	struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
 	if (ret)
 		goto out_blkdev_put;
 
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		if (ctrl->p2p_dev) {
+			ret = nvmet_p2pdma_add_client(ctrl, ns);
+			if (ret)
+				goto out_remove_clients;
+		}
+	}
+
 	if (ns->nsid > subsys->max_nsid)
 		subsys->max_nsid = ns->nsid;
 
@@ -328,6 +356,9 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
 out_unlock:
 	mutex_unlock(&subsys->lock);
 	return ret;
+out_remove_clients:
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+		pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
 out_blkdev_put:
 	blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
 	ns->bdev = NULL;
@@ -363,8 +394,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
 	percpu_ref_exit(&ns->ref);
 
 	mutex_lock(&subsys->lock);
-	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
 		nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0);
+	}
 
 	if (ns->bdev)
 		blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ);
@@ -761,6 +794,63 @@ bool nvmet_host_allowed(struct nvmet_req *req, struct nvmet_subsys *subsys,
 		return __nvmet_host_allowed(subsys, hostnqn);
 }
 
+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
+ * I/O commands. This requires the PCI p2p device to be compatible with the
+ * backing device for every namespace on this controller.
+ */
+static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
+{
+	struct nvmet_ns *ns;
+	int ret;
+
+	if (!req->port->allow_p2pmem || !req->p2p_client)
+		return;
+
+	mutex_lock(&ctrl->subsys->lock);
+
+	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
+	if (ret) {
+		pr_err("failed adding peer-to-peer DMA client %s: %d\n",
+		       dev_name(req->p2p_client), ret);
+		goto free_devices;
+	}
+
+	list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
+		ret = nvmet_p2pdma_add_client(ctrl, ns);
+		if (ret)
+			goto free_devices;
+	}
+
+	ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
+	if (!ctrl->p2p_dev) {
+		pr_info("no supported peer-to-peer memory devices found\n");
+		goto free_devices;
+	}
+	mutex_unlock(&ctrl->subsys->lock);
+
+	pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
+	return;
+
+free_devices:
+	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+	mutex_unlock(&ctrl->subsys->lock);
+}
+
+static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
+{
+	if (!ctrl->p2p_dev)
+		return;
+
+	mutex_lock(&ctrl->subsys->lock);
+
+	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+	pci_dev_put(ctrl->p2p_dev);
+	ctrl->p2p_dev = NULL;
+
+	mutex_unlock(&ctrl->subsys->lock);
+}
+
 u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 		struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
 {
@@ -800,6 +890,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 
 	INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
 	INIT_LIST_HEAD(&ctrl->async_events);
+	INIT_LIST_HEAD(&ctrl->p2p_clients);
 
 	memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
 	memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@@ -855,6 +946,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 		ctrl->kato = DIV_ROUND_UP(kato, 1000);
 	}
 	nvmet_start_keep_alive_timer(ctrl);
+	nvmet_setup_p2pmem(ctrl, req);
 
 	mutex_lock(&subsys->lock);
 	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
@@ -891,6 +983,7 @@ static void nvmet_ctrl_free(struct kref *ref)
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fatal_err_work);
 
+	nvmet_release_p2pmem(ctrl);
 	ida_simple_remove(&cntlid_ida, ctrl->cntlid);
 
 	kfree(ctrl->sqs);
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index 28bbdff4a88b..a213f8fc3bf3 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
 		op = REQ_OP_READ;
 	}
 
+	if (is_pci_p2pdma_page(sg_page(req->sg)))
+		op_flags |= REQ_PCI_P2PDMA;
+
 	sector = le64_to_cpu(req->cmd->rw.slba);
 	sector <<= (req->ns->blksize_shift - 9);
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 417f6c0331cc..85a170914588 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
 	return container_of(to_config_group(item), struct nvmet_ns, group);
 }
 
+static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
+{
+	return disk_to_dev(ns->bdev->bd_disk);
+}
+
 struct nvmet_cq {
 	u16			qid;
 	u16			size;
@@ -98,6 +103,7 @@ struct nvmet_port {
 	struct list_head		referrals;
 	void				*priv;
 	bool				enabled;
+	bool				allow_p2pmem;
 };
 
 static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
@@ -131,6 +137,8 @@ struct nvmet_ctrl {
 	struct work_struct	fatal_err_work;
 
 	struct nvmet_fabrics_ops *ops;
+	struct pci_dev		*p2p_dev;
+	struct list_head	p2p_clients;
 
 	char			subsysnqn[NVMF_NQN_FIELD_LEN];
 	char			hostnqn[NVMF_NQN_FIELD_LEN];
@@ -232,6 +240,8 @@ struct nvmet_req {
 
 	void (*execute)(struct nvmet_req *req);
 	struct nvmet_fabrics_ops *ops;
+
+	struct device *p2p_client;
 };
 
 static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 020354e11351..7a1f09995ed5 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
 #include <linux/string.h>
 #include <linux/wait.h>
 #include <linux/inet.h>
+#include <linux/pci-p2pdma.h>
 #include <asm/unaligned.h>
 
 #include <rdma/ib_verbs.h>
@@ -68,6 +69,7 @@ struct nvmet_rdma_rsp {
 	u8			n_rdma;
 	u32			flags;
 	u32			invalidate_rkey;
+	struct pci_dev		*p2p_dev;
 
 	struct list_head	wait_list;
 	struct list_head	free_list;
@@ -426,12 +428,18 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 
 	if (rsp->n_rdma) {
 		rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
-				queue->cm_id->port_num, rsp->req.sg,
-				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
+			queue->cm_id->port_num, rsp->req.sg,
+			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req),
+			rsp->p2p_dev ? RDMA_RW_CTX_FLAG_PCI_P2PDMA : 0);
 	}
 
-	if (rsp->req.sg != &rsp->cmd->inline_sg)
-		sgl_free(rsp->req.sg);
+	if (rsp->req.sg != &rsp->cmd->inline_sg) {
+		if (rsp->p2p_dev)
+			pci_p2pmem_free_sgl(rsp->p2p_dev, rsp->req.sg,
+					    rsp->req.sg_cnt);
+		else
+			sgl_free(rsp->req.sg);
+	}
 
 	if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
 		nvmet_rdma_process_wr_wait_list(queue);
@@ -567,19 +575,34 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	u64 addr = le64_to_cpu(sgl->addr);
 	u32 len = get_unaligned_le24(sgl->length);
 	u32 key = get_unaligned_le32(sgl->key);
+	struct pci_dev *p2p_dev = NULL;
 	int ret;
 
 	/* no data command? */
 	if (!len)
 		return 0;
 
-	rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
-	if (!rsp->req.sg)
-		return NVME_SC_INTERNAL;
+	if (rsp->queue->nvme_sq.ctrl)
+		p2p_dev = rsp->queue->nvme_sq.ctrl->p2p_dev;
+
+	rsp->p2p_dev = NULL;
+	if (rsp->queue->nvme_sq.qid && p2p_dev) {
+		ret = pci_p2pmem_alloc_sgl(p2p_dev, &rsp->req.sg,
+					   &rsp->req.sg_cnt, len);
+		if (!ret)
+			rsp->p2p_dev = p2p_dev;
+	}
+
+	if (!rsp->p2p_dev) {
+		rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
+		if (!rsp->req.sg)
+			return NVME_SC_INTERNAL;
+	}
 
 	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
 			rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
-			nvmet_data_dir(&rsp->req), 0);
+			nvmet_data_dir(&rsp->req),
+			rsp->p2p_dev ? RDMA_RW_CTX_FLAG_PCI_P2PDMA : 0);
 	if (ret < 0)
 		return NVME_SC_INTERNAL;
 	rsp->req.transfer_len += len;
@@ -658,6 +681,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
 		cmd->send_sge.addr, cmd->send_sge.length,
 		DMA_TO_DEVICE);
 
+	cmd->req.p2p_client = &queue->dev->device->dev;
+
 	if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
 			&queue->nvme_sq, &nvmet_rdma_ops))
 		return;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 504+ messages in thread
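
For reference, once a port has been created, enabling the new
attribute from userspace is a single configfs write, e.g.
"echo 1 > /sys/kernel/config/nvmet/ports/1/allow_p2pmem" (the mount
point is the standard configfs location; the port number is
illustrative). If no suitable P2P memory device is found behind the
same switch when a controller connects, nvmet_setup_p2pmem() logs
"no supported peer-to-peer memory devices found" and the target
transparently falls back to regular memory, so the setting is safe to
enable on any topology.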

 	struct nvmet_fabrics_ops *ops;
+
+	struct device *p2p_client;
 };
 
 static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 020354e11351..7a1f09995ed5 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
 #include <linux/string.h>
 #include <linux/wait.h>
 #include <linux/inet.h>
+#include <linux/pci-p2pdma.h>
 #include <asm/unaligned.h>
 
 #include <rdma/ib_verbs.h>
@@ -68,6 +69,7 @@ struct nvmet_rdma_rsp {
 	u8			n_rdma;
 	u32			flags;
 	u32			invalidate_rkey;
+	struct pci_dev		*p2p_dev;
 
 	struct list_head	wait_list;
 	struct list_head	free_list;
@@ -426,12 +428,18 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 
 	if (rsp->n_rdma) {
 		rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
-				queue->cm_id->port_num, rsp->req.sg,
-				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
+			queue->cm_id->port_num, rsp->req.sg,
+			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req),
+			rsp->p2p_dev ? RDMA_RW_CTX_FLAG_PCI_P2PDMA : 0);
 	}
 
-	if (rsp->req.sg != &rsp->cmd->inline_sg)
-		sgl_free(rsp->req.sg);
+	if (rsp->req.sg != &rsp->cmd->inline_sg) {
+		if (rsp->p2p_dev)
+			pci_p2pmem_free_sgl(rsp->p2p_dev, rsp->req.sg,
+					    rsp->req.sg_cnt);
+		else
+			sgl_free(rsp->req.sg);
+	}
 
 	if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
 		nvmet_rdma_process_wr_wait_list(queue);
@@ -567,19 +575,34 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	u64 addr = le64_to_cpu(sgl->addr);
 	u32 len = get_unaligned_le24(sgl->length);
 	u32 key = get_unaligned_le32(sgl->key);
+	struct pci_dev *p2p_dev = NULL;
 	int ret;
 
 	/* no data command? */
 	if (!len)
 		return 0;
 
-	rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
-	if (!rsp->req.sg)
-		return NVME_SC_INTERNAL;
+	if (rsp->queue->nvme_sq.ctrl)
+		p2p_dev = rsp->queue->nvme_sq.ctrl->p2p_dev;
+
+	rsp->p2p_dev = NULL;
+	if (rsp->queue->nvme_sq.qid && p2p_dev) {
+		ret = pci_p2pmem_alloc_sgl(p2p_dev, &rsp->req.sg,
+					   &rsp->req.sg_cnt, len);
+		if (!ret)
+			rsp->p2p_dev = p2p_dev;
+	}
+
+	if (!rsp->p2p_dev) {
+		rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
+		if (!rsp->req.sg)
+			return NVME_SC_INTERNAL;
+	}
 
 	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
 			rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
-			nvmet_data_dir(&rsp->req), 0);
+			nvmet_data_dir(&rsp->req),
+			rsp->p2p_dev ? RDMA_RW_CTX_FLAG_PCI_P2PDMA : 0);
 	if (ret < 0)
 		return NVME_SC_INTERNAL;
 	rsp->req.transfer_len += len;
@@ -658,6 +681,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
 		cmd->send_sge.addr, cmd->send_sge.length,
 		DMA_TO_DEVICE);
 
+	cmd->req.p2p_client = &queue->dev->device->dev;
+
 	if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
 			&queue->nvme_sq, &nvmet_rdma_ops))
 		return;
-- 
2.11.0
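
The client-side pattern the target code above follows can be condensed
into a short sketch. This is illustrative only: it is written against
the helpers introduced earlier in this series, with signatures inferred
from the call sites in the diff, and the 'nic' and 'disk' device
pointers are placeholders (e.g. an RDMA NIC and a namespace's disk
device):

#include <linux/list.h>
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>
#include <linux/scatterlist.h>

static int p2p_sgl_example(struct device *nic, struct device *disk,
			   size_t len)
{
	LIST_HEAD(clients);
	struct pci_dev *p2p_dev;
	struct scatterlist *sg;
	int sg_cnt, ret;

	/* Register every device that will touch the buffer. */
	ret = pci_p2pdma_add_client(&clients, nic);
	if (ret)
		goto out_free_list;
	ret = pci_p2pdma_add_client(&clients, disk);
	if (ret)
		goto out_free_list;

	/*
	 * Find p2pmem compatible with all registered clients, i.e.
	 * behind the same switch. A NULL return is the cue to fall
	 * back to sgl_alloc() and system memory, as rdma.c does above.
	 */
	p2p_dev = pci_p2pmem_find(&clients);
	if (!p2p_dev) {
		ret = -ENODEV;
		goto out_free_list;
	}

	ret = pci_p2pmem_alloc_sgl(p2p_dev, &sg, &sg_cnt, len);
	if (ret)
		goto out_put_dev;

	/*
	 * ... map via rdma_rw_ctx_init(..., RDMA_RW_CTX_FLAG_PCI_P2PDMA)
	 * or submit a bio that will carry REQ_PCI_P2PDMA ...
	 */

	pci_p2pmem_free_sgl(p2p_dev, sg, sg_cnt);
out_put_dev:
	pci_dev_put(p2p_dev);
out_free_list:
	pci_p2pdma_client_list_free(&clients);
	return ret;
}

Note that pci_p2pmem_find() takes a reference on the returned device,
which is why both this sketch and nvmet_release_p2pmem() above pair it
with pci_dev_put().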

^ permalink raw reply related	[flat|nested] 504+ messages in thread
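
Exercising the new attribute from user space is a single configfs
write; a minimal sketch, assuming a port already configured under
/sys/kernel/config/nvmet/ports/1 (the port name is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Port "1" is an example; use any configured nvmet port. */
	const char *attr = "/sys/kernel/config/nvmet/ports/1/allow_p2pmem";
	int fd = open(attr, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* strtobool() in the store handler accepts "1"/"0", "y"/"n". */
	if (write(fd, "1", 1) != 1) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}

If no compatible p2pmem device is found once a controller connects, the
port logs a message and falls back to system memory, as implemented in
nvmet_setup_p2pmem() above.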

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-02-28 23:39 ` Logan Gunthorpe
@ 2018-03-01  3:54   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01  3:54 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Oliver OHalloran, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> Hi Everyone,


So Oliver (CC) was having issues getting any of that to work for us.

The problem is that according to him (I didn't double check the latest
patches) you effectively hotplug the PCIe memory into the system when
creating struct pages.

This cannot possibly work for us. First, we cannot map PCIe memory as
cacheable. (Note that doing so is a bad idea if you are behind a PLX
switch anyway, since you'd have to manage cache coherency in SW.)

Then our MMIO space is so far away from our memory space that there is
not enough vmemmap virtual space to be able to do that.

So this can only work across architectures by using something like HMM
to create special device struct pages.

Ben.
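
To put a rough number on the struct page cost being described here, a
toy calculation (the 16 GiB BAR and 64-byte struct page are
illustrative assumptions):

#include <stdio.h>

int main(void)
{
	unsigned long long bar_size = 16ULL << 30;    /* 16 GiB BAR   */
	unsigned long long pages    = bar_size >> 12; /* 4 KiB pages  */
	unsigned long long vmemmap  = pages * 64;     /* struct pages */

	/*
	 * Prints 256 MiB: the array that vmemmap must cover, at an
	 * offset proportional to the BAR's (very high) PFN, which is
	 * exactly what runs out of virtual space on these machines.
	 */
	printf("struct page array: %llu MiB\n", vmemmap >> 20);
	return 0;
}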


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-03-01  3:54   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01  3:54 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Alex Williamson,
	Oliver OHalloran

On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> Hi Everyone,


So Oliver (CC) was having issues getting any of that to work for us.

The problem is that acccording to him (I didn't double check the latest
patches) you effectively hotplug the PCIe memory into the system when
creating struct pages.

This cannot possibly work for us. First we cannot map PCIe memory as
cachable. (Note that doing so is a bad idea if you are behind a PLX
switch anyway since you'd ahve to manage cache coherency in SW).

Then our MMIO space is so far away from our memory space that there is
not enough vmemmap virtual space to be able to do that.

So this can only work accross achitectures by using something like HMM
to create special device struct page's.

Ben.


> Here's v2 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.16-rc3 which already
> includes Christoph's devpagemap work the previous version was based
> off as well as a couple of the cleanup patches that were in v1.
> 
> Additionally, we've made the following changes based on feedback:
> 
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
>   as a bunch of cleanup and spelling fixes he pointed out in the last
>   series.
> 
> * To address Alex's ACS concerns, we change to a simpler method of
>   just disabling ACS behind switches for any kernel that has
>   CONFIG_PCI_P2PDMA.
> 
> * We also reject using devices that employ 'dma_virt_ops' which should
>   fairly simply handle Jason's concerns that this work might break with
>   the HFI, QIB and rxe drivers that use the virtual ops to implement
>   their own special DMA operations.
> 
> Thanks,
> 
> Logan
> 
> --
> 
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> provided valuable feedback to get these patches to where they are today.
> 
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, presently, the
> trade-off is currently a reduction in overall throughput. (Largely due
> to hardware issues that would certainly improve in the future).
> 
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch. This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We chose to go this route based on feedback we
> received at the last LSF). Future work may enable these transfers behind
> a fabric of PCI switches or perhaps using a white list of known good
> root complexes.
> 
> In order to enable this functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
> 
> Another set of functions allow a client driver to create a list of
> client devices that will be used in a given P2P transactions and then
> use that list to find any P2P memory that is supported by all the
> client devices. This list is then also used to selectively disable the
> ACS bits for the downstream ports behind these devices.
> 
> In the block layer, we also introduce a P2P request flag to indicate a
> given request targets P2P memory as well as a flag for a request queue
> to indicate a given queue supports targeting P2P memory. P2P requests
> will only be accepted by queues that support it. Also, P2P requests
> are marked to not be merged seeing a non-homogenous request would
> complicate the DMA mapping requirements.
> 
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the corrent pci_bus_addr_t for the DMA transactions.
> 
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not.
> 
> Finally, in the NVMe fabrics target port we introduce a new
> configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> to find P2P memory supported by the RDMA NIC and all namespaces. If
> supported memory is found, it will be used in all IO transfers. And if
> a port is using P2P memory, adding new namespaces that are not supported
> by that memory will fail.
> 
> Logan Gunthorpe (10):
>   PCI/P2PDMA: Support peer to peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci |  25 ++
>  block/blk-core.c                        |   3 +
>  drivers/infiniband/core/rw.c            |  21 +-
>  drivers/infiniband/ulp/isert/ib_isert.c |   5 +-
>  drivers/infiniband/ulp/srpt/ib_srpt.c   |   7 +-
>  drivers/nvme/host/core.c                |   4 +
>  drivers/nvme/host/nvme.h                |   8 +
>  drivers/nvme/host/pci.c                 | 118 ++++--
>  drivers/nvme/target/configfs.c          |  29 ++
>  drivers/nvme/target/core.c              |  95 ++++-
>  drivers/nvme/target/io-cmd.c            |   3 +
>  drivers/nvme/target/nvmet.h             |  10 +
>  drivers/nvme/target/rdma.c              |  43 +-
>  drivers/pci/Kconfig                     |  20 +
>  drivers/pci/Makefile                    |   1 +
>  drivers/pci/p2pdma.c                    | 713 ++++++++++++++++++++++++++++++++
>  drivers/pci/pci.c                       |   4 +
>  include/linux/blk_types.h               |  18 +-
>  include/linux/blkdev.h                  |   3 +
>  include/linux/memremap.h                |  19 +
>  include/linux/pci-p2pdma.h              | 105 +++++
>  include/linux/pci.h                     |   4 +
>  include/rdma/rw.h                       |   7 +-
>  net/sunrpc/xprtrdma/svc_rdma_rw.c       |   6 +-
>  24 files changed, 1204 insertions(+), 67 deletions(-)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h
> 
> --
> 2.11.0

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-03-01  3:54   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01  3:54 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-block-u79uwXL29TY76Z2rM5mHXA
  Cc: Jens Axboe, Oliver OHalloran, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> Hi Everyone,


So Oliver (CC) was having issues getting any of that to work for us.

The problem is that acccording to him (I didn't double check the latest
patches) you effectively hotplug the PCIe memory into the system when
creating struct pages.

This cannot possibly work for us. First we cannot map PCIe memory as
cachable. (Note that doing so is a bad idea if you are behind a PLX
switch anyway since you'd ahve to manage cache coherency in SW).

Then our MMIO space is so far away from our memory space that there is
not enough vmemmap virtual space to be able to do that.

So this can only work accross achitectures by using something like HMM
to create special device struct page's.

Ben.


> Here's v2 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.16-rc3 which already
> includes Christoph's devpagemap work the previous version was based
> off as well as a couple of the cleanup patches that were in v1.
> 
> Additionally, we've made the following changes based on feedback:
> 
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
>   as a bunch of cleanup and spelling fixes he pointed out in the last
>   series.
> 
> * To address Alex's ACS concerns, we change to a simpler method of
>   just disabling ACS behind switches for any kernel that has
>   CONFIG_PCI_P2PDMA.
> 
> * We also reject using devices that employ 'dma_virt_ops' which should
>   fairly simply handle Jason's concerns that this work might break with
>   the HFI, QIB and rxe drivers that use the virtual ops to implement
>   their own special DMA operations.
> 
> Thanks,
> 
> Logan
> 
> --
> 
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> provided valuable feedback to get these patches to where they are today.
> 
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, presently, the
> trade-off is currently a reduction in overall throughput. (Largely due
> to hardware issues that would certainly improve in the future).
> 
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch. This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We chose to go this route based on feedback we
> received at the last LSF). Future work may enable these transfers behind
> a fabric of PCI switches or perhaps using a white list of known good
> root complexes.
> 
> In order to enable this functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
> 
> Another set of functions allow a client driver to create a list of
> client devices that will be used in a given P2P transactions and then
> use that list to find any P2P memory that is supported by all the
> client devices. This list is then also used to selectively disable the
> ACS bits for the downstream ports behind these devices.
> 
> In the block layer, we also introduce a P2P request flag to indicate a
> given request targets P2P memory as well as a flag for a request queue
> to indicate a given queue supports targeting P2P memory. P2P requests
> will only be accepted by queues that support it. Also, P2P requests
> are marked to not be merged seeing a non-homogenous request would
> complicate the DMA mapping requirements.
> 
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the corrent pci_bus_addr_t for the DMA transactions.
> 
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not.
> 
> Finally, in the NVMe fabrics target port we introduce a new
> configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> to find P2P memory supported by the RDMA NIC and all namespaces. If
> supported memory is found, it will be used in all IO transfers. And if
> a port is using P2P memory, adding new namespaces that are not supported
> by that memory will fail.
> 
> Logan Gunthorpe (10):
>   PCI/P2PDMA: Support peer to peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci |  25 ++
>  block/blk-core.c                        |   3 +
>  drivers/infiniband/core/rw.c            |  21 +-
>  drivers/infiniband/ulp/isert/ib_isert.c |   5 +-
>  drivers/infiniband/ulp/srpt/ib_srpt.c   |   7 +-
>  drivers/nvme/host/core.c                |   4 +
>  drivers/nvme/host/nvme.h                |   8 +
>  drivers/nvme/host/pci.c                 | 118 ++++--
>  drivers/nvme/target/configfs.c          |  29 ++
>  drivers/nvme/target/core.c              |  95 ++++-
>  drivers/nvme/target/io-cmd.c            |   3 +
>  drivers/nvme/target/nvmet.h             |  10 +
>  drivers/nvme/target/rdma.c              |  43 +-
>  drivers/pci/Kconfig                     |  20 +
>  drivers/pci/Makefile                    |   1 +
>  drivers/pci/p2pdma.c                    | 713 ++++++++++++++++++++++++++++++++
>  drivers/pci/pci.c                       |   4 +
>  include/linux/blk_types.h               |  18 +-
>  include/linux/blkdev.h                  |   3 +
>  include/linux/memremap.h                |  19 +
>  include/linux/pci-p2pdma.h              | 105 +++++
>  include/linux/pci.h                     |   4 +
>  include/rdma/rw.h                       |   7 +-
>  net/sunrpc/xprtrdma/svc_rdma_rw.c       |   6 +-
>  24 files changed, 1204 insertions(+), 67 deletions(-)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h
> 
> --
> 2.11.0

^ permalink raw reply	[flat|nested] 504+ messages in thread

* [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-03-01  3:54   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01  3:54 UTC (permalink / raw)


On Wed, 2018-02-28@16:39 -0700, Logan Gunthorpe wrote:
> Hi Everyone,


So Oliver (CC) was having issues getting any of that to work for us.

The problem is that acccording to him (I didn't double check the latest
patches) you effectively hotplug the PCIe memory into the system when
creating struct pages.

This cannot possibly work for us. First we cannot map PCIe memory as
cachable. (Note that doing so is a bad idea if you are behind a PLX
switch anyway since you'd ahve to manage cache coherency in SW).

Then our MMIO space is so far away from our memory space that there is
not enough vmemmap virtual space to be able to do that.

So this can only work accross achitectures by using something like HMM
to create special device struct page's.

Ben.


> Here's v2 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.16-rc3 which already
> includes Christoph's devpagemap work the previous version was based
> off as well as a couple of the cleanup patches that were in v1.
> 
> Additionally, we've made the following changes based on feedback:
> 
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
>   as a bunch of cleanup and spelling fixes he pointed out in the last
>   series.
> 
> * To address Alex's ACS concerns, we change to a simpler method of
>   just disabling ACS behind switches for any kernel that has
>   CONFIG_PCI_P2PDMA.
> 
> * We also reject using devices that employ 'dma_virt_ops' which should
>   fairly simply handle Jason's concerns that this work might break with
>   the HFI, QIB and rxe drivers that use the virtual ops to implement
>   their own special DMA operations.
> 
> Thanks,
> 
> Logan
> 
> --
> 
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> provided valuable feedback to get these patches to where they are today.
> 
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, presently, the
> trade-off is currently a reduction in overall throughput. (Largely due
> to hardware issues that would certainly improve in the future).
> 
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch. This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We chose to go this route based on feedback we
> received at the last LSF). Future work may enable these transfers behind
> a fabric of PCI switches or perhaps using a white list of known good
> root complexes.
> 
> In order to enable this functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
> 
> Another set of functions allow a client driver to create a list of
> client devices that will be used in a given P2P transactions and then
> use that list to find any P2P memory that is supported by all the
> client devices. This list is then also used to selectively disable the
> ACS bits for the downstream ports behind these devices.
> 
> In the block layer, we also introduce a P2P request flag to indicate a
> given request targets P2P memory as well as a flag for a request queue
> to indicate a given queue supports targeting P2P memory. P2P requests
> will only be accepted by queues that support it. Also, P2P requests
> are marked to not be merged seeing a non-homogenous request would
> complicate the DMA mapping requirements.
> 
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the corrent pci_bus_addr_t for the DMA transactions.
> 
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not.
> 
> Finally, in the NVMe fabrics target port we introduce a new
> configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> to find P2P memory supported by the RDMA NIC and all namespaces. If
> supported memory is found, it will be used in all IO transfers. And if
> a port is using P2P memory, adding new namespaces that are not supported
> by that memory will fail.
> 
> Logan Gunthorpe (10):
>   PCI/P2PDMA: Support peer to peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci |  25 ++
>  block/blk-core.c                        |   3 +
>  drivers/infiniband/core/rw.c            |  21 +-
>  drivers/infiniband/ulp/isert/ib_isert.c |   5 +-
>  drivers/infiniband/ulp/srpt/ib_srpt.c   |   7 +-
>  drivers/nvme/host/core.c                |   4 +
>  drivers/nvme/host/nvme.h                |   8 +
>  drivers/nvme/host/pci.c                 | 118 ++++--
>  drivers/nvme/target/configfs.c          |  29 ++
>  drivers/nvme/target/core.c              |  95 ++++-
>  drivers/nvme/target/io-cmd.c            |   3 +
>  drivers/nvme/target/nvmet.h             |  10 +
>  drivers/nvme/target/rdma.c              |  43 +-
>  drivers/pci/Kconfig                     |  20 +
>  drivers/pci/Makefile                    |   1 +
>  drivers/pci/p2pdma.c                    | 713 ++++++++++++++++++++++++++++++++
>  drivers/pci/pci.c                       |   4 +
>  include/linux/blk_types.h               |  18 +-
>  include/linux/blkdev.h                  |   3 +
>  include/linux/memremap.h                |  19 +
>  include/linux/pci-p2pdma.h              | 105 +++++
>  include/linux/pci.h                     |   4 +
>  include/rdma/rw.h                       |   7 +-
>  net/sunrpc/xprtrdma/svc_rdma_rw.c       |   6 +-
>  24 files changed, 1204 insertions(+), 67 deletions(-)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h
> 
> --
> 2.11.0

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01  3:54   ` Benjamin Herrenschmidt
  (?)
  (?)
@ 2018-03-01  3:56     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01  3:56 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Oliver OHalloran, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > Hi Everyone,
> 
> 
> So Oliver (CC) was having issues getting any of that to work for us.
> 
> The problem is that acccording to him (I didn't double check the latest
> patches) you effectively hotplug the PCIe memory into the system when
> creating struct pages.
> 
> This cannot possibly work for us. First we cannot map PCIe memory as
> cachable. (Note that doing so is a bad idea if you are behind a PLX
> switch anyway since you'd ahve to manage cache coherency in SW).

Note: I think the above means it won't work behind a switch on x86
either, will it ?

> Then our MMIO space is so far away from our memory space that there is
> not enough vmemmap virtual space to be able to do that.
> 
> So this can only work accross achitectures by using something like HMM
> to create special device struct page's.
> 
> Ben.
> 
> 
> > Here's v2 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.16-rc3 which already
> > includes Christoph's devpagemap work the previous version was based
> > off as well as a couple of the cleanup patches that were in v1.
> > 
> > Additionally, we've made the following changes based on feedback:
> > 
> > * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> >   as a bunch of cleanup and spelling fixes he pointed out in the last
> >   series.
> > 
> > * To address Alex's ACS concerns, we change to a simpler method of
> >   just disabling ACS behind switches for any kernel that has
> >   CONFIG_PCI_P2PDMA.
> > 
> > * We also reject using devices that employ 'dma_virt_ops' which should
> >   fairly simply handle Jason's concerns that this work might break with
> >   the HFI, QIB and rxe drivers that use the virtual ops to implement
> >   their own special DMA operations.
> > 
> > Thanks,
> > 
> > Logan
> > 
> > --
> > 
> > This is a continuation of our work to enable using Peer-to-Peer PCI
> > memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> > provided valuable feedback to get these patches to where they are today.
> > 
> > The concept here is to use memory that's exposed on a PCI BAR as
> > data buffers in the NVME target code such that data can be transferred
> > from an RDMA NIC to the special memory and then directly to an NVMe
> > device avoiding system memory entirely. The upside of this is better
> > QoS for applications running on the CPU utilizing memory and lower
> > PCI bandwidth required to the CPU (such that systems could be designed
> > with fewer lanes connected to the CPU). However, presently, the
> > trade-off is currently a reduction in overall throughput. (Largely due
> > to hardware issues that would certainly improve in the future).
> > 
> > Due to these trade-offs we've designed the system to only enable using
> > the PCI memory in cases where the NIC, NVMe devices and memory are all
> > behind the same PCI switch. This will mean many setups that could likely
> > work well will not be supported so that we can be more confident it
> > will work and not place any responsibility on the user to understand
> > their topology. (We chose to go this route based on feedback we
> > received at the last LSF). Future work may enable these transfers behind
> > a fabric of PCI switches or perhaps using a white list of known good
> > root complexes.
> > 
> > In order to enable this functionality, we introduce a few new PCI
> > functions such that a driver can register P2P memory with the system.
> > Struct pages are created for this memory using devm_memremap_pages()
> > and the PCI bus offset is stored in the corresponding pagemap structure.
> > 
> > Another set of functions allow a client driver to create a list of
> > client devices that will be used in a given P2P transactions and then
> > use that list to find any P2P memory that is supported by all the
> > client devices. This list is then also used to selectively disable the
> > ACS bits for the downstream ports behind these devices.
> > 
> > In the block layer, we also introduce a P2P request flag to indicate a
> > given request targets P2P memory as well as a flag for a request queue
> > to indicate a given queue supports targeting P2P memory. P2P requests
> > will only be accepted by queues that support it. Also, P2P requests
> > are marked to not be merged seeing a non-homogenous request would
> > complicate the DMA mapping requirements.
> > 
> > In the PCI NVMe driver, we modify the existing CMB support to utilize
> > the new PCI P2P memory infrastructure and also add support for P2P
> > memory in its request queue. When a P2P request is received it uses the
> > pci_p2pmem_map_sg() function which applies the necessary transformation
> > to get the corrent pci_bus_addr_t for the DMA transactions.
> > 
> > In the RDMA core, we also adjust rdma_rw_ctx_init() and
> > rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> > to use the PCI P2P mapping functions or not.
> > 
> > Finally, in the NVMe fabrics target port we introduce a new
> > configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> > to find P2P memory supported by the RDMA NIC and all namespaces. If
> > supported memory is found, it will be used in all IO transfers. And if
> > a port is using P2P memory, adding new namespaces that are not supported
> > by that memory will fail.
> > 
> > Logan Gunthorpe (10):
> >   PCI/P2PDMA: Support peer to peer memory
> >   PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >   block: Introduce PCI P2P flags for request and request queue
> >   IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
> >   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >   nvme-pci: Add support for P2P memory in requests
> >   nvme-pci: Add a quirk for a pseudo CMB
> >   nvmet: Optionally use PCI P2P memory
> > 
> >  Documentation/ABI/testing/sysfs-bus-pci |  25 ++
> >  block/blk-core.c                        |   3 +
> >  drivers/infiniband/core/rw.c            |  21 +-
> >  drivers/infiniband/ulp/isert/ib_isert.c |   5 +-
> >  drivers/infiniband/ulp/srpt/ib_srpt.c   |   7 +-
> >  drivers/nvme/host/core.c                |   4 +
> >  drivers/nvme/host/nvme.h                |   8 +
> >  drivers/nvme/host/pci.c                 | 118 ++++--
> >  drivers/nvme/target/configfs.c          |  29 ++
> >  drivers/nvme/target/core.c              |  95 ++++-
> >  drivers/nvme/target/io-cmd.c            |   3 +
> >  drivers/nvme/target/nvmet.h             |  10 +
> >  drivers/nvme/target/rdma.c              |  43 +-
> >  drivers/pci/Kconfig                     |  20 +
> >  drivers/pci/Makefile                    |   1 +
> >  drivers/pci/p2pdma.c                    | 713 ++++++++++++++++++++++++++++++++
> >  drivers/pci/pci.c                       |   4 +
> >  include/linux/blk_types.h               |  18 +-
> >  include/linux/blkdev.h                  |   3 +
> >  include/linux/memremap.h                |  19 +
> >  include/linux/pci-p2pdma.h              | 105 +++++
> >  include/linux/pci.h                     |   4 +
> >  include/rdma/rw.h                       |   7 +-
> >  net/sunrpc/xprtrdma/svc_rdma_rw.c       |   6 +-
> >  24 files changed, 1204 insertions(+), 67 deletions(-)
> >  create mode 100644 drivers/pci/p2pdma.c
> >  create mode 100644 include/linux/pci-p2pdma.h
> > 
> > --
> > 2.11.0

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-03-01  3:56     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01  3:56 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Alex Williamson,
	Oliver OHalloran

On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > Hi Everyone,
> 
> 
> So Oliver (CC) was having issues getting any of that to work for us.
> 
> The problem is that according to him (I didn't double check the latest
> patches) you effectively hotplug the PCIe memory into the system when
> creating struct pages.
> 
> This cannot possibly work for us. First, we cannot map PCIe memory as
> cacheable. (Note that doing so is a bad idea if you are behind a PLX
> switch anyway, since you'd have to manage cache coherency in SW.)

Note: I think the above means it won't work behind a switch on x86
either, will it?

> Then our MMIO space is so far away from our memory space that there is
> not enough vmemmap virtual space to be able to do that.
> 
> So this can only work across architectures by using something like HMM
> to create special device struct pages.
> 
> Ben.
> 
> 
> > Here's v2 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.16-rc3 which already
> > includes Christoph's devpagemap work the previous version was based
> > off as well as a couple of the cleanup patches that were in v1.
> > 
> > Additionally, we've made the following changes based on feedback:
> > 
> > * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> >   as a bunch of cleanup and spelling fixes he pointed out in the last
> >   series.
> > 
> > * To address Alex's ACS concerns, we change to a simpler method of
> >   just disabling ACS behind switches for any kernel that has
> >   CONFIG_PCI_P2PDMA.
> > 
> > * We also reject using devices that employ 'dma_virt_ops' which should
> >   fairly simply handle Jason's concerns that this work might break with
> >   the HFI, QIB and rxe drivers that use the virtual ops to implement
> >   their own special DMA operations.
> > 
> > Thanks,
> > 
> > Logan
> > 
> > --
> > 
> > This is a continuation of our work to enable using Peer-to-Peer PCI
> > memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> > provided valuable feedback to get these patches to where they are today.
> > 
> > The concept here is to use memory that's exposed on a PCI BAR as
> > data buffers in the NVME target code such that data can be transferred
> > from an RDMA NIC to the special memory and then directly to an NVMe
> > device avoiding system memory entirely. The upside of this is better
> > QoS for applications running on the CPU utilizing memory and lower
> > PCI bandwidth required to the CPU (such that systems could be designed
> > with fewer lanes connected to the CPU). However, the trade-off is
> > currently a reduction in overall throughput. (This is largely due
> > to hardware issues that are expected to improve in the future.)
> > 
> > Due to these trade-offs we've designed the system to only enable using
> > the PCI memory in cases where the NIC, NVMe devices and memory are all
> > behind the same PCI switch. This will mean many setups that could likely
> > work well will not be supported so that we can be more confident it
> > will work and not place any responsibility on the user to understand
> > their topology. (We chose to go this route based on feedback we
> > received at the last LSF). Future work may enable these transfers behind
> > a fabric of PCI switches or perhaps using a white list of known good
> > root complexes.
> > 
> > In order to enable this functionality, we introduce a few new PCI
> > functions such that a driver can register P2P memory with the system.
> > Struct pages are created for this memory using devm_memremap_pages()
> > and the PCI bus offset is stored in the corresponding pagemap structure.
> > 
> > Another set of functions allows a client driver to create a list of
> > client devices that will be used in a given P2P transaction and then
> > use that list to find any P2P memory that is supported by all the
> > client devices. This list is then also used to selectively disable the
> > ACS bits for the downstream ports behind these devices.
> > 
> > In the block layer, we also introduce a P2P request flag to indicate a
> > given request targets P2P memory as well as a flag for a request queue
> > to indicate a given queue supports targeting P2P memory. P2P requests
> > will only be accepted by queues that support it. Also, P2P requests
> > are marked not to be merged, since a non-homogeneous request would
> > complicate the DMA mapping requirements.
> > 
> > In the PCI NVMe driver, we modify the existing CMB support to utilize
> > the new PCI P2P memory infrastructure and also add support for P2P
> > memory in its request queue. When a P2P request is received it uses the
> > pci_p2pmem_map_sg() function which applies the necessary transformation
> > to get the correct pci_bus_addr_t for the DMA transactions.
> > 
> > In the RDMA core, we also adjust rdma_rw_ctx_init() and
> > rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> > to use the PCI P2P mapping functions or not.
> > 
> > Finally, in the NVMe fabrics target port we introduce a new
> > configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> > to find P2P memory supported by the RDMA NIC and all namespaces. If
> > supported memory is found, it will be used in all IO transfers. And if
> > a port is using P2P memory, adding new namespaces that are not supported
> > by that memory will fail.
> > 
> > Logan Gunthorpe (10):
> >   PCI/P2PDMA: Support peer to peer memory
> >   PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >   block: Introduce PCI P2P flags for request and request queue
> >   IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
> >   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >   nvme-pci: Add support for P2P memory in requests
> >   nvme-pci: Add a quirk for a pseudo CMB
> >   nvmet: Optionally use PCI P2P memory
> > 
> >  Documentation/ABI/testing/sysfs-bus-pci |  25 ++
> >  block/blk-core.c                        |   3 +
> >  drivers/infiniband/core/rw.c            |  21 +-
> >  drivers/infiniband/ulp/isert/ib_isert.c |   5 +-
> >  drivers/infiniband/ulp/srpt/ib_srpt.c   |   7 +-
> >  drivers/nvme/host/core.c                |   4 +
> >  drivers/nvme/host/nvme.h                |   8 +
> >  drivers/nvme/host/pci.c                 | 118 ++++--
> >  drivers/nvme/target/configfs.c          |  29 ++
> >  drivers/nvme/target/core.c              |  95 ++++-
> >  drivers/nvme/target/io-cmd.c            |   3 +
> >  drivers/nvme/target/nvmet.h             |  10 +
> >  drivers/nvme/target/rdma.c              |  43 +-
> >  drivers/pci/Kconfig                     |  20 +
> >  drivers/pci/Makefile                    |   1 +
> >  drivers/pci/p2pdma.c                    | 713 ++++++++++++++++++++++++++++++++
> >  drivers/pci/pci.c                       |   4 +
> >  include/linux/blk_types.h               |  18 +-
> >  include/linux/blkdev.h                  |   3 +
> >  include/linux/memremap.h                |  19 +
> >  include/linux/pci-p2pdma.h              | 105 +++++
> >  include/linux/pci.h                     |   4 +
> >  include/rdma/rw.h                       |   7 +-
> >  net/sunrpc/xprtrdma/svc_rdma_rw.c       |   6 +-
> >  24 files changed, 1204 insertions(+), 67 deletions(-)
> >  create mode 100644 drivers/pci/p2pdma.c
> >  create mode 100644 include/linux/pci-p2pdma.h
> > 
> > --
> > 2.11.0

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-02-28 23:39 ` Logan Gunthorpe
@ 2018-03-01 10:31   ` Sagi Grimberg
  -1 siblings, 0 replies; 504+ messages in thread
From: Sagi Grimberg @ 2018-03-01 10:31 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig


> Hi Everyone,

Hi Logan,

> Here's v2 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.16-rc3 which already
> includes Christoph's devpagemap work the previous version was based
> off as well as a couple of the cleanup patches that were in v1.
> 
> Additionally, we've made the following changes based on feedback:
> 
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
>    as a bunch of cleanup and spelling fixes he pointed out in the last
>    series.
> 
> * To address Alex's ACS concerns, we change to a simpler method of
>    just disabling ACS behind switches for any kernel that has
>    CONFIG_PCI_P2PDMA.
> 
> * We also reject using devices that employ 'dma_virt_ops' which should
>    fairly simply handle Jason's concerns that this work might break with
>    the HFI, QIB and rxe drivers that use the virtual ops to implement
>    their own special DMA operations.

That's good, but what would happen for these devices? Simply fail the
mapping, causing the ULP to fail its RDMA operation? I would think
that we need a capability flag for devices that support it.
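
For context, the rejection described in the cover letter can be as
simple as a check like the following when a client is added (a sketch;
dma_virt_ops is the software DMA-ops symbol used by hfi1, qib and rxe,
but the exact placement inside p2pdma.c is not shown in this thread):

	static bool example_client_supports_p2pdma(struct device *dev)
	{
		/*
		 * dma_virt_ops hands back kernel virtual addresses as
		 * "DMA addresses", so PCI bus addresses pointing at P2P
		 * memory can never be valid for such devices.
		 */
		return dev->dma_ops != &dma_virt_ops;
	}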

* Re: [PATCH v2 06/10] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
  2018-02-28 23:40   ` Logan Gunthorpe
@ 2018-03-01 10:32     ` Sagi Grimberg
  -1 siblings, 0 replies; 504+ messages in thread
From: Sagi Grimberg @ 2018-03-01 10:32 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 03/01/2018 01:40 AM, Logan Gunthorpe wrote:
> In order to use PCI P2P memory, the pci_p2pmem_[un]map_sg() functions
> must be called to map the correct DMA address. To do this, we add a flags
> variable and the RDMA_RW_CTX_FLAG_PCI_P2P flag. When the flag is
> specified use the appropriate map function.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/infiniband/core/rw.c            | 21 +++++++++++++++++----
>   drivers/infiniband/ulp/isert/ib_isert.c |  5 +++--
>   drivers/infiniband/ulp/srpt/ib_srpt.c   |  7 ++++---
>   drivers/nvme/target/rdma.c              |  6 +++---
>   include/rdma/rw.h                       |  7 +++++--
>   net/sunrpc/xprtrdma/svc_rdma_rw.c       |  6 +++---
>   6 files changed, 35 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
> index c8963e91f92a..775a9f8b15a6 100644
> --- a/drivers/infiniband/core/rw.c
> +++ b/drivers/infiniband/core/rw.c
> @@ -12,6 +12,7 @@
>    */
>   #include <linux/moduleparam.h>
>   #include <linux/slab.h>
> +#include <linux/pci-p2pdma.h>
>   #include <rdma/mr_pool.h>
>   #include <rdma/rw.h>
>   
> @@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
>    * @remote_addr:remote address to read/write (relative to @rkey)
>    * @rkey:	remote key to operate on
>    * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
> + * @flags:      any of the RDMA_RW_CTX_FLAG_* flags
>    *
>    * Returns the number of WQEs that will be needed on the workqueue if
>    * successful, or a negative error code.
>    */
>   int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
>   		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
> -		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> +		u64 remote_addr, u32 rkey, enum dma_data_direction dir,
> +		unsigned int flags)
>   {
>   	struct ib_device *dev = qp->pd->device;
>   	int ret;
>   
> -	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
> +	if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
> +		ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
> +	else
> +		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
> +

Why not use is_pci_p2pdma_page(sg) instead of a flag? It would be so
much cleaner...
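
Concretely, that suggestion would turn the flag test above into a
per-SG check, something like:

	/* Infer P2P-ness from the pages themselves instead of a flag. */
	if (is_pci_p2pdma_page(sg_page(sg)))
		ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
	else
		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);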

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-02-28 23:40   ` Logan Gunthorpe
@ 2018-03-01 11:03     ` Sagi Grimberg
  -1 siblings, 0 replies; 504+ messages in thread
From: Sagi Grimberg @ 2018-03-01 11:03 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Steve Wise, Alex Williamson,
	Keith Busch, Jérôme Glisse, Jason Gunthorpe,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig


> We create a configfs attribute in each nvme-fabrics target port to
> enable p2p memory use. When enabled, the port will use the p2p
> memory only if a p2p memory device can be found which is behind the
> same switch as the RDMA port and all the block devices in use. If
> the user enabled it and no devices are found, then the system will
> silently fall back on using regular memory.
> 
> If appropriate, that port will allocate memory for the RDMA buffers
> for queues from the p2pmem device falling back to system memory should
> anything fail.

Nice.

> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> save an extra PCI transfer as the NVME card could just take the data
> out of its own memory. However, at this time, cards with CMB buffers
> don't seem to be available.

Can you describe what would be the plan to have it when these devices
do come along? I'd say that p2p_dev needs to become an nvmet_ns
reference rather than an nvmet_ctrl one. Then, when CMB-capable
devices come along, the ns can prefer to use its own CMB instead of
locating a p2p_dev device?

> +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
> +				   struct nvmet_ns *ns)
> +{
> +	int ret;
> +
> +	if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
> +		pr_err("peer-to-peer DMA is not supported by %s\n",
> +		       ns->device_path);
> +		return -EINVAL;

I'd say just skip it instead of failing: in theory one can
connect NVMe devices via p2p mem and expose other devices in the
same subsystem. The message should be pr_debug to reduce chattiness.

> +	}
> +
> +	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
> +	if (ret)
> +		pr_err("failed to add peer-to-peer DMA client %s: %d\n",
> +		       ns->device_path, ret);
> +
> +	return ret;
> +}
> +
>   int nvmet_ns_enable(struct nvmet_ns *ns)
>   {
>   	struct nvmet_subsys *subsys = ns->subsys;
> @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
>   	if (ret)
>   		goto out_blkdev_put;
>   
> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> +		if (ctrl->p2p_dev) {
> +			ret = nvmet_p2pdma_add_client(ctrl, ns);
> +			if (ret)
> +				goto out_remove_clients;

Is this really a fatal failure, given that we fall back to main
memory? Why not continue with main memory (and warn at best)?

> +/*
> + * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
> + * I/O commands. This requires the PCI p2p device to be compatible with the
> + * backing device for every namespace on this controller.
> + */
> +static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
> +{
> +	struct nvmet_ns *ns;
> +	int ret;
> +
> +	if (!req->port->allow_p2pmem || !req->p2p_client)
> +		return;
> +
> +	mutex_lock(&ctrl->subsys->lock);
> +
> +	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
> +	if (ret) {
> +		pr_err("failed adding peer-to-peer DMA client %s: %d\n",
> +		       dev_name(req->p2p_client), ret);
> +		goto free_devices;
> +	}
> +
> +	list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
> +		ret = nvmet_p2pdma_add_client(ctrl, ns);
> +		if (ret)
> +			goto free_devices;
> +	}
> +
> +	ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);

This is the first p2p_dev found, right? What happens if I have more than
a single p2p device? In theory I'd have more p2p memory I can use. Have
you considered making pci_p2pmem_find return the least used suitable
device?

> +	if (!ctrl->p2p_dev) {
> +		pr_info("no supported peer-to-peer memory devices found\n");
> +		goto free_devices;
> +	}
> +	mutex_unlock(&ctrl->subsys->lock);
> +
> +	pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
> +	return;
> +
> +free_devices:
> +	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
> +	mutex_unlock(&ctrl->subsys->lock);
> +}
> +
> +static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
> +{
> +	if (!ctrl->p2p_dev)
> +		return;
> +
> +	mutex_lock(&ctrl->subsys->lock);
> +
> +	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
> +	pci_dev_put(ctrl->p2p_dev);
> +	ctrl->p2p_dev = NULL;
> +
> +	mutex_unlock(&ctrl->subsys->lock);
> +}
> +
>   u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
>   		struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
>   {
> @@ -800,6 +890,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
>   
>   	INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
>   	INIT_LIST_HEAD(&ctrl->async_events);
> +	INIT_LIST_HEAD(&ctrl->p2p_clients);
>   
>   	memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
>   	memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
> @@ -855,6 +946,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
>   		ctrl->kato = DIV_ROUND_UP(kato, 1000);
>   	}
>   	nvmet_start_keep_alive_timer(ctrl);
> +	nvmet_setup_p2pmem(ctrl, req);
>   
>   	mutex_lock(&subsys->lock);
>   	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
> @@ -891,6 +983,7 @@ static void nvmet_ctrl_free(struct kref *ref)
>   	flush_work(&ctrl->async_event_work);
>   	cancel_work_sync(&ctrl->fatal_err_work);
>   
> +	nvmet_release_p2pmem(ctrl);
>   	ida_simple_remove(&cntlid_ida, ctrl->cntlid);
>   
>   	kfree(ctrl->sqs);
> diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
> index 28bbdff4a88b..a213f8fc3bf3 100644
> --- a/drivers/nvme/target/io-cmd.c
> +++ b/drivers/nvme/target/io-cmd.c
> @@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
>   		op = REQ_OP_READ;
>   	}
>   
> +	if (is_pci_p2pdma_page(sg_page(req->sg)))
> +		op_flags |= REQ_PCI_P2PDMA;
> +
>   	sector = le64_to_cpu(req->cmd->rw.slba);
>   	sector <<= (req->ns->blksize_shift - 9);
>   
> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index 417f6c0331cc..85a170914588 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
>   	return container_of(to_config_group(item), struct nvmet_ns, group);
>   }
>   
> +static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
> +{
> +	return disk_to_dev(ns->bdev->bd_disk);
> +}
> +
>   struct nvmet_cq {
>   	u16			qid;
>   	u16			size;
> @@ -98,6 +103,7 @@ struct nvmet_port {
>   	struct list_head		referrals;
>   	void				*priv;
>   	bool				enabled;
> +	bool				allow_p2pmem;
>   };
>   
>   static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
> @@ -131,6 +137,8 @@ struct nvmet_ctrl {
>   	struct work_struct	fatal_err_work;
>   
>   	struct nvmet_fabrics_ops *ops;
> +	struct pci_dev		*p2p_dev;
> +	struct list_head	p2p_clients;
>   
>   	char			subsysnqn[NVMF_NQN_FIELD_LEN];
>   	char			hostnqn[NVMF_NQN_FIELD_LEN];
> @@ -232,6 +240,8 @@ struct nvmet_req {
>   
>   	void (*execute)(struct nvmet_req *req);
>   	struct nvmet_fabrics_ops *ops;
> +
> +	struct device *p2p_client;
>   };
>   
>   static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 020354e11351..7a1f09995ed5 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -23,6 +23,7 @@
>   #include <linux/string.h>
>   #include <linux/wait.h>
>   #include <linux/inet.h>
> +#include <linux/pci-p2pdma.h>
>   #include <asm/unaligned.h>
>   
>   #include <rdma/ib_verbs.h>
> @@ -68,6 +69,7 @@ struct nvmet_rdma_rsp {
>   	u8			n_rdma;
>   	u32			flags;
>   	u32			invalidate_rkey;
> +	struct pci_dev		*p2p_dev;

Given that p2p_client is in nvmet_req, I think it makes sense
that the p2p_dev itself would also live there. In theory, nothing
is preventing FC from using it as well.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
@ 2018-03-01 11:03     ` Sagi Grimberg
  0 siblings, 0 replies; 504+ messages in thread
From: Sagi Grimberg @ 2018-03-01 11:03 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
	Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
	Jérôme Glisse, Benjamin Herrenschmidt, Alex Williamson,
	Steve Wise


> We create a configfs attribute in each nvme-fabrics target port to
> enable p2p memory use. When enabled, the port will only then use the
> p2p memory if a p2p memory device can be found which is behind the
> same switch as the RDMA port and all the block devices in use. If
> the user enabled it an no devices are found, then the system will
> silently fall back on using regular memory.
> 
> If appropriate, that port will allocate memory for the RDMA buffers
> for queues from the p2pmem device falling back to system memory should
> anything fail.

Nice.

> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> save an extra PCI transfer as the NVME card could just take the data
> out of it's own memory. However, at this time, cards with CMB buffers
> don't seem to be available.

Can you describe what would be the plan to have it when these devices
do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
and not from nvmet_ctrl. Then, when cmb capable devices come along, the
ns can prefer to use its own cmb instead of locating a p2p_dev device?

> +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
> +				   struct nvmet_ns *ns)
> +{
> +	int ret;
> +
> +	if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
> +		pr_err("peer-to-peer DMA is not supported by %s\n",
> +		       ns->device_path);
> +		return -EINVAL;

I'd say that just skip it instead of failing it, in theory one can
connect nvme devices via p2p mem and expose other devices in the
same subsystem. The message would be pr_debug to reduce chattiness.

> +	}
> +
> +	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
> +	if (ret)
> +		pr_err("failed to add peer-to-peer DMA client %s: %d\n",
> +		       ns->device_path, ret);
> +
> +	return ret;
> +}
> +
>   int nvmet_ns_enable(struct nvmet_ns *ns)
>   {
>   	struct nvmet_subsys *subsys = ns->subsys;
> @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
>   	if (ret)
>   		goto out_blkdev_put;
>   
> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> +		if (ctrl->p2p_dev) {
> +			ret = nvmet_p2pdma_add_client(ctrl, ns);
> +			if (ret)
> +				goto out_remove_clients;

Is this really a fatal failure given that we fall-back to main
memory? Why not continue with main memory (and warn at best)?

> +/*
> + * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
> + * Ι/O commands. This requires the PCI p2p device to be compatible with the
> + * backing device for every namespace on this controller.
> + */
> +static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
> +{
> +	struct nvmet_ns *ns;
> +	int ret;
> +
> +	if (!req->port->allow_p2pmem || !req->p2p_client)
> +		return;
> +
> +	mutex_lock(&ctrl->subsys->lock);
> +
> +	ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
> +	if (ret) {
> +		pr_err("failed adding peer-to-peer DMA client %s: %d\n",
> +		       dev_name(req->p2p_client), ret);
> +		goto free_devices;
> +	}
> +
> +	list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
> +		ret = nvmet_p2pdma_add_client(ctrl, ns);
> +		if (ret)
> +			goto free_devices;
> +	}
> +
> +	ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);

This is the first p2p_dev found right? What happens if I have more than
a single p2p device? In theory I'd have more p2p memory I can use. Have
you considered making pci_p2pmem_find return the least used suitable
device?

> +	if (!ctrl->p2p_dev) {
> +		pr_info("no supported peer-to-peer memory devices found\n");
> +		goto free_devices;
> +	}
> +	mutex_unlock(&ctrl->subsys->lock);
> +
> +	pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
> +	return;
> +
> +free_devices:
> +	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
> +	mutex_unlock(&ctrl->subsys->lock);
> +}
> +
> +static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
> +{
> +	if (!ctrl->p2p_dev)
> +		return;
> +
> +	mutex_lock(&ctrl->subsys->lock);
> +
> +	pci_p2pdma_client_list_free(&ctrl->p2p_clients);
> +	pci_dev_put(ctrl->p2p_dev);
> +	ctrl->p2p_dev = NULL;
> +
> +	mutex_unlock(&ctrl->subsys->lock);
> +}
> +
>   u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
>   		struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
>   {
> @@ -800,6 +890,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
>   
>   	INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
>   	INIT_LIST_HEAD(&ctrl->async_events);
> +	INIT_LIST_HEAD(&ctrl->p2p_clients);
>   
>   	memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);
>   	memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
> @@ -855,6 +946,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
>   		ctrl->kato = DIV_ROUND_UP(kato, 1000);
>   	}
>   	nvmet_start_keep_alive_timer(ctrl);
> +	nvmet_setup_p2pmem(ctrl, req);
>   
>   	mutex_lock(&subsys->lock);
>   	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
> @@ -891,6 +983,7 @@ static void nvmet_ctrl_free(struct kref *ref)
>   	flush_work(&ctrl->async_event_work);
>   	cancel_work_sync(&ctrl->fatal_err_work);
>   
> +	nvmet_release_p2pmem(ctrl);
>   	ida_simple_remove(&cntlid_ida, ctrl->cntlid);
>   
>   	kfree(ctrl->sqs);
> diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
> index 28bbdff4a88b..a213f8fc3bf3 100644
> --- a/drivers/nvme/target/io-cmd.c
> +++ b/drivers/nvme/target/io-cmd.c
> @@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req)
>   		op = REQ_OP_READ;
>   	}
>   
> +	if (is_pci_p2pdma_page(sg_page(req->sg)))
> +		op_flags |= REQ_PCI_P2PDMA;
> +
>   	sector = le64_to_cpu(req->cmd->rw.slba);
>   	sector <<= (req->ns->blksize_shift - 9);
>   
> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index 417f6c0331cc..85a170914588 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)
>   	return container_of(to_config_group(item), struct nvmet_ns, group);
>   }
>   
> +static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
> +{
> +	return disk_to_dev(ns->bdev->bd_disk);
> +}
> +
>   struct nvmet_cq {
>   	u16			qid;
>   	u16			size;
> @@ -98,6 +103,7 @@ struct nvmet_port {
>   	struct list_head		referrals;
>   	void				*priv;
>   	bool				enabled;
> +	bool				allow_p2pmem;
>   };
>   
>   static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
> @@ -131,6 +137,8 @@ struct nvmet_ctrl {
>   	struct work_struct	fatal_err_work;
>   
>   	struct nvmet_fabrics_ops *ops;
> +	struct pci_dev		*p2p_dev;
> +	struct list_head	p2p_clients;
>   
>   	char			subsysnqn[NVMF_NQN_FIELD_LEN];
>   	char			hostnqn[NVMF_NQN_FIELD_LEN];
> @@ -232,6 +240,8 @@ struct nvmet_req {
>   
>   	void (*execute)(struct nvmet_req *req);
>   	struct nvmet_fabrics_ops *ops;
> +
> +	struct device *p2p_client;
>   };
>   
>   static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 020354e11351..7a1f09995ed5 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -23,6 +23,7 @@
>   #include <linux/string.h>
>   #include <linux/wait.h>
>   #include <linux/inet.h>
> +#include <linux/pci-p2pdma.h>
>   #include <asm/unaligned.h>
>   
>   #include <rdma/ib_verbs.h>
> @@ -68,6 +69,7 @@ struct nvmet_rdma_rsp {
>   	u8			n_rdma;
>   	u32			flags;
>   	u32			invalidate_rkey;
> +	struct pci_dev		*p2p_dev;

Given that p2p_client is in nvmet_req, I think it makes sense for the
p2p_dev itself to also live there. In theory, nothing is preventing FC
from using it as well.
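
Roughly (sketch only; the other nvmet_req fields are elided):

struct nvmet_req {
	/* ... existing fields ... */
	struct device	*p2p_client;	/* DMA initiator, e.g. the RNIC */
	struct pci_dev	*p2p_dev;	/* p2p memory provider, NULL if none */
};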

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 09/10] nvme-pci: Add a quirk for a pseudo CMB
  2018-02-28 23:40   ` Logan Gunthorpe
@ 2018-03-01 11:03     ` Sagi Grimberg
  -1 siblings, 0 replies; 504+ messages in thread
From: Sagi Grimberg @ 2018-03-01 11:03 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Looks fine,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests
  2018-02-28 23:40   ` Logan Gunthorpe
@ 2018-03-01 11:07     ` Sagi Grimberg
  -1 siblings, 0 replies; 504+ messages in thread
From: Sagi Grimberg @ 2018-03-01 11:07 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig


> For P2P requests we must use the pci_p2pmem_[un]map_sg() functions
> instead of the dma_map_sg functions.
> 
> With that, we can then indicate PCI_P2P support in the request queue.
> For this, we create an NVME_F_PCI_P2P flag which tells the core to
> set QUEUE_FLAG_PCI_P2P in the request queue.

This looks fine to me,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Any plans to add the capability to nvme-rdma? Should be
straightforward... In theory, the use-case would be an RDMA backend
fabric behind it. Shouldn't be hard to test either...

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 05/10] block: Introduce PCI P2P flags for request and request queue
  2018-02-28 23:40   ` Logan Gunthorpe
@ 2018-03-01 11:08     ` Sagi Grimberg
  -1 siblings, 0 replies; 504+ messages in thread
From: Sagi Grimberg @ 2018-03-01 11:08 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Looks fine,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests
  2018-03-01 11:07     ` Sagi Grimberg
@ 2018-03-01 15:58       ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 15:58 UTC (permalink / raw)
  To: Sagi Grimberg, Logan Gunthorpe, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

> Any plans to add the capability to nvme-rdma? Should be
> straightforward... In theory, the use-case would be an RDMA backend
> fabric behind it. Shouldn't be hard to test either...

Nice idea Sagi. Yes, we have been starting to look at that. Though again we would probably want to impose the "attached to the same PCIe switch" rule, which might be less common to satisfy in initiator systems.

Down the road I would also like to discuss the best way to use this P2P framework to facilitate copies between NVMe namespaces (on both PCIe and fabric-attached namespaces) without having to expose the CMB up to user space. Wasn't something like that done in the SCSI world at some point, Martin?

Stephen
    


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 11:03     ` Sagi Grimberg
@ 2018-03-01 16:15       ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 16:15 UTC (permalink / raw)
  To: Sagi Grimberg, Logan Gunthorpe, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Steve Wise, Alex Williamson,
	Keith Busch, Jérôme Glisse, Jason Gunthorpe,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

> > Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> > save an extra PCI transfer as the NVME card could just take the data
> > out of its own memory. However, at this time, cards with CMB buffers
> > don't seem to be available.

> Can you describe what would be the plan to have it when these devices
> do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
> and not from nvmet_ctrl. Then, when cmb capable devices come along, the
> ns can prefer to use its own cmb instead of locating a p2p_dev device?

Hi Sagi

Thanks for the review! That commit message is somewhat dated, as NVMe controllers with CMBs that support RDS and WDS are now commercially available [1]. However, we have not yet tried any kind of optimization around determining which p2p_dev to use. Your suggestion above looks good and we can look into this kind of optimization in due course.

[1] http://www.eideticom.com/uploads/images/NoLoad_Product_Spec.pdf
    
>> +	ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);

> This is the first p2p_dev found, right? What happens if I have more than
> a single p2p device? In theory I'd have more p2p memory I can use. Have
> you considered making pci_p2pmem_find return the least-used suitable
> device?
    
Yes, pci_p2pmem_find will always return the first valid p2p_dev found. At the very least we should update this to allocate across all the valid p2p_devs. Since the load on any given p2p_dev will vary over time, I think a random allocation across the devices makes sense (at least for now).
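
A random pick could look something like this (untested sketch;
pci_p2pmem_find_next() is a hypothetical iterator over the suitable
devices for the client list):

#include <linux/random.h>

struct pci_dev *pci_p2pmem_find_random(struct list_head *clients)
{
	struct pci_dev *pdev = NULL, *chosen = NULL;
	unsigned int n = 0;

	while ((pdev = pci_p2pmem_find_next(clients, pdev))) {
		/*
		 * Reservoir sampling: keep the n-th candidate with
		 * probability 1/n, so the result is uniform without
		 * walking the list twice.
		 */
		if (prandom_u32_max(++n) == 0)
			chosen = pdev;
	}

	/* caller would take a device reference, as pci_p2pmem_find() does */
	return chosen;
}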

Stephen


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 06/10] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
  2018-03-01 10:32     ` Sagi Grimberg
@ 2018-03-01 17:16       ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 17:16 UTC (permalink / raw)
  To: Sagi Grimberg, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

Hey Sagi,

Thanks for the review!


On 01/03/18 03:32 AM, Sagi Grimberg wrote:
>>   int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 
>> port_num,
>>           struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
>> -        u64 remote_addr, u32 rkey, enum dma_data_direction dir)
>> +        u64 remote_addr, u32 rkey, enum dma_data_direction dir,
>> +        unsigned int flags)
>>   {
>>       struct ib_device *dev = qp->pd->device;
>>       int ret;
>> -    ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
>> +    if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
>> +        ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
>> +    else
>> +        ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
>> +
> 
> Why not use is_pci_p2pdma_page(sg) instead of a flag? It would be so
> much cleaner...

Yes, that sounds like a good idea. We can make that change for v3.
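
Something like this for the map path, perhaps (untested sketch):

static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
			  u32 sg_cnt, enum dma_data_direction dir)
{
	/* infer p2p mappings from the pages themselves, no flag needed */
	if (is_pci_p2pdma_page(sg_page(sg)))
		return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);

	return ib_dma_map_sg(dev, sg, sg_cnt, dir);
}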

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
  2018-02-28 23:39     ` Logan Gunthorpe
@ 2018-03-01 17:37       ` Bjorn Helgaas
  -1 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 17:37 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

s/peer to peer/peer-to-peer/ to match text below and in spec.

On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.

s/Peer-to-Peer/peer-to-peer/ to match spec and typical usage.

Is there anything about this memory that makes it specifically
intended for peer-to-peer transactions?  I assume the device can't
really tell whether a transaction is from a CPU or a peer.

> A kernel interface is provided so that other subsystems can find and
> allocate chunks of P2P memory as necessary to facilitate transfers
> between two PCI peers. Depending on hardware, this may reduce the
> bandwidth of the transfer but would significantly reduce pressure
> on system memory. This may be desirable in many cases: for example a
> system could be designed with a small CPU connected to a PCI switch by a
> small number of lanes which would maximize the number of lanes available
> to connect to NVME devices.

"A kernel interface is provided" could mean "the kernel provides an
interface", independent of anything this patch does, but I think you
mean *this patch specifically* adds the interface.

Maybe something like:

  Add interfaces for other subsystems to find and allocate ...:

    int pci_p2pdma_add_client();
    struct pci_dev *pci_p2pmem_find();
    void *pci_alloc_p2pmem();

  This may reduce bandwidth of the transfer but significantly reduce
  ...

BTW, maybe there could be some kind of guide for device driver writers
in Documentation/PCI/?

> The interface requires a user driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary. The ACS bits on the
> downstream switch port will be managed for all the registered clients.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI switch. This is because
> using P2P transactions through the PCI root complex can have performance
> limitations or, worse, might not work at all. Finding out how well a
> particular RC supports P2P transfers is non-trivial. Additionally, the
> benefits of P2P transfers that go through the RC is limited to only
> reducing DRAM usage.

I think it would be clearer and sufficient to simply say that we have
no way to know whether peer-to-peer routing between PCIe Root Ports is
supported (PCIe r4.0, sec 1.3.1).

The fact that you use the PCIe term "switch" suggests that a PCIe
Switch is required, but isn't it sufficient for the peers to be below
the same "PCI bridge", which would include PCIe Root Ports, PCIe
Switch Downstream Ports, and conventional PCI bridges?

The comments at get_upstream_bridge_port() suggest that this isn't
enough, and the peers actually do have to be below the same PCIe
Switch, but I don't know why.

> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..840831418cbd 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,22 @@ config PCI_PASID
>  
>  	  If unsure, say N.
>  
> +config PCI_P2PDMA
> +	bool "PCI Peer to Peer transfer support"
> +	depends on ZONE_DEVICE
> +	select GENERIC_ALLOCATOR
> +	help
> +	  Enables drivers to do PCI peer to peer transactions to and from

s/peer to peer/peer-to-peer/ (in bool and help text)

> +	  BARs that are exposed in other devices that are the part of
> +	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +	  specification to work (ie. anything below a single PCI bridge).
> +
> +	  Many PCIe root complexes do not support P2P transactions and
> +	  it's hard to tell which support it with good performance, so
> +	  at this time you will need a PCIe switch.

Until we have a way to figure out which of them support P2P,
performance is a don't-care.

> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> new file mode 100644
> index 000000000000..ec0a6cb9e500
> --- /dev/null
> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,568 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PCI Peer 2 Peer DMA support.

s/Peer 2 Peer/peer-to-peer/

> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

I think the SPDX tag is meant to remove the need for including the
license text, so you should be able to remove this.  Oh, and one
trivial and annoying nit, I think for the SPDX tag, we're supposed to
use "//" in .c files and "/* */" in .h files.

> + * pci_p2pdma_add_resource - add memory for use as p2p memory
> + * @pci: the device to add the memory to

s/@pci/@pdev/

> + * @bar: PCI BAR to add
> + * @size: size of the memory to add, may be zero to use the whole BAR
> + * @offset: offset into the PCI BAR
> + *
> + * The memory will be given ZONE_DEVICE struct pages so that it may
> + * be used with any dma request.

s/dma/DMA/

> +int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> +			    u64 offset)
> +{
> +	struct dev_pagemap *pgmap;
> +	void *addr;
> +	int error;
> +
> +	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
> +		return -EINVAL;
> +
> +	if (offset >= pci_resource_len(pdev, bar))
> +		return -EINVAL;
> +
> +	if (!size)
> +		size = pci_resource_len(pdev, bar) - offset;
> +
> +	if (size + offset > pci_resource_len(pdev, bar))
> +		return -EINVAL;
> +
> +	if (!pdev->p2pdma) {
> +		error = pci_p2pdma_setup(pdev);
> +		if (error)
> +			return error;
> +	}
> +
> +	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> +	if (!pgmap)
> +		return -ENOMEM;
> +
> +	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
> +	pgmap->res.end = pgmap->res.start + size - 1;
> +	pgmap->res.flags = pci_resource_flags(pdev, bar);
> +	pgmap->ref = &pdev->p2pdma->devmap_ref;
> +	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> +
> +	addr = devm_memremap_pages(&pdev->dev, pgmap);
> +	if (IS_ERR(addr))

Free pgmap here?  And in the other error case below?  Or maybe this
happens via the devm_* magic?  If so, when would that actually happen?
Would pgmap be effectively leaked until the pdev is destroyed?
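
If the intent is to clean up eagerly rather than lean on devm, a sketch
using the generic devm_kfree() helper would be:

	addr = devm_memremap_pages(&pdev->dev, pgmap);
	if (IS_ERR(addr)) {
		/* Release the devm allocation now instead of at unbind. */
		devm_kfree(&pdev->dev, pgmap);
		return PTR_ERR(addr);
	}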

> +		return PTR_ERR(addr);
> +
> +	error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
> +			pci_bus_address(pdev, bar) + offset,
> +			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
> +	if (error)
> +		return error;
> +
> +	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
> +					  &pdev->p2pdma->devmap_ref);
> +	if (error)
> +		return error;
> +
> +	dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
> +		 &pgmap->res);

s/dev_info/pci_info/ (also similar usages below, except for the one or
two cases where you don't have a pci_dev).

> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);

> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge:
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.

s/pci_upstream_bridge/pci_upstream_bridge()/

This whole thing is confusing to me.  Why do you want to reject peers
directly connected to the same root port?  Why do you require the same
Switch Upstream Port?  You don't exclude conventional PCI, but it
looks like you would require peers to share *two* upstream PCI-to-PCI
bridges?  I would think a single shared upstream bridge (conventional,
PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?

Apologies if you've answered this before; maybe just include a little
explanation here so I don't ask again :)
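
A test for that single shared bridge would be roughly (a sketch only,
not what this patch implements):

	/* Peers are OK if they sit directly below the same bridge. */
	static bool share_upstream_bridge(struct pci_dev *a, struct pci_dev *b)
	{
		struct pci_dev *up = pci_upstream_bridge(a);

		return up && up == pci_upstream_bridge(b);
	}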

> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +{
> +	struct pci_dev *up1, *up2;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> +	if (!up1)
> +		return NULL;
> +
> +	up2 = pci_dev_get(pci_upstream_bridge(up1));
> +	pci_dev_put(up1);
> +
> +	return up2;
> +}
> +
> +static bool __upstream_bridges_match(struct pci_dev *upstream,
> +				     struct pci_dev *client)
> +{
> +	struct pci_dev *dma_up;
> +	bool ret = true;
> +
> +	dma_up = get_upstream_bridge_port(client);
> +
> +	if (!dma_up) {
> +		dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
> +		ret = false;
> +		goto out;
> +	}
> +
> +	if (upstream != dma_up) {
> +		dev_dbg(&client->dev,
> +			"does not reside on the same upstream bridge\n");
> +		ret = false;
> +		goto out;
> +	}
> +
> +out:
> +	pci_dev_put(dma_up);
> +	return ret;
> +}
> +
> +static bool upstream_bridges_match(struct pci_dev *pdev,
> +				   struct pci_dev *client)
> +{
> +	struct pci_dev *upstream;
> +	bool ret;
> +
> +	upstream = get_upstream_bridge_port(pdev);
> +	if (!upstream) {
> +		dev_warn(&pdev->dev, "not behind a PCI bridge\n");
> +		return false;
> +	}
> +
> +	ret = __upstream_bridges_match(upstream, client);
> +
> +	pci_dev_put(upstream);
> +
> +	return ret;
> +}
> +
> +struct pci_p2pdma_client {
> +	struct list_head list;
> +	struct pci_dev *client;
> +	struct pci_dev *p2pdma;

Maybe call this "peer" or something instead of "p2pdma", since p2pdma
is also used for struct pci_p2pdma things?
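
i.e., something like:

	struct pci_p2pdma_client {
		struct list_head list;
		struct pci_dev *client;
		struct pci_dev *peer;	/* provider of the P2P memory */
	};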

> + * pci_p2pdma_add_client - allocate a new element in a client device list
> + * @head: list head of p2pdma clients
> + * @dev: device to add to the list
> + *
> + * This adds @dev to a list of clients used by a p2pdma device.
> + * This list should be passed to p2pmem_find(). Once p2pmem_find() has
> + * been called successfully, the list will be bound to a specific p2pdma
> + * device and new clients can only be added to the list if they are
> + * supported by that p2pdma device.
> + *
> + * The caller is expected to have a lock which protects @head as necessary
> + * so that none of the pci_p2p functions can be called concurrently
> + * on that list.
> + *
> + * Returns 0 if the client was successfully added.
> + */
> +int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
> +{
> +	struct pci_p2pdma_client *item, *new_item;
> +	struct pci_dev *p2pdma = NULL;
> +	struct pci_dev *client;
> +	int ret;
> +
> +	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
> +		dev_warn(dev,
> +			 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
> +		return -ENODEV;
> +	}
> +
> +
> +	client = find_parent_pci_dev(dev);

Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
sort of weird that callers supply a non-PCI device and then we look up
a PCI device here.  I assume you have some reason for this; if you
added a writeup in Documentation/PCI, that would be a good place to
elaborate on that, maybe with a one-line clue here.
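
Presumably find_parent_pci_dev() just walks up the device tree until it
reaches a PCI device, which is why e.g. an RDMA device can be passed in
directly.  A sketch of such a walk (the real helper would also need to
take a reference, since callers pci_dev_put() the result):

	static struct pci_dev *find_parent_pci_dev(struct device *dev)
	{
		while (dev && !dev_is_pci(dev))
			dev = dev->parent;

		/* Callers pci_dev_put() the result, so grab a reference. */
		return dev ? pci_dev_get(to_pci_dev(dev)) : NULL;
	}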

> +	if (!client) {
> +		dev_warn(dev,
> +			 "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
> +		return -ENODEV;
> +	}
> +
> +	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
> +	if (item && item->p2pdma) {
> +		p2pdma = item->p2pdma;
> +
> +		if (!upstream_bridges_match(p2pdma, client)) {
> +			ret = -EXDEV;
> +			goto put_client;
> +		}
> +	}
> +
> +	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
> +	if (!new_item) {
> +		ret = -ENOMEM;
> +		goto put_client;
> +	}
> +
> +	new_item->client = client;
> +	new_item->p2pdma = pci_dev_get(p2pdma);
> +
> +	list_add_tail(&new_item->list, head);
> +
> +	return 0;
> +
> +put_client:
> +	pci_dev_put(client);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);

> + * pci_alloc_p2p_mem - allocate peer-to-peer DMA memory
> + * @pdev:	the device to allocate memory from
> + * @size:	number of bytes to allocate
> + *
> + * Returns the allocated memory or NULL on error.
> + */
> +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
> +{
> +	void *ret;
> +
> +	if (unlikely(!pdev->p2pdma))

Is this a hot path?  I'm not sure it's worth cluttering
non-performance paths with likely/unlikely.

> +		return NULL;
> +
> +	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
> +		return NULL;
> +
> +	ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);

Why the double cast?  Wouldn't "(void *)" be sufficient?
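
i.e., just:

	ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);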

> +	if (unlikely(!ret))
> +		percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
> +
> +/**
> + * pci_free_p2pmem - allocate peer-to-peer DMA memory
> + * @pdev:	the device the memory was allocated from
> + * @addr:	address of the memory that was allocated
> + * @size:	number of bytes that was allocated
> + */
> +void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
> +{
> +	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);

In v4.16-rc1, gen_pool_free() takes "unsigned long addr".  I know this
is based on -rc3; is this something that changed between -rc1 and
-rc3?

> +	percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +}
> +EXPORT_SYMBOL_GPL(pci_free_p2pmem);
> +
> +/**
> + * pci_virt_to_bus - return the PCI bus address for a given virtual
> + *	address obtained with pci_alloc_p2pmem

s/pci_alloc_p2pmem/pci_alloc_p2pmem()/

> + * @pdev:	the device the memory was allocated from
> + * @addr:	address of the memory that was allocated
> + */
> +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
> +{
> +	if (!addr)
> +		return 0;
> +	if (!pdev->p2pdma)
> +		return 0;
> +
> +	/*
> +	 * Note: when we added the memory to the pool we used the PCI
> +	 * bus address as the physical address. So gen_pool_virt_to_phys()
> +	 * actually returns the bus address despite the misleading name.
> +	 */
> +	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
> +
> +/**
> + * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in an scatterlist

s/an scatterlist/a scatterlist/

> + * @pdev:	the device to allocate memory from
> + * @sgl:	the allocated scatterlist
> + * @nents:      the number of SG entries in the list
> + * @length:     number of bytes to allocate
> + *
> + * Returns 0 on success
> + */
> +int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
> +			 unsigned int *nents, u32 length)
> +{
> +	struct scatterlist *sg;
> +	void *addr;
> +
> +	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
> +	if (!sg)
> +		return -ENOMEM;
> +
> +	sg_init_table(sg, 1);
> +
> +	addr = pci_alloc_p2pmem(pdev, length);
> +	if (!addr)
> +		goto out_free_sg;
> +
> +	sg_set_buf(sg, addr, length);
> +	*sgl = sg;
> +	*nents = 1;
> +	return 0;
> +
> +out_free_sg:
> +	kfree(sg);
> +	return -ENOMEM;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
> +
> +/**
> + * pci_p2pmem_free_sgl - free a scatterlist  allocated by pci_p2pmem_alloc_sgl

s/  allocated/ allocated/ (remove extra space)
s/pci_p2pmem_alloc_sgl/pci_p2pmem_alloc_sgl()/

> + * @pdev:	the device to allocate memory from
> + * @sgl:	the allocated scatterlist
> + * @nents:      the number of SG entries in the list
> + */
> +void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
> +			 unsigned int nents)
> +{
> +	struct scatterlist *sg;
> +	int count;
> +
> +	if (!sgl || !nents)
> +		return;
> +
> +	for_each_sg(sgl, sg, nents, count)
> +		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
> +	kfree(sgl);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
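
A typical caller would pair these two, along these lines (hypothetical
driver code, not from this series; p2p_pdev would come from
pci_p2pmem_find()):

	struct scatterlist *sgl;
	unsigned int nents;

	if (!pci_p2pmem_alloc_sgl(p2p_pdev, &sgl, &nents, SZ_4K)) {
		/* ... hand sgl to the DMA engine ... */
		pci_p2pmem_free_sgl(p2p_pdev, sgl, nents);
	}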
> +
> +/**
> + * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
> + *	other devices with pci_p2pmem_find

s/pci_p2pmem_find/pci_p2pmem_find()/

> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> new file mode 100644
> index 000000000000..c0dde3d3aac4
> --- /dev/null
> +++ b/include/linux/pci-p2pdma.h
> @@ -0,0 +1,87 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

Remove license text.

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
@ 2018-03-01 17:37       ` Bjorn Helgaas
  0 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 17:37 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt, Alex Williamson

s/peer to peer/peer-to-peer/ to match text below and in spec.

On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.

s/Peer-to-Peer/peer-to-peer/ to match spec and typical usage.

Is there anything about this memory that makes it specifically
intended for peer-to-peer transactions?  I assume the device can't
really tell whether a transaction is from a CPU or a peer.

> A kernel interface is provided so that other subsystems can find and
> allocate chunks of P2P memory as necessary to facilitate transfers
> between two PCI peers. Depending on hardware, this may reduce the
> bandwidth of the transfer but would significantly reduce pressure
> on system memory. This may be desirable in many cases: for example a
> system could be designed with a small CPU connected to a PCI switch by a
> small number of lanes which would maximize the number of lanes available
> to connect to NVME devices.

"A kernel interface is provided" could mean "the kernel provides an
interface", independent of anything this patch does, but I think you
mean *this patch specifically* adds the interface.

Maybe something like:

  Add interfaces for other subsystems to find and allocate ...:

    int pci_p2pdma_add_client();
    struct pci_dev *pci_p2pmem_find();
    void *pci_alloc_p2pmem();

  This may reduce bandwidth of the transfer but significantly reduce
  ...

BTW, maybe there could be some kind of guide for device driver writers
in Documentation/PCI/?

> The interface requires a user driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary. The ACS bits on the
> downstream switch port will be managed for all the registered clients.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI switch. This is because
> using P2P transactions through the PCI root complex can have performance
> limitations or, worse, might not work at all. Finding out how well a
> particular RC supports P2P transfers is non-trivial. Additionally, the
> benefits of P2P transfers that go through the RC is limited to only
> reducing DRAM usage.

I think it would be clearer and sufficient to simply say that we have
no way to know whether peer-to-peer routing between PCIe Root Ports is
supported (PCIe r4.0, sec 1.3.1).

The fact that you use the PCIe term "switch" suggests that a PCIe
Switch is required, but isn't it sufficient for the peers to be below
the same "PCI bridge", which would include PCIe Root Ports, PCIe
Switch Downstream Ports, and conventional PCI bridges?

The comments at get_upstream_bridge_port() suggest that this isn't
enough, and the peers actually do have to be below the same PCIe
Switch, but I don't know why.

> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..840831418cbd 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,22 @@ config PCI_PASID
>  
>  	  If unsure, say N.
>  
> +config PCI_P2PDMA
> +	bool "PCI Peer to Peer transfer support"
> +	depends on ZONE_DEVICE
> +	select GENERIC_ALLOCATOR
> +	help
> +	  Enableѕ drivers to do PCI peer to peer transactions to and from

s/peer to peer/peer-to-peer/ (in bool and help text)

> +	  BARs that are exposed in other devices that are the part of
> +	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +	  specification to work (ie. anything below a single PCI bridge).
> +
> +	  Many PCIe root complexes do not support P2P transactions and
> +	  it's hard to tell which support it with good performance, so
> +	  at this time you will need a PCIe switch.

Until we have a way to figure out which of them support P2P,
performance is a don't-care.

> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> new file mode 100644
> index 000000000000..ec0a6cb9e500
> --- /dev/null
> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,568 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PCI Peer 2 Peer DMA support.

s/Peer 2 Peer/peer-to-peer/

> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

I think the SPDX tag is meant to remove the need for including the
license text, so you should be able to remove this.  Oh, and one
trivial and annoying nit, I think for the SPDX tag, we're supposed to
use "//" in .c files and "/* */" in .h files.

> + * pci_p2pdma_add_resource - add memory for use as p2p memory
> + * @pci: the device to add the memory to

s/@pci/@pdev/

> + * @bar: PCI BAR to add
> + * @size: size of the memory to add, may be zero to use the whole BAR
> + * @offset: offset into the PCI BAR
> + *
> + * The memory will be given ZONE_DEVICE struct pages so that it may
> + * be used with any dma request.

s/dma/DMA/

> +int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> +			    u64 offset)
> +{
> +	struct dev_pagemap *pgmap;
> +	void *addr;
> +	int error;
> +
> +	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
> +		return -EINVAL;
> +
> +	if (offset >= pci_resource_len(pdev, bar))
> +		return -EINVAL;
> +
> +	if (!size)
> +		size = pci_resource_len(pdev, bar) - offset;
> +
> +	if (size + offset > pci_resource_len(pdev, bar))
> +		return -EINVAL;
> +
> +	if (!pdev->p2pdma) {
> +		error = pci_p2pdma_setup(pdev);
> +		if (error)
> +			return error;
> +	}
> +
> +	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> +	if (!pgmap)
> +		return -ENOMEM;
> +
> +	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
> +	pgmap->res.end = pgmap->res.start + size - 1;
> +	pgmap->res.flags = pci_resource_flags(pdev, bar);
> +	pgmap->ref = &pdev->p2pdma->devmap_ref;
> +	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> +
> +	addr = devm_memremap_pages(&pdev->dev, pgmap);
> +	if (IS_ERR(addr))

Free pgmap here?  And in the other error case below?  Or maybe this
happens via the devm_* magic?  If so, when would that actually happen?
Would pgmap be effectively leaked until the pdev is destroyed?

> +		return PTR_ERR(addr);
> +
> +	error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
> +			pci_bus_address(pdev, bar) + offset,
> +			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
> +	if (error)
> +		return error;
> +
> +	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
> +					  &pdev->p2pdma->devmap_ref);
> +	if (error)
> +		return error;
> +
> +	dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
> +		 &pgmap->res);

s/dev_info/pci_info/ (also similar usages below, except for the one or
two cases where you don't have a pci_dev).

> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);

> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge:
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.

s/pci_upstream_bridge/pci_upstream_bridge()/

This whole thing is confusing to me.  Why do you want to reject peers
directly connected to the same root port?  Why do you require the same
Switch Upstream Port?  You don't exclude conventional PCI, but it
looks like you would require peers to share *two* upstream PCI-to-PCI
bridges?  I would think a single shared upstream bridge (conventional,
PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?

Apologies if you've answered this before; maybe just include a little
explanation here so I don't ask again :)

> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +{
> +	struct pci_dev *up1, *up2;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> +	if (!up1)
> +		return NULL;
> +
> +	up2 = pci_dev_get(pci_upstream_bridge(up1));
> +	pci_dev_put(up1);
> +
> +	return up2;
> +}
> +
> +static bool __upstream_bridges_match(struct pci_dev *upstream,
> +				     struct pci_dev *client)
> +{
> +	struct pci_dev *dma_up;
> +	bool ret = true;
> +
> +	dma_up = get_upstream_bridge_port(client);
> +
> +	if (!dma_up) {
> +		dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
> +		ret = false;
> +		goto out;
> +	}
> +
> +	if (upstream != dma_up) {
> +		dev_dbg(&client->dev,
> +			"does not reside on the same upstream bridge\n");
> +		ret = false;
> +		goto out;
> +	}
> +
> +out:
> +	pci_dev_put(dma_up);
> +	return ret;
> +}
> +
> +static bool upstream_bridges_match(struct pci_dev *pdev,
> +				   struct pci_dev *client)
> +{
> +	struct pci_dev *upstream;
> +	bool ret;
> +
> +	upstream = get_upstream_bridge_port(pdev);
> +	if (!upstream) {
> +		dev_warn(&pdev->dev, "not behind a PCI bridge\n");
> +		return false;
> +	}
> +
> +	ret = __upstream_bridges_match(upstream, client);
> +
> +	pci_dev_put(upstream);
> +
> +	return ret;
> +}
> +
> +struct pci_p2pdma_client {
> +	struct list_head list;
> +	struct pci_dev *client;
> +	struct pci_dev *p2pdma;

Maybe call this "peer" or something instead of "p2pdma", since p2pdma
is also used for struct pci_p2pdma things?

> + * pci_p2pdma_add_client - allocate a new element in a client device list
> + * @head: list head of p2pdma clients
> + * @dev: device to add to the list
> + *
> + * This adds @dev to a list of clients used by a p2pdma device.
> + * This list should be passed to p2pmem_find(). Once p2pmem_find() has
> + * been called successfully, the list will be bound to a specific p2pdma
> + * device and new clients can only be added to the list if they are
> + * supported by that p2pdma device.
> + *
> + * The caller is expected to have a lock which protects @head as necessary
> + * so that none of the pci_p2p functions can be called concurrently
> + * on that list.
> + *
> + * Returns 0 if the client was successfully added.
> + */
> +int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
> +{
> +	struct pci_p2pdma_client *item, *new_item;
> +	struct pci_dev *p2pdma = NULL;
> +	struct pci_dev *client;
> +	int ret;
> +
> +	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
> +		dev_warn(dev,
> +			 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
> +		return -ENODEV;
> +	}
> +
> +
> +	client = find_parent_pci_dev(dev);

Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
sort of weird that callers supply a non-PCI device and then we look up
a PCI device here.  I assume you have some reason for this; if you
added a writeup in Documentation/PCI, that would be a good place to
elaborate on that, maybe with a one-line clue here.

> +	if (!client) {
> +		dev_warn(dev,
> +			 "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
> +		return -ENODEV;
> +	}
> +
> +	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
> +	if (item && item->p2pdma) {
> +		p2pdma = item->p2pdma;
> +
> +		if (!upstream_bridges_match(p2pdma, client)) {
> +			ret = -EXDEV;
> +			goto put_client;
> +		}
> +	}
> +
> +	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
> +	if (!new_item) {
> +		ret = -ENOMEM;
> +		goto put_client;
> +	}
> +
> +	new_item->client = client;
> +	new_item->p2pdma = pci_dev_get(p2pdma);
> +
> +	list_add_tail(&new_item->list, head);
> +
> +	return 0;
> +
> +put_client:
> +	pci_dev_put(client);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);

> + * pci_alloc_p2p_mem - allocate peer-to-peer DMA memory
> + * @pdev:	the device to allocate memory from
> + * @size:	number of bytes to allocate
> + *
> + * Returns the allocated memory or NULL on error.
> + */
> +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
> +{
> +	void *ret;
> +
> +	if (unlikely(!pdev->p2pdma))

Is this a hot path?  I'm not sure it's worth cluttering
non-performance paths with likely/unlikely.

> +		return NULL;
> +
> +	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
> +		return NULL;
> +
> +	ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);

Why the double cast?  Wouldn't "(void *)" be sufficient?

> +	if (unlikely(!ret))
> +		percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
> +
> +/**
> + * pci_free_p2pmem - allocate peer-to-peer DMA memory
> + * @pdev:	the device the memory was allocated from
> + * @addr:	address of the memory that was allocated
> + * @size:	number of bytes that was allocated
> + */
> +void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
> +{
> +	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);

In v4.6-rc1, gen_pool_free() takes "unsigned long addr".  I know this
is based on -rc3; is this something that changed between -rc1 and
-rc3?

> +	percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +}
> +EXPORT_SYMBOL_GPL(pci_free_p2pmem);
> +
> +/**
> + * pci_virt_to_bus - return the PCI bus address for a given virtual
> + *	address obtained with pci_alloc_p2pmem

s/pci_alloc_p2pmem/pci_alloc_p2pmem()/

> + * @pdev:	the device the memory was allocated from
> + * @addr:	address of the memory that was allocated
> + */
> +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
> +{
> +	if (!addr)
> +		return 0;
> +	if (!pdev->p2pdma)
> +		return 0;
> +
> +	/*
> +	 * Note: when we added the memory to the pool we used the PCI
> +	 * bus address as the physical address. So gen_pool_virt_to_phys()
> +	 * actually returns the bus address despite the misleading name.
> +	 */
> +	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
> +
> +/**
> + * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in an scatterlist

s/an scatterlist/a scatterlist/

> + * @pdev:	the device to allocate memory from
> + * @sgl:	the allocated scatterlist
> + * @nents:      the number of SG entries in the list
> + * @length:     number of bytes to allocate
> + *
> + * Returns 0 on success
> + */
> +int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
> +			 unsigned int *nents, u32 length)
> +{
> +	struct scatterlist *sg;
> +	void *addr;
> +
> +	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
> +	if (!sg)
> +		return -ENOMEM;
> +
> +	sg_init_table(sg, 1);
> +
> +	addr = pci_alloc_p2pmem(pdev, length);
> +	if (!addr)
> +		goto out_free_sg;
> +
> +	sg_set_buf(sg, addr, length);
> +	*sgl = sg;
> +	*nents = 1;
> +	return 0;
> +
> +out_free_sg:
> +	kfree(sg);
> +	return -ENOMEM;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
> +
> +/**
> + * pci_p2pmem_free_sgl - free a scatterlist  allocated by pci_p2pmem_alloc_sgl

s/  allocated/ allocated/ (remove extra space)
s/pci_p2pmem_alloc_sgl/pci_p2pmem_alloc_sgl()/

> + * @pdev:	the device to allocate memory from
> + * @sgl:	the allocated scatterlist
> + * @nents:      the number of SG entries in the list
> + */
> +void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
> +			 unsigned int nents)
> +{
> +	struct scatterlist *sg;
> +	int count;
> +
> +	if (!sgl || !nents)
> +		return;
> +
> +	for_each_sg(sgl, sg, nents, count)
> +		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
> +	kfree(sgl);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
> +
> +/**
> + * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
> + *	other devices with pci_p2pmem_find

s/pci_p2pmem_find/pci_p2pmem_find()/

> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> new file mode 100644
> index 000000000000..c0dde3d3aac4
> --- /dev/null
> +++ b/include/linux/pci-p2pdma.h
> @@ -0,0 +1,87 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

Remove license text.

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
@ 2018-03-01 17:37       ` Bjorn Helgaas
  0 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 17:37 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Jérôme Glisse,
	Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

s/peer to peer/peer-to-peer/ to match text below and in spec.

On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.

s/Peer-to-Peer/peer-to-peer/ to match spec and typical usage.

Is there anything about this memory that makes it specifically
intended for peer-to-peer transactions?  I assume the device can't
really tell whether a transaction is from a CPU or a peer.

> A kernel interface is provided so that other subsystems can find and
> allocate chunks of P2P memory as necessary to facilitate transfers
> between two PCI peers. Depending on hardware, this may reduce the
> bandwidth of the transfer but would significantly reduce pressure
> on system memory. This may be desirable in many cases: for example a
> system could be designed with a small CPU connected to a PCI switch by a
> small number of lanes which would maximize the number of lanes available
> to connect to NVME devices.

"A kernel interface is provided" could mean "the kernel provides an
interface", independent of anything this patch does, but I think you
mean *this patch specifically* adds the interface.

Maybe something like:

  Add interfaces for other subsystems to find and allocate ...:

    int pci_p2pdma_add_client();
    struct pci_dev *pci_p2pmem_find();
    void *pci_alloc_p2pmem();

  This may reduce bandwidth of the transfer but significantly reduce
  ...

BTW, maybe there could be some kind of guide for device driver writers
in Documentation/PCI/?

> The interface requires a user driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary. The ACS bits on the
> downstream switch port will be managed for all the registered clients.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI switch. This is because
> using P2P transactions through the PCI root complex can have performance
> limitations or, worse, might not work at all. Finding out how well a
> particular RC supports P2P transfers is non-trivial. Additionally, the
> benefits of P2P transfers that go through the RC is limited to only
> reducing DRAM usage.

I think it would be clearer and sufficient to simply say that we have
no way to know whether peer-to-peer routing between PCIe Root Ports is
supported (PCIe r4.0, sec 1.3.1).

The fact that you use the PCIe term "switch" suggests that a PCIe
Switch is required, but isn't it sufficient for the peers to be below
the same "PCI bridge", which would include PCIe Root Ports, PCIe
Switch Downstream Ports, and conventional PCI bridges?

The comments at get_upstream_bridge_port() suggest that this isn't
enough, and the peers actually do have to be below the same PCIe
Switch, but I don't know why.

> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..840831418cbd 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,22 @@ config PCI_PASID
>  
>  	  If unsure, say N.
>  
> +config PCI_P2PDMA
> +	bool "PCI Peer to Peer transfer support"
> +	depends on ZONE_DEVICE
> +	select GENERIC_ALLOCATOR
> +	help
> +	  Enableѕ drivers to do PCI peer to peer transactions to and from

s/peer to peer/peer-to-peer/ (in bool and help text)

> +	  BARs that are exposed in other devices that are the part of
> +	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +	  specification to work (ie. anything below a single PCI bridge).
> +
> +	  Many PCIe root complexes do not support P2P transactions and
> +	  it's hard to tell which support it with good performance, so
> +	  at this time you will need a PCIe switch.

Until we have a way to figure out which of them support P2P,
performance is a don't-care.

> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> new file mode 100644
> index 000000000000..ec0a6cb9e500
> --- /dev/null
> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,568 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PCI Peer 2 Peer DMA support.

s/Peer 2 Peer/peer-to-peer/

> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

I think the SPDX tag is meant to remove the need for including the
license text, so you should be able to remove this.  Oh, and one
trivial and annoying nit, I think for the SPDX tag, we're supposed to
use "//" in .c files and "/* */" in .h files.

> + * pci_p2pdma_add_resource - add memory for use as p2p memory
> + * @pci: the device to add the memory to

s/@pci/@pdev/

> + * @bar: PCI BAR to add
> + * @size: size of the memory to add, may be zero to use the whole BAR
> + * @offset: offset into the PCI BAR
> + *
> + * The memory will be given ZONE_DEVICE struct pages so that it may
> + * be used with any dma request.

s/dma/DMA/

> +int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> +			    u64 offset)
> +{
> +	struct dev_pagemap *pgmap;
> +	void *addr;
> +	int error;
> +
> +	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
> +		return -EINVAL;
> +
> +	if (offset >= pci_resource_len(pdev, bar))
> +		return -EINVAL;
> +
> +	if (!size)
> +		size = pci_resource_len(pdev, bar) - offset;
> +
> +	if (size + offset > pci_resource_len(pdev, bar))
> +		return -EINVAL;
> +
> +	if (!pdev->p2pdma) {
> +		error = pci_p2pdma_setup(pdev);
> +		if (error)
> +			return error;
> +	}
> +
> +	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> +	if (!pgmap)
> +		return -ENOMEM;
> +
> +	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
> +	pgmap->res.end = pgmap->res.start + size - 1;
> +	pgmap->res.flags = pci_resource_flags(pdev, bar);
> +	pgmap->ref = &pdev->p2pdma->devmap_ref;
> +	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> +
> +	addr = devm_memremap_pages(&pdev->dev, pgmap);
> +	if (IS_ERR(addr))

Free pgmap here?  And in the other error case below?  Or maybe this
happens via the devm_* magic?  If so, when would that actually happen?
Would pgmap be effectively leaked until the pdev is destroyed?

> +		return PTR_ERR(addr);
> +
> +	error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
> +			pci_bus_address(pdev, bar) + offset,
> +			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
> +	if (error)
> +		return error;
> +
> +	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
> +					  &pdev->p2pdma->devmap_ref);
> +	if (error)
> +		return error;
> +
> +	dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
> +		 &pgmap->res);

s/dev_info/pci_info/ (also similar usages below, except for the one or
two cases where you don't have a pci_dev).

> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);

> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge:
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.

s/pci_upstream_bridge/pci_upstream_bridge()/

This whole thing is confusing to me.  Why do you want to reject peers
directly connected to the same root port?  Why do you require the same
Switch Upstream Port?  You don't exclude conventional PCI, but it
looks like you would require peers to share *two* upstream PCI-to-PCI
bridges?  I would think a single shared upstream bridge (conventional,
PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?

Apologies if you've answered this before; maybe just include a little
explanation here so I don't ask again :)

> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +{
> +	struct pci_dev *up1, *up2;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> +	if (!up1)
> +		return NULL;
> +
> +	up2 = pci_dev_get(pci_upstream_bridge(up1));
> +	pci_dev_put(up1);
> +
> +	return up2;
> +}
> +
> +static bool __upstream_bridges_match(struct pci_dev *upstream,
> +				     struct pci_dev *client)
> +{
> +	struct pci_dev *dma_up;
> +	bool ret = true;
> +
> +	dma_up = get_upstream_bridge_port(client);
> +
> +	if (!dma_up) {
> +		dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
> +		ret = false;
> +		goto out;
> +	}
> +
> +	if (upstream != dma_up) {
> +		dev_dbg(&client->dev,
> +			"does not reside on the same upstream bridge\n");
> +		ret = false;
> +		goto out;
> +	}
> +
> +out:
> +	pci_dev_put(dma_up);
> +	return ret;
> +}
> +
> +static bool upstream_bridges_match(struct pci_dev *pdev,
> +				   struct pci_dev *client)
> +{
> +	struct pci_dev *upstream;
> +	bool ret;
> +
> +	upstream = get_upstream_bridge_port(pdev);
> +	if (!upstream) {
> +		dev_warn(&pdev->dev, "not behind a PCI bridge\n");
> +		return false;
> +	}
> +
> +	ret = __upstream_bridges_match(upstream, client);
> +
> +	pci_dev_put(upstream);
> +
> +	return ret;
> +}
> +
> +struct pci_p2pdma_client {
> +	struct list_head list;
> +	struct pci_dev *client;
> +	struct pci_dev *p2pdma;

Maybe call this "peer" or something instead of "p2pdma", since p2pdma
is also used for struct pci_p2pdma things?

> + * pci_p2pdma_add_client - allocate a new element in a client device list
> + * @head: list head of p2pdma clients
> + * @dev: device to add to the list
> + *
> + * This adds @dev to a list of clients used by a p2pdma device.
> + * This list should be passed to p2pmem_find(). Once p2pmem_find() has
> + * been called successfully, the list will be bound to a specific p2pdma
> + * device and new clients can only be added to the list if they are
> + * supported by that p2pdma device.
> + *
> + * The caller is expected to have a lock which protects @head as necessary
> + * so that none of the pci_p2p functions can be called concurrently
> + * on that list.
> + *
> + * Returns 0 if the client was successfully added.
> + */
> +int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
> +{
> +	struct pci_p2pdma_client *item, *new_item;
> +	struct pci_dev *p2pdma = NULL;
> +	struct pci_dev *client;
> +	int ret;
> +
> +	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
> +		dev_warn(dev,
> +			 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
> +		return -ENODEV;
> +	}
> +
> +
> +	client = find_parent_pci_dev(dev);

Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
sort of weird that callers supply a non-PCI device and then we look up
a PCI device here.  I assume you have some reason for this; if you
added a writeup in Documentation/PCI, that would be a good place to
elaborate on that, maybe with a one-line clue here.

> +	if (!client) {
> +		dev_warn(dev,
> +			 "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
> +		return -ENODEV;
> +	}
> +
> +	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
> +	if (item && item->p2pdma) {
> +		p2pdma = item->p2pdma;
> +
> +		if (!upstream_bridges_match(p2pdma, client)) {
> +			ret = -EXDEV;
> +			goto put_client;
> +		}
> +	}
> +
> +	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
> +	if (!new_item) {
> +		ret = -ENOMEM;
> +		goto put_client;
> +	}
> +
> +	new_item->client = client;
> +	new_item->p2pdma = pci_dev_get(p2pdma);
> +
> +	list_add_tail(&new_item->list, head);
> +
> +	return 0;
> +
> +put_client:
> +	pci_dev_put(client);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);

> + * pci_alloc_p2p_mem - allocate peer-to-peer DMA memory
> + * @pdev:	the device to allocate memory from
> + * @size:	number of bytes to allocate
> + *
> + * Returns the allocated memory or NULL on error.
> + */
> +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
> +{
> +	void *ret;
> +
> +	if (unlikely(!pdev->p2pdma))

Is this a hot path?  I'm not sure it's worth cluttering
non-performance paths with likely/unlikely.

> +		return NULL;
> +
> +	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
> +		return NULL;
> +
> +	ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);

Why the double cast?  Wouldn't "(void *)" be sufficient?

> +	if (unlikely(!ret))
> +		percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
> +
> +/**
> + * pci_free_p2pmem - allocate peer-to-peer DMA memory
> + * @pdev:	the device the memory was allocated from
> + * @addr:	address of the memory that was allocated
> + * @size:	number of bytes that was allocated
> + */
> +void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
> +{
> +	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);

In v4.6-rc1, gen_pool_free() takes "unsigned long addr".  I know this
is based on -rc3; is this something that changed between -rc1 and
-rc3?

> +	percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +}
> +EXPORT_SYMBOL_GPL(pci_free_p2pmem);
> +
> +/**
> + * pci_virt_to_bus - return the PCI bus address for a given virtual
> + *	address obtained with pci_alloc_p2pmem

s/pci_alloc_p2pmem/pci_alloc_p2pmem()/

> + * @pdev:	the device the memory was allocated from
> + * @addr:	address of the memory that was allocated
> + */
> +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
> +{
> +	if (!addr)
> +		return 0;
> +	if (!pdev->p2pdma)
> +		return 0;
> +
> +	/*
> +	 * Note: when we added the memory to the pool we used the PCI
> +	 * bus address as the physical address. So gen_pool_virt_to_phys()
> +	 * actually returns the bus address despite the misleading name.
> +	 */
> +	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
> +
> +/**
> + * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in an scatterlist

s/an scatterlist/a scatterlist/

> + * @pdev:	the device to allocate memory from
> + * @sgl:	the allocated scatterlist
> + * @nents:      the number of SG entries in the list
> + * @length:     number of bytes to allocate
> + *
> + * Returns 0 on success
> + */
> +int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
> +			 unsigned int *nents, u32 length)
> +{
> +	struct scatterlist *sg;
> +	void *addr;
> +
> +	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
> +	if (!sg)
> +		return -ENOMEM;
> +
> +	sg_init_table(sg, 1);
> +
> +	addr = pci_alloc_p2pmem(pdev, length);
> +	if (!addr)
> +		goto out_free_sg;
> +
> +	sg_set_buf(sg, addr, length);
> +	*sgl = sg;
> +	*nents = 1;
> +	return 0;
> +
> +out_free_sg:
> +	kfree(sg);
> +	return -ENOMEM;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
> +
> +/**
> + * pci_p2pmem_free_sgl - free a scatterlist  allocated by pci_p2pmem_alloc_sgl

s/  allocated/ allocated/ (remove extra space)
s/pci_p2pmem_alloc_sgl/pci_p2pmem_alloc_sgl()/

> + * @pdev:	the device to allocate memory from
> + * @sgl:	the allocated scatterlist
> + * @nents:      the number of SG entries in the list
> + */
> +void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
> +			 unsigned int nents)
> +{
> +	struct scatterlist *sg;
> +	int count;
> +
> +	if (!sgl || !nents)
> +		return;
> +
> +	for_each_sg(sgl, sg, nents, count)
> +		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
> +	kfree(sgl);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
> +
> +/**
> + * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
> + *	other devices with pci_p2pmem_find

s/pci_p2pmem_find/pci_p2pmem_find()/

> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> new file mode 100644
> index 000000000000..c0dde3d3aac4
> --- /dev/null
> +++ b/include/linux/pci-p2pdma.h
> @@ -0,0 +1,87 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

Remove license text.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 504+ messages in thread

* [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
@ 2018-03-01 17:37       ` Bjorn Helgaas
  0 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 17:37 UTC (permalink / raw)


s/peer to peer/peer-to-peer/ to match text below and in spec.

On Wed, Feb 28, 2018@04:39:57PM -0700, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in Peer-to-Peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.

s/Peer-to-Peer/peer-to-peer/ to match spec and typical usage.

Is there anything about this memory that makes it specifically
intended for peer-to-peer transactions?  I assume the device can't
really tell whether a transaction is from a CPU or a peer.

> A kernel interface is provided so that other subsystems can find and
> allocate chunks of P2P memory as necessary to facilitate transfers
> between two PCI peers. Depending on hardware, this may reduce the
> bandwidth of the transfer but would significantly reduce pressure
> on system memory. This may be desirable in many cases: for example a
> system could be designed with a small CPU connected to a PCI switch by a
> small number of lanes which would maximize the number of lanes available
> to connect to NVME devices.

"A kernel interface is provided" could mean "the kernel provides an
interface", independent of anything this patch does, but I think you
mean *this patch specifically* adds the interface.

Maybe something like:

  Add interfaces for other subsystems to find and allocate ...:

    int pci_p2pdma_add_client();
    struct pci_dev *pci_p2pmem_find();
    void *pci_alloc_p2pmem();

  This may reduce bandwidth of the transfer but significantly reduce
  ...

BTW, maybe there could be some kind of guide for device driver writers
in Documentation/PCI/?

> The interface requires a user driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary. The ACS bits on the
> downstream switch port will be managed for all the registered clients.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI switch. This is because
> using P2P transactions through the PCI root complex can have performance
> limitations or, worse, might not work at all. Finding out how well a
> particular RC supports P2P transfers is non-trivial. Additionally, the
> benefits of P2P transfers that go through the RC is limited to only
> reducing DRAM usage.

I think it would be clearer and sufficient to simply say that we have
no way to know whether peer-to-peer routing between PCIe Root Ports is
supported (PCIe r4.0, sec 1.3.1).

The fact that you use the PCIe term "switch" suggests that a PCIe
Switch is required, but isn't it sufficient for the peers to be below
the same "PCI bridge", which would include PCIe Root Ports, PCIe
Switch Downstream Ports, and conventional PCI bridges?

The comments at get_upstream_bridge_port() suggest that this isn't
enough, and the peers actually do have to be below the same PCIe
Switch, but I don't know why.

> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..840831418cbd 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,22 @@ config PCI_PASID
>  
>  	  If unsure, say N.
>  
> +config PCI_P2PDMA
> +	bool "PCI Peer to Peer transfer support"
> +	depends on ZONE_DEVICE
> +	select GENERIC_ALLOCATOR
> +	help
> +	  Enable? drivers to do PCI peer to peer transactions to and from

s/peer to peer/peer-to-peer/ (in bool and help text)

> +	  BARs that are exposed in other devices that are the part of
> +	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +	  specification to work (ie. anything below a single PCI bridge).
> +
> +	  Many PCIe root complexes do not support P2P transactions and
> +	  it's hard to tell which support it with good performance, so
> +	  at this time you will need a PCIe switch.

Until we have a way to figure out which of them support P2P,
performance is a don't-care.

> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> new file mode 100644
> index 000000000000..ec0a6cb9e500
> --- /dev/null
> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,568 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PCI Peer 2 Peer DMA support.

s/Peer 2 Peer/peer-to-peer/

> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

I think the SPDX tag is meant to remove the need for including the
license text, so you should be able to remove this.  Oh, and one
trivial and annoying nit, I think for the SPDX tag, we're supposed to
use "//" in .c files and "/* */" in .h files.

> + * pci_p2pdma_add_resource - add memory for use as p2p memory
> + * @pci: the device to add the memory to

s/@pci/@pdev/

> + * @bar: PCI BAR to add
> + * @size: size of the memory to add, may be zero to use the whole BAR
> + * @offset: offset into the PCI BAR
> + *
> + * The memory will be given ZONE_DEVICE struct pages so that it may
> + * be used with any dma request.

s/dma/DMA/

> +int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> +			    u64 offset)
> +{
> +	struct dev_pagemap *pgmap;
> +	void *addr;
> +	int error;
> +
> +	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
> +		return -EINVAL;
> +
> +	if (offset >= pci_resource_len(pdev, bar))
> +		return -EINVAL;
> +
> +	if (!size)
> +		size = pci_resource_len(pdev, bar) - offset;
> +
> +	if (size + offset > pci_resource_len(pdev, bar))
> +		return -EINVAL;
> +
> +	if (!pdev->p2pdma) {
> +		error = pci_p2pdma_setup(pdev);
> +		if (error)
> +			return error;
> +	}
> +
> +	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
> +	if (!pgmap)
> +		return -ENOMEM;
> +
> +	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
> +	pgmap->res.end = pgmap->res.start + size - 1;
> +	pgmap->res.flags = pci_resource_flags(pdev, bar);
> +	pgmap->ref = &pdev->p2pdma->devmap_ref;
> +	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> +
> +	addr = devm_memremap_pages(&pdev->dev, pgmap);
> +	if (IS_ERR(addr))

Free pgmap here?  And in the other error case below?  Or maybe this
happens via the devm_* magic?  If so, when would that actually happen?
Would pgmap be effectively leaked until the pdev is destroyed?
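
If it really is leaked until then, maybe an explicit error path is
warranted; something like this untested sketch:

	addr = devm_memremap_pages(&pdev->dev, pgmap);
	if (IS_ERR(addr)) {
		devm_kfree(&pdev->dev, pgmap);
		return PTR_ERR(addr);
	}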

> +		return PTR_ERR(addr);
> +
> +	error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
> +			pci_bus_address(pdev, bar) + offset,
> +			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
> +	if (error)
> +		return error;
> +
> +	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
> +					  &pdev->p2pdma->devmap_ref);
> +	if (error)
> +		return error;
> +
> +	dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
> +		 &pgmap->res);

s/dev_info/pci_info/ (also similar usages below, except for the one or
two cases where you don't have a pci_dev).

> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);

> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge:
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.

s/pci_upstream_bridge/pci_upstream_bridge()/

This whole thing is confusing to me.  Why do you want to reject peers
directly connected to the same root port?  Why do you require the same
Switch Upstream Port?  You don't exclude conventional PCI, but it
looks like you would require peers to share *two* upstream PCI-to-PCI
bridges?  I would think a single shared upstream bridge (conventional,
PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?

Apologies if you've answered this before; maybe just include a little
explanation here so I don't ask again :)
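
For comparison, the check I would have expected is roughly this
(untested sketch, ignoring reference counting):

	static bool share_upstream_bridge(struct pci_dev *a, struct pci_dev *b)
	{
		struct pci_dev *up = pci_upstream_bridge(a);

		return up && up == pci_upstream_bridge(b);
	}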

> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +{
> +	struct pci_dev *up1, *up2;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	up1 = pci_dev_get(pci_upstream_bridge(pdev));
> +	if (!up1)
> +		return NULL;
> +
> +	up2 = pci_dev_get(pci_upstream_bridge(up1));
> +	pci_dev_put(up1);
> +
> +	return up2;
> +}
> +
> +static bool __upstream_bridges_match(struct pci_dev *upstream,
> +				     struct pci_dev *client)
> +{
> +	struct pci_dev *dma_up;
> +	bool ret = true;
> +
> +	dma_up = get_upstream_bridge_port(client);
> +
> +	if (!dma_up) {
> +		dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
> +		ret = false;
> +		goto out;
> +	}
> +
> +	if (upstream != dma_up) {
> +		dev_dbg(&client->dev,
> +			"does not reside on the same upstream bridge\n");
> +		ret = false;
> +		goto out;
> +	}
> +
> +out:
> +	pci_dev_put(dma_up);
> +	return ret;
> +}
> +
> +static bool upstream_bridges_match(struct pci_dev *pdev,
> +				   struct pci_dev *client)
> +{
> +	struct pci_dev *upstream;
> +	bool ret;
> +
> +	upstream = get_upstream_bridge_port(pdev);
> +	if (!upstream) {
> +		dev_warn(&pdev->dev, "not behind a PCI bridge\n");
> +		return false;
> +	}
> +
> +	ret = __upstream_bridges_match(upstream, client);
> +
> +	pci_dev_put(upstream);
> +
> +	return ret;
> +}
> +
> +struct pci_p2pdma_client {
> +	struct list_head list;
> +	struct pci_dev *client;
> +	struct pci_dev *p2pdma;

Maybe call this "peer" or something instead of "p2pdma", since p2pdma
is also used for struct pci_p2pdma things?

> + * pci_p2pdma_add_client - allocate a new element in a client device list
> + * @head: list head of p2pdma clients
> + * @dev: device to add to the list
> + *
> + * This adds @dev to a list of clients used by a p2pdma device.
> + * This list should be passed to p2pmem_find(). Once p2pmem_find() has
> + * been called successfully, the list will be bound to a specific p2pdma
> + * device and new clients can only be added to the list if they are
> + * supported by that p2pdma device.
> + *
> + * The caller is expected to have a lock which protects @head as necessary
> + * so that none of the pci_p2p functions can be called concurrently
> + * on that list.
> + *
> + * Returns 0 if the client was successfully added.
> + */
> +int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
> +{
> +	struct pci_p2pdma_client *item, *new_item;
> +	struct pci_dev *p2pdma = NULL;
> +	struct pci_dev *client;
> +	int ret;
> +
> +	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
> +		dev_warn(dev,
> +			 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
> +		return -ENODEV;
> +	}
> +
> +
> +	client = find_parent_pci_dev(dev);

Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
sort of weird that callers supply a non-PCI device and then we look up
a PCI device here.  I assume you have some reason for this; if you
added a writeup in Documentation/PCI, that would be a good place to
elaborate on that, maybe with a one-line clue here.
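
(I'm assuming find_parent_pci_dev() just walks up the device tree,
roughly like the sketch below; if so, a one-line comment to that
effect would help.)

	/* illustrative sketch only, not necessarily the actual helper */
	while (dev) {
		if (dev_is_pci(dev))
			return pci_dev_get(to_pci_dev(dev));
		dev = dev->parent;
	}
	return NULL;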

> +	if (!client) {
> +		dev_warn(dev,
> +			 "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
> +		return -ENODEV;
> +	}
> +
> +	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
> +	if (item && item->p2pdma) {
> +		p2pdma = item->p2pdma;
> +
> +		if (!upstream_bridges_match(p2pdma, client)) {
> +			ret = -EXDEV;
> +			goto put_client;
> +		}
> +	}
> +
> +	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
> +	if (!new_item) {
> +		ret = -ENOMEM;
> +		goto put_client;
> +	}
> +
> +	new_item->client = client;
> +	new_item->p2pdma = pci_dev_get(p2pdma);
> +
> +	list_add_tail(&new_item->list, head);
> +
> +	return 0;
> +
> +put_client:
> +	pci_dev_put(client);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);

> + * pci_alloc_p2pmem - allocate peer-to-peer DMA memory
> + * @pdev:	the device to allocate memory from
> + * @size:	number of bytes to allocate
> + *
> + * Returns the allocated memory or NULL on error.
> + */
> +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
> +{
> +	void *ret;
> +
> +	if (unlikely(!pdev->p2pdma))

Is this a hot path?  I'm not sure it's worth cluttering
non-performance paths with likely/unlikely.

> +		return NULL;
> +
> +	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
> +		return NULL;
> +
> +	ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);

Why the double cast?  Wouldn't "(void *)" be sufficient?
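
gen_pool_alloc() already returns unsigned long, so presumably:

	ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);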

> +	if (unlikely(!ret))
> +		percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
> +
> +/**
> + * pci_free_p2pmem - free peer-to-peer DMA memory
> + * @pdev:	the device the memory was allocated from
> + * @addr:	address of the memory that was allocated
> + * @size:	number of bytes that was allocated
> + */
> +void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
> +{
> +	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);

In v4.16-rc1, gen_pool_free() takes "unsigned long addr".  I know this
is based on -rc3; is this something that changed between -rc1 and
-rc3?
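
For reference, the declaration I'm looking at is:

	extern void gen_pool_free(struct gen_pool *pool, unsigned long addr,
				  size_t size);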

> +	percpu_ref_put(&pdev->p2pdma->devmap_ref);
> +}
> +EXPORT_SYMBOL_GPL(pci_free_p2pmem);
> +
> +/**
> + * pci_virt_to_bus - return the PCI bus address for a given virtual
> + *	address obtained with pci_alloc_p2pmem

s/pci_alloc_p2pmem/pci_alloc_p2pmem()/

> + * @pdev:	the device the memory was allocated from
> + * @addr:	address of the memory that was allocated
> + */
> +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
> +{
> +	if (!addr)
> +		return 0;
> +	if (!pdev->p2pdma)
> +		return 0;
> +
> +	/*
> +	 * Note: when we added the memory to the pool we used the PCI
> +	 * bus address as the physical address. So gen_pool_virt_to_phys()
> +	 * actually returns the bus address despite the misleading name.
> +	 */
> +	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
> +
> +/**
> + * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in an scatterlist

s/an scatterlist/a scatterlist/

> + * @pdev:	the device to allocate memory from
> + * @sgl:	the allocated scatterlist
> + * @nents:      the number of SG entries in the list
> + * @length:     number of bytes to allocate
> + *
> + * Returns 0 on success
> + */
> +int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
> +			 unsigned int *nents, u32 length)
> +{
> +	struct scatterlist *sg;
> +	void *addr;
> +
> +	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
> +	if (!sg)
> +		return -ENOMEM;
> +
> +	sg_init_table(sg, 1);
> +
> +	addr = pci_alloc_p2pmem(pdev, length);
> +	if (!addr)
> +		goto out_free_sg;
> +
> +	sg_set_buf(sg, addr, length);
> +	*sgl = sg;
> +	*nents = 1;
> +	return 0;
> +
> +out_free_sg:
> +	kfree(sg);
> +	return -ENOMEM;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
> +
> +/**
> + * pci_p2pmem_free_sgl - free a scatterlist  allocated by pci_p2pmem_alloc_sgl

s/  allocated/ allocated/ (remove extra space)
s/pci_p2pmem_alloc_sgl/pci_p2pmem_alloc_sgl()/

> + * @pdev:	the device to allocate memory from
> + * @sgl:	the allocated scatterlist
> + * @nents:      the number of SG entries in the list
> + */
> +void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
> +			 unsigned int nents)
> +{
> +	struct scatterlist *sg;
> +	int count;
> +
> +	if (!sgl || !nents)
> +		return;
> +
> +	for_each_sg(sgl, sg, nents, count)
> +		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
> +	kfree(sgl);
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
> +
> +/**
> + * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
> + *	other devices with pci_p2pmem_find

s/pci_p2pmem_find/pci_p2pmem_find()/

> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> new file mode 100644
> index 000000000000..c0dde3d3aac4
> --- /dev/null
> +++ b/include/linux/pci-p2pdma.h
> @@ -0,0 +1,87 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

Remove license text.

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 11:03     ` Sagi Grimberg
@ 2018-03-01 17:40       ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 17:40 UTC (permalink / raw)
  To: Sagi Grimberg, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Steve Wise, Alex Williamson,
	Keith Busch, Jérôme Glisse, Jason Gunthorpe,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 04:03 AM, Sagi Grimberg wrote:
> Can you describe what would be the plan to have it when these devices
> do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
> and not from nvmet_ctrl. Then, when cmb capable devices come along, the
> ns can prefer to use its own cmb instead of locating a p2p_dev device?

The patchset already supports CMB drives. That's essentially what patch 
7 is for. We change the nvme-pci driver to use the p2pmem code to 
register and manage the CMB memory. After that, it is simply available 
to the nvmet code. We have already been using this with a couple of
prototype CMB drives.

>> +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
>> +                   struct nvmet_ns *ns)
>> +{
>> +    int ret;
>> +
>> +    if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
>> +        pr_err("peer-to-peer DMA is not supported by %s\n",
>> +               ns->device_path);
>> +        return -EINVAL;
> 
> I'd say that just skip it instead of failing it, in theory one can
> connect nvme devices via p2p mem and expose other devices in the
> same subsystem. The message would be pr_debug to reduce chattiness.

No, it's actually a bit more complicated than that. There are a
couple of cases:

1) The user adds a namespace but there hasn't been a connection and no 
p2pmem has been selected. In this case the add_client function is never 
called and the user can add whatever namespace they like.

2) When the first connect happens, nvmet_setup_p2pmem() will call
add_client() for each namespace and RDMA device. If any of these calls
fail, it does not use a P2P device and falls back, as you'd like, to
regular memory (see the rough sketch after case 3).

3) The user adds a namespace after a port is in use and a compatible
P2P device has already been found. In this case, if the new namespace
is not compatible with the P2P device in use, adding it fails. The
only alternative here is to tear everything down, ensure there are no
P2P-enabled buffers open and start using regular memory again... That
is very difficult.
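
A rough sketch of that connect-time flow (names and error handling
are illustrative, not the actual patch; the real code also adds the
RDMA device as a client and drops any clients already added on
failure):

	static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl)
	{
		struct nvmet_ns *ns;

		list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces,
					dev_link) {
			if (pci_p2pdma_add_client(&ctrl->p2p_clients,
						  nvmet_ns_dev(ns)))
				return;	/* fall back to regular memory */
		}

		ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
	}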

I also disagree that these messages should be pr_debug. If a user
enables P2P memory but the system chooses not to use it, they need to
know why so they can make the necessary adjustments. If the system
doesn't at least print something to dmesg, they are in the dark.

>> +    }
>> +
>> +    ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
>> +    if (ret)
>> +        pr_err("failed to add peer-to-peer DMA client %s: %d\n",
>> +               ns->device_path, ret);
>> +
>> +    return ret;
>> +}
>> +
>>   int nvmet_ns_enable(struct nvmet_ns *ns)
>>   {
>>       struct nvmet_subsys *subsys = ns->subsys;
>> @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
>>       if (ret)
>>           goto out_blkdev_put;
>> +    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>> +        if (ctrl->p2p_dev) {
>> +            ret = nvmet_p2pdma_add_client(ctrl, ns);
>> +            if (ret)
>> +                goto out_remove_clients;
> 
> Is this really a fatal failure given that we fall-back to main
> memory? Why not continue with main memory (and warn at best)?

See above. It's fatal because we are already using p2p memory and we 
can't easily tear that all down when a user adds a new namespace.

>> +    ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
> 
> This is the first p2p_dev found right? What happens if I have more than
> a single p2p device? In theory I'd have more p2p memory I can use. Have
> you considered making pci_p2pmem_find return the least used suitable
> device?

Yes, it currently returns the first found. I imagine a bunch of 
improvements could be made to it in future work. However, I'd probably 
start with finding the nearest p2p device and then falling back to the 
least used if that's necessary. At this time, though, it's not clear
how complicated these systems will get and what's actually needed here.
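
(For what it's worth, a least-used fallback could be fairly small.
Untested sketch, assuming a surrounding loop over the compatible
devices:

	/* keep whichever compatible device has the most free p2pmem */
	if (!best || gen_pool_avail(pdev->p2pdma->pool) >
		     gen_pool_avail(best->p2pdma->pool)) {
		pci_dev_put(best);
		best = pci_dev_get(pdev);
	}

Whether that heuristic is the right one is a separate question.)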

>> @@ -68,6 +69,7 @@ struct nvmet_rdma_rsp {
>>       u8            n_rdma;
>>       u32            flags;
>>       u32            invalidate_rkey;
>> +    struct pci_dev        *p2p_dev;
> 
> Given that p2p_client is in nvmet_req, I think it make sense
> that the p2p_dev itself would also live there. In theory, nothing
> is preventing FC from using it as well.

Fair point. I can look at moving it in v3.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 02/10] PCI/P2PDMA: Add sysfs group to display p2pmem stats
  2018-02-28 23:39   ` Logan Gunthorpe
@ 2018-03-01 17:44     ` Bjorn Helgaas
  -1 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 17:44 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Wed, Feb 28, 2018 at 04:39:58PM -0700, Logan Gunthorpe wrote:
> Attributes display the total amount of P2P memory, the amount available
> and whether it is published or not.

Can you add enough text here to make the body of the changelog
complete in itself?  That might mean just repeating the subject, which
is fine.
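
For example, something like:

	Add a "p2pmem" sysfs group with "size", "available" and
	"published" attributes so userspace can see how much
	peer-to-peer memory a device provides, how much of it is still
	unallocated, and whether it has been published for use by
	other parts of the kernel.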

> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci | 25 +++++++++++++++++
>  drivers/pci/p2pdma.c                    | 50 +++++++++++++++++++++++++++++++++
>  2 files changed, 75 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 44d4b2be92fd..f5656dae21be 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -323,3 +323,28 @@ Description:
>  
>  		This is similar to /sys/bus/pci/drivers_autoprobe, but
>  		affects only the VFs associated with a specific PF.
> +
> +What:		/sys/bus/pci/devices/.../p2pmem/available
> +Date:		November 2017
> +Contact:	Logan Gunthorpe <logang@deltatee.com>
> +Description:
> +		If the device has any Peer-to-Peer memory registered, this
> +	        file contains the amount of memory that has not been
> +		allocated (in decimal).
> +
> +What:		/sys/bus/pci/devices/.../p2pmem/size
> +Date:		November 2017
> +Contact:	Logan Gunthorpe <logang@deltatee.com>
> +Description:
> +		If the device has any Peer-to-Peer memory registered, this
> +	        file contains the total ammount of memory that the device

s/ammount/amount/

> +		provides (in decimal).
> +
> +What:		/sys/bus/pci/devices/.../p2pmem/published
> +Date:		November 2017
> +Contact:	Logan Gunthorpe <logang@deltatee.com>
> +Description:
> +		If the device has any Peer-to-Peer memory registered, this
> +	        file contains a '1' if the memory has been published for
> +		use inside the kernel or a '0' if it is only intended
> +		for use within the driver that published it.
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ec0a6cb9e500..a57df78f6a32 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -29,6 +29,53 @@ struct pci_p2pdma {
>  	bool published;
>  };
>  
> +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> +			 char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	size_t size = 0;
> +
> +	if (pdev->p2pdma->pool)
> +		size = gen_pool_size(pdev->p2pdma->pool);
> +
> +	return snprintf(buf, PAGE_SIZE, "%zd\n", size);
> +}
> +static DEVICE_ATTR_RO(size);
> +
> +static ssize_t available_show(struct device *dev, struct device_attribute *attr,
> +			      char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	size_t avail = 0;
> +
> +	if (pdev->p2pdma->pool)
> +		avail = gen_pool_avail(pdev->p2pdma->pool);
> +
> +	return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
> +}
> +static DEVICE_ATTR_RO(available);
> +
> +static ssize_t published_show(struct device *dev, struct device_attribute *attr,
> +			      char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	return snprintf(buf, PAGE_SIZE, "%d\n", pdev->p2pdma->published);
> +}
> +static DEVICE_ATTR_RO(published);
> +
> +static struct attribute *p2pmem_attrs[] = {
> +	&dev_attr_size.attr,
> +	&dev_attr_available.attr,
> +	&dev_attr_published.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group p2pmem_group = {
> +	.attrs = p2pmem_attrs,
> +	.name = "p2pmem",
> +};
> +
>  static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
>  {
>  	struct pci_p2pdma *p2p =
> @@ -55,6 +102,7 @@ static void pci_p2pdma_release(void *data)
>  	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
>  
>  	gen_pool_destroy(pdev->p2pdma->pool);
> +	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
>  	pdev->p2pdma = NULL;
>  }
>  
> @@ -83,6 +131,8 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
>  	if (error)
>  		goto out_pool_destroy;
>  
> +	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
> +
>  	pdev->p2pdma = p2p;

I think these two statements are out of order, since the attributes
dereference pdev->p2pdma.  And it looks like you set "error"
unnecessarily, since you return immediately without looking at it.
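
I.e., maybe something like this (untested):

	pdev->p2pdma = p2p;

	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
	if (error) {
		pdev->p2pdma = NULL;
		goto out_pool_destroy;
	}

	return 0;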

>  	return 0;
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 03/10] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
  2018-02-28 23:39     ` Logan Gunthorpe
@ 2018-03-01 17:49       ` Bjorn Helgaas
  -1 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 17:49 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Wed, Feb 28, 2018 at 04:39:59PM -0700, Logan Gunthorpe wrote:
> The DMA address used when mapping PCI P2P memory must be the PCI bus
> address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
> addresses when using P2P memory.
> 
> For this, we assume that an SGL passed to these functions contain all
> p2p memory or no p2p memory.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/p2pdma.c       | 51 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/memremap.h   |  1 +
>  include/linux/pci-p2pdma.h | 13 ++++++++++++
>  3 files changed, 65 insertions(+)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index a57df78f6a32..4e1c81f64b29 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -188,6 +188,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  	pgmap->res.flags = pci_resource_flags(pdev, bar);
>  	pgmap->ref = &pdev->p2pdma->devmap_ref;
>  	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> +	pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
> +		pci_resource_start(pdev, bar);
>  
>  	addr = devm_memremap_pages(&pdev->dev, pgmap);
>  	if (IS_ERR(addr))
> @@ -616,3 +618,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
>  	pdev->p2pdma->published = publish;
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
> +
> +/*
> + * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA
> + * @dev:        device doing the DMA request
> + * @sg:		scatter list to map
> + * @nents:	elements in the scatterlist
> + * @dir:        DMA direction

Can you fix these so the descriptions all have a single space after
the "@dev:", which seems to be the existing convention in this file?
The indentation looks pretty random now.
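
I.e.:

 * @dev: device doing the DMA request
 * @sg: scatter list to map
 * @nents: elements in the scatterlist
 * @dir: DMA direction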

> + *
> + * Returns the number of SG entries mapped
> + */
> +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> +		      enum dma_data_direction dir)

Same question as before about why the mixture of "pci_*" interfaces
that take "struct device *" parameters.

> +{
> +	struct dev_pagemap *pgmap;
> +	struct scatterlist *s;
> +	phys_addr_t paddr;
> +	int i;
> +
> +	/*
> +	 * p2pdma mappings are not compatible with devices that use
> +	 * dma_virt_ops.
> +	 */
> +	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops)
> +		return 0;
> +
> +	for_each_sg(sg, s, nents, i) {
> +		pgmap = sg_page(s)->pgmap;
> +		paddr = sg_phys(s);
> +
> +		s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
> +		sg_dma_len(s) = s->length;
> +	}
> +
> +	return nents;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
> +
> +/**
> + * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA
> + * @dev:        device doing the DMA request
> + * @sg:		scatter list to map
> + * @nents:	elements in the scatterlist
> + * @dir:        DMA direction

Same whitespace comment as above.

> + */
> +void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> +			 enum dma_data_direction dir)
> +{
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 9e907c338a44..1660f64ce96f 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -125,6 +125,7 @@ struct dev_pagemap {
>  	struct device *dev;
>  	void *data;
>  	enum memory_type type;
> +	u64 pci_p2pdma_bus_offset;
>  };
>  
>  #ifdef CONFIG_ZONE_DEVICE
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index c0dde3d3aac4..126eca697ab3 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -36,6 +36,10 @@ int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
>  void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
>  		unsigned int nents);
>  void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
> +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
> +		      enum dma_data_direction dir);
> +void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> +			 enum dma_data_direction dir);
>  #else /* CONFIG_PCI_P2PDMA */
>  static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>  		size_t size, u64 offset)
> @@ -83,5 +87,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
>  static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
>  {
>  }
> +static inline int pci_p2pdma_map_sg(struct device *dev,
> +	struct scatterlist *sg, int nents, enum dma_data_direction dir)
> +{
> +	return 0;
> +}
> +static inline void pci_p2pdma_unmap_sg(struct device *dev,
> +	struct scatterlist *sg, int nents, enum dma_data_direction dir)
> +{
> +}
>  #endif /* CONFIG_PCI_P2PDMA */
>  #endif /* _LINUX_PCI_P2P_H */
> -- 
> 2.11.0
> 

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
@ 2018-03-01 18:02     ` Bjorn Helgaas
  0 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 18:02 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Wed, Feb 28, 2018 at 04:40:00PM -0700, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the the groups are
> assigned.

s/the the/the/

> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any switch will be in the same IOMMU group.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/Kconfig        |  4 ++++
>  drivers/pci/p2pdma.c       | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/pci/pci.c          |  4 ++++
>  include/linux/pci-p2pdma.h |  5 +++++
>  4 files changed, 57 insertions(+)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 840831418cbd..a430672f0ad4 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -138,6 +138,10 @@ config PCI_P2PDMA
>  	  it's hard to tell which support it with good performance, so
>  	  at this time you will need a PCIe switch.
>  
> +	  Enabling this option will also disable ACS on all ports behind
> +	  any PCIe switch. This effictively puts all devices behind any
> +	  switch into the same IOMMU group.

s/effictively/effectively/

Does this really mean "all devices behind the same Root Port"?

What does this mean in terms of device security?  I assume it means,
at least, that individual devices can't be assigned to separate VMs.

I don't mind admitting that this patch makes me pretty nervous, and I
don't have a clear idea of what the implications of this are, or how
to communicate those to end users.  "The same IOMMU group" is a pretty
abstract idea.

>  	  If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4e1c81f64b29..61af07acd21a 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -255,6 +255,50 @@ static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>  	return up2;
>  }
>  
> +/*
> + * pci_p2pdma_disable_acs - disable ACS flags for ports in PCI
> + *	bridges/switches
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any downstream port in any switch in order for
> + * the TLPs to not be forwarded up to the RC which is not what we want
> + * for P2P.
> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any switch to be in the same IOMMU
> + * group. At this time there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device were cleared, otherwise 0.
> + */
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> +	struct pci_dev *up;
> +	int pos;
> +	u16 ctrl;
> +
> +	up = get_upstream_bridge_port(pdev);
> +	if (!up)
> +		return 0;
> +	pci_dev_put(up);
> +
> +	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> +	if (!pos)
> +		return 0;
> +
> +	dev_info(&pdev->dev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> +	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> +	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
> +
> +	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
> +
> +	return 1;
> +}
> +
>  static bool __upstream_bridges_match(struct pci_dev *upstream,
>  				     struct pci_dev *client)
>  {
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index f6a4dd10d9b0..95ad3cf288c8 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -16,6 +16,7 @@
>  #include <linux/of.h>
>  #include <linux/of_pci.h>
>  #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
>  #include <linux/pm.h>
>  #include <linux/slab.h>
>  #include <linux/module.h>
> @@ -2826,6 +2827,9 @@ static void pci_std_enable_acs(struct pci_dev *dev)
>   */
>  void pci_enable_acs(struct pci_dev *dev)
>  {
> +	if (pci_p2pdma_disable_acs(dev))
> +		return;

This doesn't read naturally to me.  I do see that when
CONFIG_PCI_P2PDMA is not set, pci_p2pdma_disable_acs() does nothing
and returns 0, so we'll go ahead and try to enable ACS as before.

But I think it would be clearer to have an #ifdef CONFIG_PCI_P2PDMA
right here so it's more obvious that we only disable ACS when it's
selected.
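
I.e., make the config dependency visible at the call site, something
like this (sketch only; the rest of the body stays as-is):

	void pci_enable_acs(struct pci_dev *dev)
	{
	#ifdef CONFIG_PCI_P2PDMA
		if (pci_p2pdma_disable_acs(dev))
			return;
	#endif

		if (!pci_acs_enable)
			return;

		/* ... existing ACS enable logic unchanged ... */
	}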

>  	if (!pci_acs_enable)
>  		return;
>  
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 126eca697ab3..f537f521f60c 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -22,6 +22,7 @@ struct block_device;
>  struct scatterlist;
>  
>  #ifdef CONFIG_PCI_P2PDMA
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev);
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  		u64 offset);
>  int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
> @@ -41,6 +42,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>  void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>  			 enum dma_data_direction dir);
>  #else /* CONFIG_PCI_P2PDMA */
> +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> +	return 0;
> +}
>  static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>  		size_t size, u64 offset)
>  {
> -- 
> 2.11.0
> 

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-03-01 18:04       ` Logan Gunthorpe
  0 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 18:04 UTC (permalink / raw)
  To: benh, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Oliver OHalloran, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> The problem is that according to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>>
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cachable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd have to manage cache coherency in SW).
> 
> Note: I think the above means it won't work behind a switch on x86
> either, will it ?

This works perfectly fine on x86 behind a switch, and we've tested it on 
multiple machines. We've never had an issue with running out of virtual 
space despite our PCI BARs typically being located at an offset of 
56TB or more. The arch code on x86 also somehow figures out not to map 
the memory as cacheable, so that's not an issue (though, at this point, 
the CPU never accesses the memory, so even if it were, it wouldn't affect 
anything).

We also had this working on ARM64 a while back, but it required some 
out-of-tree ZONE_DEVICE patches and some truly horrid hacks to its arch 
code to ioremap the memory into the page map.

You didn't mention what architecture you were trying this on.

It may make sense at this point to make this feature dependent on x86 
until more work is done to make it properly portable: something like 
arch functions that allow adding IO memory pages with a specific 
cache setting. Though, if an arch has such restrictive limits on the map 
size, it would probably need to address that too somehow.
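
To sketch that idea (the helper name and signature below are purely
hypothetical, not an existing API):

	/*
	 * Hypothetical arch hook: add a device's BAR pages to the page
	 * map with an explicit cache setting, so each arch can refuse
	 * or adjust mappings it cannot support.
	 */
	void *arch_memremap_device_pages(struct dev_pagemap *pgmap,
					 pgprot_t prot);

p2pdma could then request an uncached mapping explicitly, e.g.
arch_memremap_device_pages(pgmap, pgprot_noncached(PAGE_KERNEL)).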

Thanks,

Logan

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-03-01 18:09       ` Stephen  Bates
  0 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 18:09 UTC (permalink / raw)
  To: benh, Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme,
	linux-rdma, linux-nvdimm, linux-block
  Cc: Jens Axboe, Oliver OHalloran, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

>> So Oliver (CC) was having issues getting any of that to work for us.
>> 
>> The problem is that according to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>> 
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cachable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd have to manage cache coherency in SW).
>   
>   Note: I think the above means it won't work behind a switch on x86
>   either, will it ?
 
Ben 

We have done extensive testing of this series and its predecessors using PCIe switches from both Broadcom (PLX) and Microsemi. We have also done testing on x86_64, ARM64 and ppc64el-based architectures with varying degrees of success. The series as it currently stands only works on x86_64, but modified (hacky) versions have been made to work on ARM64. The x86_64 testing has been done on a range of (Intel) CPUs, servers, PCI EPs (including RDMA NICs from at least three vendors, NVMe SSDs from at least four vendors and P2P devices from four vendors) and PCI switches.

I do find it slightly offensive that you would question whether the series even works. I hope you are not suggesting we would submit this framework multiple times without having tested it....

Stephen

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 17:40       ` Logan Gunthorpe
@ 2018-03-01 18:35         ` Sagi Grimberg
  -1 siblings, 0 replies; 504+ messages in thread
From: Sagi Grimberg @ 2018-03-01 18:35 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Steve Wise, Alex Williamson,
	Keith Busch, Jérôme Glisse, Jason Gunthorpe,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig


> On 01/03/18 04:03 AM, Sagi Grimberg wrote:
>> Can you describe what would be the plan to have it when these devices
>> do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
>> and not from nvmet_ctrl. Then, when cmb capable devices come along, the
>> ns can prefer to use its own cmb instead of locating a p2p_dev device?
> 
> The patchset already supports CMB drives. That's essentially what patch 
> 7 is for. We change the nvme-pci driver to use the p2pmem code to 
> register and manage the CMB memory. After that, it is simply available 
> to the nvmet code. We have already been using this with a couple 
> prototype CMB drives.

The comment was to your statement:
"Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of it's own memory. However, at this time, cards with CMB buffers
don't seem to be available."

Maybe its a left-over which confused me...

Anyways, my question still holds. If I rack several of these
nvme drives, ideally we would use _their_ cmbs for I/O that is
directed to these namespaces. This is why I was suggesting that
p2p_dev should live in nvmet_ns and not in nvmet_ctrl as a single
p2p_dev used by all namespaces.

nvmet_ns seems like a more natural place to host p2p_dev with
cmb in mind.
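
For illustration, a minimal sketch of that per-namespace layout; the
field placement and the nvmet_ns_local_cmb() helper are hypothetical,
not part of this series:

#include <linux/pci.h>

struct nvmet_ns {
	/* ... existing fields ... */
	struct pci_dev *p2p_dev;	/* preferred P2P memory provider */
};

static void nvmet_ns_pick_p2p_dev(struct nvmet_ns *ns,
				  struct list_head *clients)
{
	/* Prefer the namespace's own CMB when the drive exposes one. */
	ns->p2p_dev = nvmet_ns_local_cmb(ns);	/* hypothetical helper */
	if (!ns->p2p_dev)
		ns->p2p_dev = pci_p2pmem_find(clients);
}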

>>> +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
>>> +                   struct nvmet_ns *ns)
>>> +{
>>> +    int ret;
>>> +
>>> +    if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
>>> +        pr_err("peer-to-peer DMA is not supported by %s\n",
>>> +               ns->device_path);
>>> +        return -EINVAL;
>>
>> I'd say that just skip it instead of failing it, in theory one can
>> connect nvme devices via p2p mem and expose other devices in the
>> same subsystem. The message would be pr_debug to reduce chattiness.
> 
> No, it's actually a bit more complicated than you make it. There's a 
> couple of cases:
> 
> 1) The user adds a namespace but there hasn't been a connection and no 
> p2pmem has been selected. In this case the add_client function is never 
> called and the user can add whatever namespace they like.
> 
> 2) When the first connect happens, nvmet_setup_p2pmem() will call 
> add_client() for each namespace and rdma device. If any of them fail 
> then it does not use a P2P device and falls back, as you'd like, to 
> regular memory.
> 
> 3) When the user adds a namespace after a port is in use and a 
> compatible P2P device has been found. In this case, if the user tries to 
> add a namespace that is not compatible with the P2P device in use then 
> it fails adding the new namespace. The only alternative here is to tear 
> everything down, ensure there are no P2P enabled buffers open and start 
> using regular memory again... That is very difficult.
> 
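
For illustration, a rough sketch of the connect-time flow in case 2
above; the namespace list iteration and member names are assumptions,
and only nvmet_p2pdma_add_client() and pci_p2pmem_find() come from the
quoted patch:

#include <linux/list.h>

static void nvmet_setup_p2pmem_sketch(struct nvmet_ctrl *ctrl)
{
	struct nvmet_ns *ns;

	list_for_each_entry(ns, &ctrl->subsys->namespaces, dev_link) {
		/* Any incompatible namespace: fall back to host memory. */
		if (nvmet_p2pdma_add_client(ctrl, ns)) {
			ctrl->p2p_dev = NULL;
			return;
		}
	}

	ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
}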

Wouldn't it all be simpler if the p2p_dev resolution would be private
to the namespace?

So when adding one, all the namespaces in a subsystem must comply with
using p2p? That seems a little harsh if it's not absolutely needed. It
would be nice to export a subsystem between two ports (on two HCAs,
across NUMA nodes) where the home node (primary path) would use p2p and
failover would use host memory...

Can you help me understand why this is absolutely not feasible?

> I also disagree that these messages should be pr_debug. If a user is 
> trying to use P2P memory and they enable it but the system doesn't 
> choose to use their memory they must know why that is so they can make 
> the necessary adjustments. If the system doesn't at least print a dmesg 
> they are in the dark.

I was arguing that the user might have intentionally wanted to use p2p
where possible but still expose other namespaces over host memory. For
this user, the messages are spam. I guess info/warn could also suffice
(assuming we allow it; if we fail it, there is no point to the
debug-level discussion).

> 
>>> +    }
>>> +
>>> +    ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
>>> +    if (ret)
>>> +        pr_err("failed to add peer-to-peer DMA client %s: %d\n",
>>> +               ns->device_path, ret);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>   int nvmet_ns_enable(struct nvmet_ns *ns)
>>>   {
>>>       struct nvmet_subsys *subsys = ns->subsys;
>>> @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
>>>       if (ret)
>>>           goto out_blkdev_put;
>>> +    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>>> +        if (ctrl->p2p_dev) {
>>> +            ret = nvmet_p2pdma_add_client(ctrl, ns);
>>> +            if (ret)
>>> +                goto out_remove_clients;
>>
>> Is this really a fatal failure given that we fall-back to main
>> memory? Why not continue with main memory (and warn at best)?
> 
> See above. It's fatal because we are already using p2p memory and we 
> can't easily tear that all down when a user adds a new namespace.

In my mind, I/O to this namespace would use host memory and be done
with it. I guess I still don't understand why this is not possible.

>>> +    ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
>>
>> This is the first p2p_dev found right? What happens if I have more than
>> a single p2p device? In theory I'd have more p2p memory I can use. Have
>> you considered making pci_p2pmem_find return the least used suitable
>> device?
> 
> Yes, it currently returns the first found. I imagine a bunch of 
> improvements could be made to it in future work. However, I'd probably 
> start with finding the nearest p2p device and then falling back to the 
> least used if that's necessary. At this time though it's not clear how 
> complicated these systems will get and what's actually needed here.
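
As a hedged sketch of the "least used" idea, assuming gen_pool_avail()
as the fullness metric and a hypothetical pci_p2pdma_is_compatible()
standing in for the series' client-compatibility check:

#include <linux/genalloc.h>
#include <linux/pci.h>

struct pci_dev *pci_p2pmem_find_least_used(struct list_head *clients)
{
	struct pci_dev *pdev = NULL, *best = NULL;
	size_t best_avail = 0;

	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
		size_t avail;

		if (!pdev->p2pdma)
			continue;
		if (!pci_p2pdma_is_compatible(pdev, clients)) /* hypothetical */
			continue;

		/* Pick the provider with the most unallocated space. */
		avail = gen_pool_avail(pdev->p2pdma->pool);
		if (avail > best_avail) {
			pci_dev_put(best);
			best = pci_dev_get(pdev);
			best_avail = avail;
		}
	}

	return best;
}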

Fair enough. Just wondering.

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 18:35         ` Sagi Grimberg
@ 2018-03-01 18:42           ` Jason Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Jason Gunthorpe @ 2018-03-01 18:42 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Stephen Bates, Christoph Hellwig,
	Jens Axboe, Keith Busch, Bjorn Helgaas, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson, Steve Wise

On Thu, Mar 01, 2018 at 08:35:55PM +0200, Sagi Grimberg wrote:
> 
> >On 01/03/18 04:03 AM, Sagi Grimberg wrote:
> >>Can you describe what would be the plan to have it when these devices
> >>do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
> >>and not from nvmet_ctrl. Then, when cmb capable devices come along, the
> >>ns can prefer to use its own cmb instead of locating a p2p_dev device?
> >
> >The patchset already supports CMB drives. That's essentially what patch 7
> >is for. We change the nvme-pci driver to use the p2pmem code to register
> >and manage the CMB memory. After that, it is simply available to the nvmet
> >code. We have already been using this with a couple prototype CMB drives.
> 
> The comment was to your statement:
> "Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> save an extra PCI transfer as the NVME card could just take the data
> out of it's own memory. However, at this time, cards with CMB buffers
> don't seem to be available."
> 
> Maybe its a left-over which confused me...
> 
> Anyways, my question still holds. If I rack several of these
> nvme drives, ideally we would use _their_ cmbs for I/O that is
> directed to these namespaces. This is why I was suggesting that
> p2p_dev should live in nvmet_ns and not in nvmet_ctrl as a single
> p2p_dev used by all namespaces.

I agree, I don't think this series should target anything other than
using p2p memory located in one of the devices expected to participate
in the p2p transaction for a first pass.

Locality is super important for p2p, so I don't think things should
start out in a way that makes specifying the desired locality hard.

This is also why I don't entirely understand why this series has a
generic allocator for p2p mem; it makes little sense to me.

Why wouldn't the nvme driver just claim the entire CMB of its local
device for its own use? Why involve p2p core code in this?

Jason

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 18:02     ` Bjorn Helgaas
@ 2018-03-01 18:54       ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 18:54 UTC (permalink / raw)
  To: Bjorn Helgaas, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Thanks for the detailed review Bjorn!

>>  
>> +	  Enabling this option will also disable ACS on all ports behind
>> +	  any PCIe switch. This effectively puts all devices behind any
>> +	  switch into the same IOMMU group.

>
>  Does this really mean "all devices behind the same Root Port"?

Not necessarily. You might have a cascade of switches (i.e. switches below a switch) to achieve a very large fan-out (in an NVMe SSD array, for example), and we will only disable ACS on the ports below the relevant switch.

> What does this mean in terms of device security?  I assume it means,
> at least, that individual devices can't be assigned to separate VMs.

This was discussed during v1 [1]. Disabling ACS on all downstream ports of the switch means that all the EPs below it have to be part of the same IOMMU grouping. However, it was also agreed that as long as the ACS disable occurs at boot time (which it does in v2), the virtualization layer will be aware of it and will perform the IOMMU group formation correctly.
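
As a simplified sketch of that boot-time behaviour (not the literal
patch; the hook point into the PCI core's ACS setup is elided):

#include <linux/pci.h>

/*
 * With CONFIG_PCI_P2PDMA enabled, skip ACS on PCIe switch downstream
 * ports so P2P TLPs can be routed between endpoints under the same
 * switch. Root Ports keep ACS, so only devices below a switch end up
 * sharing an IOMMU group.
 */
static bool pci_p2pdma_wants_acs_off(struct pci_dev *pdev)
{
	if (!IS_ENABLED(CONFIG_PCI_P2PDMA) || !pci_is_pcie(pdev))
		return false;

	return pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM;
}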
    
> I don't mind admitting that this patch makes me pretty nervous, and I
> don't have a clear idea of what the implications of this are, or how
> to communicate those to end users.  "The same IOMMU group" is a pretty
> abstract idea.
    
Alex gave a good overview of the implications in [1].

Stephen 

[1] https://marc.info/?l=linux-pci&m=151512320031739&w=2


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
  2018-03-01 17:37       ` Bjorn Helgaas
@ 2018-03-01 18:55         ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 18:55 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

Hi Bjorn,

Thanks for the review. I'll correct all the nits for the next version.

On 01/03/18 10:37 AM, Bjorn Helgaas wrote:
> On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
>> Some PCI devices may have memory mapped in a BAR space that's
>> intended for use in Peer-to-Peer transactions. In order to enable
>> such transactions the memory must be registered with ZONE_DEVICE pages
>> so it can be used by DMA interfaces in existing drivers.

> Is there anything about this memory that makes it specifically
> intended for peer-to-peer transactions?  I assume the device can't
> really tell whether a transaction is from a CPU or a peer.

No, there's nothing special about the memory, and it can still be accessed 
by the CPU. This is just its intended purpose. You could use this PCI 
memory as regular DMA buffers or regular memory, but I'm not sure why you 
would; it would probably be pretty bad performance-wise.


> BTW, maybe there could be some kind of guide for device driver writers
> in Documentation/PCI/?

Makes sense; we can look at writing something for the next iteration.

> I think it would be clearer and sufficient to simply say that we have
> no way to know whether peer-to-peer routing between PCIe Root Ports is
> supported (PCIe r4.0, sec 1.3.1).

Fair enough.

> The fact that you use the PCIe term "switch" suggests that a PCIe
> Switch is required, but isn't it sufficient for the peers to be below
> the same "PCI bridge", which would include PCIe Root Ports, PCIe
> Switch Downstream Ports, and conventional PCI bridges?
> The comments at get_upstream_bridge_port() suggest that this isn't
> enough, and the peers actually do have to be below the same PCIe
> Switch, but I don't know why.

I do mean Switch, as we need to keep the traffic off the root complex. 
As stated above, we don't know if the root complex actually supports it 
(while we can be certain any PCI switch does). So we specifically want 
to exclude PCIe Root Ports. I'm not sure about the support of PCI 
bridges, but I can't imagine anyone wanting to do P2P around them, so 
I'd rather be safe than sorry and exclude them.


>> +	addr = devm_memremap_pages(&pdev->dev, pgmap);
>> +	if (IS_ERR(addr))
> 
> Free pgmap here?  And in the other error case below?  Or maybe this
> happens via the devm_* magic?  If so, when would that actually happen?
> Would pgmap be effectively leaked until the pdev is destroyed?

Yes, it happens via the devm magic as that's the way the 
devm_memremap_pages() interface was designed. If I remember correctly, 
in my testing, it would be de-allocated when the driver gets unbound.

>> +		return PTR_ERR(addr);
>> +
>> +	error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
>> +			pci_bus_address(pdev, bar) + offset,
>> +			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
>> +	if (error)
>> +		return error;
>> +
>> +	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
>> +					  &pdev->p2pdma->devmap_ref);
>> +	if (error)
>> +		return error;
>> +
>> +	dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
>> +		 &pgmap->res);
> 
> s/dev_info/pci_info/ (also similar usages below, except for the one or
> two cases where you don't have a pci_dev).

Oh, nice, I didn't notice that was added.

> This whole thing is confusing to me.  Why do you want to reject peers
> directly connected to the same root port?  Why do you require the same
> Switch Upstream Port?  You don't exclude conventional PCI, but it
> looks like you would require peers to share *two* upstream PCI-to-PCI
> bridges?  I would think a single shared upstream bridge (conventional,
> PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?

Hmm, yes, this may just be laziness on my part. Finding the shared 
upstream bridge is a bit more tricky than just showing that they are on 
the same switch. So, as coded, a fabric of switches with peers on 
different legs of the fabric is not supported. But yes, maybe they just 
need to be two devices with a single shared upstream bridge that is not 
the root port. Again, we need to reject the root port because we can't 
know if the root complex can support P2P traffic.
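
A rough sketch of that topology check as described (reference counting
and multi-level switch fabrics are deliberately elided; helper names
are illustrative, not the patch's get_upstream_bridge_port()):

#include <linux/pci.h>

/* Walk up from an endpoint: its parent should be a switch downstream
 * port, and that port's parent the switch upstream port shared by all
 * legs of the switch. */
static struct pci_dev *get_upstream_switch_port(struct pci_dev *pdev)
{
	struct pci_dev *up = pci_upstream_bridge(pdev);

	if (!up || pci_pcie_type(up) != PCI_EXP_TYPE_DOWNSTREAM)
		return NULL;

	up = pci_upstream_bridge(up);
	if (!up || pci_pcie_type(up) != PCI_EXP_TYPE_UPSTREAM)
		return NULL;

	return up;
}

static bool p2pdma_same_switch(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *ua = get_upstream_switch_port(a);

	return ua && ua == get_upstream_switch_port(b);
}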

> Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
> sort of weird that callers supply a non-PCI device and then we look up
> a PCI device here.  I assume you have some reason for this; if you
> added a writeup in Documentation/PCI, that would be a good place to
> elaborate on that, maybe with a one-line clue here.

Well yes, but this is much more convenient for callers, which don't need 
to care whether the device they are attempting to add (which in the NVMe 
target case could be a random block device) is a PCI device or not, 
especially since find_parent_pci_dev() is non-trivial.

>> +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
>> +{
>> +	void *ret;
>> +
>> +	if (unlikely(!pdev->p2pdma))
> 
> Is this a hot path?  I'm not sure it's worth cluttering
> non-performance paths with likely/unlikely.

I'd say it can be pretty hot given that we need to allocate and free 
buffers at multiple GB/s for the NVMe target case. I don't exactly have 
benchmarks or anything to show this though...

>> +		return NULL;
>> +
>> +	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
>> +		return NULL;
>> +
>> +	ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);
> 
> Why the double cast?  Wouldn't "(void *)" be sufficient?

Oh, hmm, I can't remember now. I suspect I added it to fix a kbuild test 
robot error. I'll check and see if it can be removed.
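
For reference, the cast boils down to this (a sketch assuming the usual
kernel invariant that unsigned long is pointer-sized):

#include <linux/genalloc.h>
#include <linux/pci.h>

static void *p2p_alloc_example(struct pci_dev *pdev, size_t size)
{
	/* gen_pool_alloc() returns an unsigned long address, so a single
	 * (void *) cast round-trips safely on every kernel arch; the
	 * extra (uintptr_t) step in the patch is redundant (possibly
	 * added to silence a kbuild robot warning). */
	return (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
}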


> In v4.16-rc1, gen_pool_free() takes "unsigned long addr".  I know this
> is based on -rc3; is this something that changed between -rc1 and
> -rc3?

Yes, I think it's taken an unsigned long for a while now. Per the 
above, I can't remember exactly why I cast it this way, and I'll take 
a look again.

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
@ 2018-03-01 18:55         ` Logan Gunthorpe
  0 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 18:55 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Stephen Bates, Christoph Hellwig, Jens Axboe,
	Keith Busch, Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt, Alex Williamson

Hi Bjorn,

Thanks for the review. I'll correct all the nits for the next version.

On 01/03/18 10:37 AM, Bjorn Helgaas wrote:
> On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
>> Some PCI devices may have memory mapped in a BAR space that's
>> intended for use in Peer-to-Peer transactions. In order to enable
>> such transactions the memory must be registered with ZONE_DEVICE pages
>> so it can be used by DMA interfaces in existing drivers.

> Is there anything about this memory that makes it specifically
> intended for peer-to-peer transactions?  I assume the device can't
> really tell whether a transaction is from a CPU or a peer.

No there's nothing special about the memory and it can still be accessed 
by the CPU. This is just the intended purpose. You could use this PCI 
memory as regular DMA buffers or regular memory but I'm not sure why you 
would. It would probably be pretty bad performance-wise.


> BTW, maybe there could be some kind of guide for device driver writers
> in Documentation/PCI/?
Makes sense we can look at writing something for the next iteration.

> I think it would be clearer and sufficient to simply say that we have
> no way to know whether peer-to-peer routing between PCIe Root Ports is
> supported (PCIe r4.0, sec 1.3.1).

Fair enough.

> The fact that you use the PCIe term "switch" suggests that a PCIe
> Switch is required, but isn't it sufficient for the peers to be below
> the same "PCI bridge", which would include PCIe Root Ports, PCIe
> Switch Downstream Ports, and conventional PCI bridges?
> The comments at get_upstream_bridge_port() suggest that this isn't
> enough, and the peers actually do have to be below the same PCIe
> Switch, but I don't know why.

I do mean Switch as we do need to keep the traffic off the root complex. 
Seeing, as stated above, we don't know if it actually support it. (While 
we can be certain any PCI switch does). So we specifically want to 
exclude PCIe Root ports and I'm not sure about the support of PCI 
bridges but I can't imagine anyone wanting to do P2P around them so I'd 
rather be safe than sorry and exclude them.


>> +	addr = devm_memremap_pages(&pdev->dev, pgmap);
>> +	if (IS_ERR(addr))
> 
> Free pgmap here?  And in the other error case below?  Or maybe this
> happens via the devm_* magic?  If so, when would that actually happen?
> Would pgmap be effectively leaked until the pdev is destroyed?

Yes, it happens via the devm magic as that's the way the 
devm_memremap_pages() interface was designed. If I remember correctly, 
in my testing, it would be de-allocated when the driver gets unbound.

>> +		return PTR_ERR(addr);
>> +
>> +	error = gen_pool_add_virt(pdev->p2pdma->pool, (uintptr_t)addr,
>> +			pci_bus_address(pdev, bar) + offset,
>> +			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
>> +	if (error)
>> +		return error;
>> +
>> +	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
>> +					  &pdev->p2pdma->devmap_ref);
>> +	if (error)
>> +		return error;
>> +
>> +	dev_info(&pdev->dev, "added peer-to-peer DMA memory %pR\n",
>> +		 &pgmap->res);
> 
> s/dev_info/pci_info/ (also similar usages below, except for the one or
> two cases where you don't have a pci_dev).

Oh, nice, I didn't notice that was added.

> This whole thing is confusing to me.  Why do you want to reject peers
> directly connected to the same root port?  Why do you require the same
> Switch Upstream Port?  You don't exclude conventional PCI, but it
> looks like you would require peers to share *two* upstream PCI-to-PCI
> bridges?  I would think a single shared upstream bridge (conventional,
> PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?

Hmm, yes, this may just be laziness on my part. Finding the shared 
upstream bridge is a bit more tricky than just showing that they are on 
the same switch. So as coded, a fabric of switches with peers on 
different legs of the fabric are not supported. But yes, maybe they just 
need to be two devices with a single shared upstream bridge that is not 
the root port. Again, we need to reject the root port because we can't 
know if the root complex can support P2P traffic.

> Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
> sort of weird that callers supply a non-PCI device and then we look up
> a PCI device here.  I assume you have some reason for this; if you
> added a writeup in Documentation/PCI, that would be a good place to
> elaborate on that, maybe with a one-line clue here.

Well yes, but this is much more convenient for callers which don't need 
to care if the device they are attempting to add (which in the NVMe 
target case, could be a random block device) is a pci device or not. 
Especially seeing find_parent_pci_dev() is non-trivial.

>> +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
>> +{
>> +	void *ret;
>> +
>> +	if (unlikely(!pdev->p2pdma))
> 
> Is this a hot path?  I'm not sure it's worth cluttering
> non-performance paths with likely/unlikely.

I'd say it can be pretty hot given that we need to allocate and free 
buffers at multiple GB/s for the NVMe target case. I don't exactly have 
benchmarks or anything to show this though...

>> +		return NULL;
>> +
>> +	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
>> +		return NULL;
>> +
>> +	ret = (void *)(uintptr_t)gen_pool_alloc(pdev->p2pdma->pool, size);
> 
> Why the double cast?  Wouldn't "(void *)" be sufficient?

Oh, hmm, I can't remember now. I suspect I added it to fix a kbuild test 
robot error. I'll check and see if it can be removed.


> In v4.6-rc1, gen_pool_free() takes "unsigned long addr".  I know this
> is based on -rc3; is this something that changed between -rc1 and
> -rc3?

Yes, I think it's taken an unsigned long for a while now. Per the above, 
I can't remember exactly why I casted it this way and I'll take a look 
again.

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 18:42           ` Jason Gunthorpe
                               ` (2 preceding siblings ...)
  (?)
@ 2018-03-01 19:01             ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 19:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Steve Wise, linux-kernel, linux-nvme,
	linux-block, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

    
> I agree, I don't think this series should target anything other than
> using p2p memory located in one of the devices expected to participate
> in the p2p transaction for a first pass..

I disagree. There is definitely interest in using an NVMe CMB as a bounce buffer and in deploying systems where only some of the NVMe SSDs below a switch have a CMB but P2P is used to access all of them. Also, there are some devices that only expose memory and whose entire purpose is to act as a p2p device; supporting these devices would be valuable.
    
> locality is super important for p2p, so I don't think things should
>  start out in a way that makes specifying the desired locality hard.

Ensuring that the EPs engaged in p2p are all directly connected to the same PCIe switch ensures locality and (for the switches we have tested) performance. I agree solving the case where the namespace and CMB are on the same PCIe EP is valuable, but I don't see it as critical to initial acceptance of the series.
    
Stephen
    


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 18:35         ` Sagi Grimberg
  (?)
  (?)
@ 2018-03-01 19:10           ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 19:10 UTC (permalink / raw)
  To: Sagi Grimberg, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Steve Wise, Alex Williamson,
	Keith Busch, Jérôme Glisse, Jason Gunthorpe,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig


> Wouldn't it all be simpler if the p2p_dev resolution would be private
> to the namespace?
> 
> So if p2p is added, must all the namespaces in a subsystem comply to
> using it? Seems a little bit harsh if it's not absolutely needed. Would
> be nice to export a subsystem between two ports (on two HCAs, across
> NUMA nodes) where the home node (primary path) would use p2p and
> failover would use host memory...
> 
> Can you help me understand why this is absolutely not feasible?

Yes, it would simplify things. However, as best as I can tell, when we 
allocate memory we don't know what namespace it will be used for. If 
there is a way, I could probably rework it so there's a P2P device per 
namespace.

Can you show me how we'd know the namespace to use in 
nvmet_rdma_map_sgl_keyed()?
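
Purely for illustration, a hypothetical sketch of the per-namespace
resolution under discussion; the p2p_dev field and the helper below are
invented for this sketch and are not from the posted series:

/*
 * Sketch: assume struct nvmet_ns grew a p2p_dev pointer, resolved when
 * the namespace is enabled.  The sticking point above is that at
 * nvmet_rdma_map_sgl_keyed() time req->ns is not yet set, so this
 * helper would return NULL exactly when the buffer must be allocated.
 */
static struct pci_dev *nvmet_req_p2p_dev(struct nvmet_req *req)
{
	if (!req->ns)		/* not yet resolved at SGL-map time */
		return NULL;
	return req->ns->p2p_dev;	/* hypothetical field */
}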

Thanks,

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 18:02     ` Bjorn Helgaas
  (?)
  (?)
@ 2018-03-01 19:13       ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 19:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 11:02 AM, Bjorn Helgaas wrote:
>>   void pci_enable_acs(struct pci_dev *dev)
>>   {
>> +	if (pci_p2pdma_disable_acs(dev))
>> +		return;
> 
> This doesn't read naturally to me.  I do see that when
> CONFIG_PCI_P2PDMA is not set, pci_p2pdma_disable_acs() does nothing
> and returns 0, so we'll go ahead and try to enable ACS as before.
> 
> But I think it would be clearer to have an #ifdef CONFIG_PCI_P2PDMA
> right here so it's more obvious that we only disable ACS when it's
> selected.

I could do this... however, I wrote it this way because I've read that 
Linus dislikes #ifdefs inside function bodies, and I personally agree 
with that sentiment.
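
As a sketch of the pattern being defended here, a stub keeps the #ifdef
out of the caller; this follows the behaviour Bjorn describes above (the
stub returns 0 when CONFIG_PCI_P2PDMA is off), with the surrounding code
abbreviated:

/* In the header: */
#ifdef CONFIG_PCI_P2PDMA
int pci_p2pdma_disable_acs(struct pci_dev *pdev);
#else
static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev)
{
	return 0;	/* nothing disabled; ACS gets enabled as before */
}
#endif

/* The caller then stays free of #ifdefs in its body: */
void pci_enable_acs(struct pci_dev *dev)
{
	if (pci_p2pdma_disable_acs(dev))
		return;

	/* ... existing ACS enabling logic ... */
}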

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01  3:56     ` Benjamin Herrenschmidt
  (?)
  (?)
@ 2018-03-01 19:21       ` Dan Williams
  -1 siblings, 0 replies; 504+ messages in thread
From: Dan Williams @ 2018-03-01 19:21 UTC (permalink / raw)
  To: benh
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
<benh@au1.ibm.com> wrote:
> On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
>> > Hi Everyone,
>>
>>
>> So Oliver (CC) was having issues getting any of that to work for us.
>>
>> The problem is that according to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>>
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cachable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd have to manage cache coherency in SW).
>
> Note: I think the above means it won't work behind a switch on x86
> either, will it ?

The devm_memremap_pages() infrastructure allows placing the memmap in
"System-RAM" even if the hotplugged range is in PCI space. So, even if
it is an issue on some configurations, it's just a simple adjustment
to where the memmap is placed.
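
For readers following along, a rough sketch of the hotplug step being
discussed, built around the devm_memremap_pages() call quoted earlier in
the thread; the helper name and the ref wiring are assumptions, not code
from the series:

static void *p2p_map_bar_sketch(struct pci_dev *pdev, int bar,
				struct percpu_ref *ref)
{
	struct dev_pagemap *pgmap;

	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
	if (!pgmap)
		return ERR_PTR(-ENOMEM);

	pgmap->res = pdev->resource[bar];  /* BAR range to back with pages */
	pgmap->ref = ref;                  /* assumed field, per the
					      devmap_ref usage above */

	/*
	 * The memmap (the struct pages themselves) is allocated from
	 * regular System-RAM; only the data lives in PCI space, which
	 * is the adjustment Dan refers to.
	 */
	return devm_memremap_pages(&pdev->dev, pgmap);
}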

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 18:42           ` Jason Gunthorpe
  (?)
  (?)
@ 2018-03-01 19:27             ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 19:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Steve Wise, linux-kernel, linux-nvme,
	linux-block, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 11:42 AM, Jason Gunthorpe wrote:
> On Thu, Mar 01, 2018 at 08:35:55PM +0200, Sagi Grimberg wrote:
> This is also why I don't entirely understand why this series has a
> generic allocator for p2p mem, it makes little sense to me.

> Why wouldn't the nvme driver just claim the entire CMB of its local
> device for its own use? Why involve p2p core code in this?

We'd prefer to have a generic way to get p2pmem instead of restricting 
ourselves to only using CMBs. We did work in the past where the P2P 
memory was part of an IB adapter and not the NVMe card. So this won't 
work if it's an NVMe only interface.

As Stephen mentioned, we also use a couple of devices that only expose P2P 
memory and this isn't related to NVMe at all. So there's value in having 
a generic interface and allocator to enable all devices to provide this 
memory.

If there were a hypothetical situation where a driver wanted to use some 
of the memory for P2P and some of it for other purposes, it would just 
divide the memory itself and only pass a subset to 
pci_p2pdma_add_resource(). However, as per our changes to the nvme-pci 
code, it's really just easier to use the allocator for everything inside 
the driver.
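
As a usage sketch of the split described above (the
pci_p2pdma_add_resource() signature is inferred from the series, and the
size/offset values are placeholders):

/* Register only part of the BAR for P2P, keeping the rest private: */
int err = pci_p2pdma_add_resource(pdev, bar, p2p_size, p2p_offset);
if (err)
	return err;

/* Everything inside the driver then allocates from the same pool: */
void *buf = pci_alloc_p2pmem(pdev, PAGE_SIZE);
if (!buf)
	return -ENOMEM;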

Logan


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 19:21       ` Dan Williams
  (?)
  (?)
@ 2018-03-01 19:30         ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 19:30 UTC (permalink / raw)
  To: Dan Williams, benh
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 12:21 PM, Dan Williams wrote:
>> Note: I think the above means it won't work behind a switch on x86
>> either, will it ?
> 
> The devm_memremap_pages() infrastructure allows placing the memmap in
> "System-RAM" even if the hotplugged range is in PCI space. So, even if
> it is an issue on some configurations, it's just a simple adjustment
> to where the memmap is placed.

Thanks for the confirmation Dan!

Logan


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 10:31   ` Sagi Grimberg
  (?)
  (?)
@ 2018-03-01 19:33     ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 19:33 UTC (permalink / raw)
  To: Sagi Grimberg, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Benjamin Herrenschmidt, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 01/03/18 03:31 AM, Sagi Grimberg wrote:
>> * We also reject using devices that employ 'dma_virt_ops' which should
>>    fairly simply handle Jason's concerns that this work might break with
>>    the HFI, QIB and rxe drivers that use the virtual ops to implement
>>    their own special DMA operations.
> 
> That's good, but what would happen for these devices? simply fail the
> mapping causing the ulp to fail its rdma operation? I would think
> that we need a capability flag for devices that support it.

pci_p2pmem_find() will simply not return any devices when any client 
in the list uses dma_virt_ops. So in the NVMe target case it simply will 
not use P2P memory.

And just in case, pci_p2pdma_map_sg() will also return 0 if the device 
passed to it uses dma_virt_ops as well. So if someone bypasses 
pci_p2pmem_find() they will get a failure during map.
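
A rough sketch of the check being described; the helper name is invented
here, but dma_virt_ops is the real symbol the rxe/HFI/QIB drivers use:

static bool p2pdma_client_usable(struct device *dev)
{
	/*
	 * dma_virt_ops implements "DMA" as CPU copies, which cannot
	 * target PCI BAR pages, so such clients are filtered out by
	 * the pci_p2pmem_find() path described above.
	 */
	return get_dma_ops(dev) != &dma_virt_ops;
}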

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 03/10] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
  2018-03-01 17:49       ` Bjorn Helgaas
  (?)
  (?)
@ 2018-03-01 19:36         ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 19:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 10:49 AM, Bjorn Helgaas wrote:
>> +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
>> +		      enum dma_data_direction dir)
> 
> Same question as before about why the mixture of "pci_*" interfaces
> that take "struct device *" parameters.

In this case, per the discussion in v1, it was felt the _map_sg() 
function should take the same arguments as dma_map_sg().
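
For reference, the parity being described: both calls take the same
argument list (the dma_map_sg() prototype is the standard DMA API one;
pci_p2pdma_map_sg() is as quoted above):

int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
	       enum dma_data_direction dir);

int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg,
		      int nents, enum dma_data_direction dir);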

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 18:04       ` Logan Gunthorpe
@ 2018-03-01 20:29         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 20:29 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Oliver OHalloran, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> 
> On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > The problem is that according to him (I didn't double check the latest
> > > patches) you effectively hotplug the PCIe memory into the system when
> > > creating struct pages.
> > > 
> > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > cacheable. (Note that doing so is a bad idea if you are behind a PLX
> > > switch anyway, since you'd have to manage cache coherency in SW).
> > 
> > Note: I think the above means it won't work behind a switch on x86
> > either, will it?
> 
> This works perfectly fine on x86 behind a switch and we've tested it on 
> multiple machines. We've never had an issue of running out of virtual 
> space despite our PCI BARs typically being located with an offset of 
> 56TB or more. The arch code on x86 also somehow figures out not to map 
> the memory as cacheable so that's not an issue (though, at this point, 
> the CPU never accesses the memory so even if it were, it wouldn't affect 
> anything).

Oliver, can you look into this? You said the memory was effectively
hotplugged into the system when creating the struct pages. That would
mean to me that it's a) mapped (which for us is cacheable, maybe x86 has
tricks to avoid that) and b) potentially used to populate userspace
pages (which will definitely be cacheable). Unless there's something in
there you didn't see that prevents it.

> We also had this working on ARM64 a while back but it required some out 
> of tree ZONE_DEVICE patches and some truly horrid hacks to its arch 
> code to ioremap the memory into the page map.
> 
> You didn't mention what architecture you were trying this on.

ppc64.

> It may make sense at this point to make this feature dependent on x86 
> until more work is done to make it properly portable. Something like 
> arch functions that allow adding IO memory pages with a specific 
> cache setting. Though, if an arch has such restrictive limits on the map 
> size it would probably need to address that too somehow.

Not a fan of that approach.

So there are two issues to consider here:

 - Our MMIO space is very far away from memory (high bits set in the
address) which causes problems with things like vmemmap, page_address,
virt_to_page, etc. Do you have similar issues on arm64?

 - We need to ensure that the mechanism (which I'm not familiar with)
that you use to create the struct pages for the device doesn't end up
turning those device pages into normal "general use" pages for the
system. Oliver thinks it does, you say it doesn't, ...

Jerome (Glisse), what's your take on this? Smells like something that
could be covered by HMM...

Logan, the only reason you need struct pages to begin with is for the
DMA API, right? Or am I missing something here?

Cheers,
Ben.

> Thanks,
> 
> Logan
> 
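
(For readers following along: a minimal sketch of the mechanism Ben is
asking about, i.e. handing a BAR range to devm_memremap_pages() so that
ZONE_DEVICE struct pages are created for it. This is not Logan's actual
patch, and the dev_pagemap field names are from memory; check
include/linux/memremap.h on v4.16 before relying on them.)

  struct dev_pagemap *pgmap;
  void *addr;

  pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
  if (!pgmap)
          return -ENOMEM;

  /* Describe the BAR as the range to create struct pages for. */
  pgmap->res.start = pci_resource_start(pdev, bar);
  pgmap->res.end   = pci_resource_end(pdev, bar);
  pgmap->res.flags = pci_resource_flags(pdev, bar);
  pgmap->ref       = &p2p_ref;          /* driver-owned percpu_ref */

  /* Creates ZONE_DEVICE struct pages for the BAR; it does not, by
   * itself, hand the range to the page allocator for general use. */
  addr = devm_memremap_pages(&pdev->dev, pgmap);
  if (IS_ERR(addr))
          return PTR_ERR(addr);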


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 18:09       ` Stephen  Bates
@ 2018-03-01 20:32         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 20:32 UTC (permalink / raw)
  To: Stephen Bates, Logan Gunthorpe, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block
  Cc: Jens Axboe, Oliver OHalloran, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Thu, 2018-03-01 at 18:09 +0000, Stephen  Bates wrote:
> > > So Oliver (CC) was having issues getting any of that to work for us.
> > > 
> > > The problem is that according to him (I didn't double check the latest
> > > patches) you effectively hotplug the PCIe memory into the system when
> > > creating struct pages.
> > > 
> > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > cacheable. (Note that doing so is a bad idea if you are behind a PLX
> > > switch anyway, since you'd have to manage cache coherency in SW).
> > 
> >   
> >   Note: I think the above means it won't work behind a switch on x86
> >   either, will it?
> 
>  
> Ben 
> 
> We have done extensive testing of this series and its predecessors
> using PCIe switches from both Broadcom (PLX) and Microsemi. We have
> also done testing on the x86_64, ARM64 and ppc64el architectures with
> varying degrees of success. The series as it currently stands only
> works on x86_64 but modified (hacky) versions have been made to work
> on ARM64. The x86_64 testing has been done on a range of (Intel)
> CPUs, servers, PCI EPs (including RDMA NICs from at least three
> vendors, NVMe SSDs from at least four vendors and P2P devices from
> four vendors) and PCI switches.
> 
> I do find it slightly offensive that you would question the series
> even working. I hope you are not suggesting we would submit this
> framework multiple times without having done testing on it....

No need to get personal on that. I did specify that this was based on
some incomplete understanding of what's going on with that new hack
used to create struct pages.

As it is, it cannot work on ppc64 however, in part because according to
Oliver, we end up mapping things cacheable, and in part, because of the
address range issues.

The latter issue might be fundamental to the approach and unfixable
unless we have ways to use hooks for virt_to_page/page_address on these
things.

Ben.


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 19:21       ` Dan Williams
@ 2018-03-01 20:34         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 20:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
> On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
> <benh@au1.ibm.com> wrote:
> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > > > Hi Everyone,
> > > 
> > > 
> > > So Oliver (CC) was having issues getting any of that to work for us.
> > > 
> > > The problem is that according to him (I didn't double check the latest
> > > patches) you effectively hotplug the PCIe memory into the system when
> > > creating struct pages.
> > > 
> > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > cacheable. (Note that doing so is a bad idea if you are behind a PLX
> > > switch anyway, since you'd have to manage cache coherency in SW).
> > 
> > Note: I think the above means it won't work behind a switch on x86
> > either, will it?
> 
> The devm_memremap_pages() infrastructure allows placing the memmap in
> "System-RAM" even if the hotplugged range is in PCI space. So, even if
> it is an issue on some configurations, it's just a simple adjustment
> to where the memmap is placed.

But what happens with that PCI memory? Is it effectively turned into
normal memory (i.e., usable for normal allocations, potentially used to
populate user pages, etc.) or is it kept aside?

Also on ppc64, the physical addresses of PCIe make it so far apart
that there's no way we can map them into the linear mapping at the
normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
page_address or virt_to_page cannot work as-is on PCIe addresses.

Ben.
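
(A sketch of the linear-map arithmetic Ben is describing; this is an
illustration only, not code from the series:)

  unsigned long pfn;      /* pfn of the page in question */
  void *vaddr;
  struct page *pg;

  /* With a linear map, virtual addresses and struct pages are derived
   * by pure arithmetic: */
  vaddr = (void *)(PAGE_OFFSET + (pfn << PAGE_SHIFT)); /* what page_address() assumes */
  pg    = vmemmap + pfn;                               /* what pfn_to_page() assumes */

  /* If a PCIe BAR sits at, say, 56TB+ physical, its pfn is so large
   * that both computations point far outside the range the linear
   * mapping and vmemmap actually cover, so the results are unusable
   * without some kind of hook or remapping. */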


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 20:34         ` Benjamin Herrenschmidt
@ 2018-03-01 20:40           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 20:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Fri, 2018-03-02 at 07:34 +1100, Benjamin Herrenschmidt wrote:
> 
> But what happens with that PCI memory? Is it effectively turned into
> normal memory (i.e., usable for normal allocations, potentially used to
> populate user pages, etc.) or is it kept aside?

(What I mean is: is it added to the page allocator, basically?)

Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
on ppc64 (maybe via an arch hook as it might depend on the processor
family). Server powerpc cannot do cacheable accesses on IO memory
(unless it's special OpenCAPI or NVLink, but not on PCIe).

> Also on ppc64, the physical addresses of PCIe make it so far apart
> that there's no way we can map them into the linear mapping at the
> normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
> page_address or virt_to_page cannot work as-is on PCIe addresses.

Talking of which... is there any documentation on the whole
devm_memremap_pages() machinery? My grep turned up empty...

Cheers,
Ben.
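
(Something along these lines is presumably the shape of the arch hook
Ben is asking for; the hook name below is invented purely for
illustration and does not exist in the tree:)

  /* Hypothetical hook: let the architecture veto cacheable
   * (MEMREMAP_WB) mappings of anything that is not real RAM. */
  static bool arch_memremap_allow_wb(resource_size_t offset, size_t size)
  {
          /* Server powerpc: only System RAM may be mapped cacheable. */
          return page_is_ram(PFN_DOWN(offset));
  }

  /* memremap() would then downgrade such a request, e.g.: */
  if ((flags & MEMREMAP_WB) && !arch_memremap_allow_wb(offset, size))
          flags &= ~MEMREMAP_WB;        /* fall back to WC/WT/UC */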


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 20:40           ` Benjamin Herrenschmidt
@ 2018-03-01 20:53             ` Jason Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Jason Gunthorpe @ 2018-03-01 20:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Dan Williams, Logan Gunthorpe, Linux Kernel Mailing List,
	linux-pci, linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Bjorn Helgaas, Max Gurtovoy,
	Jérôme Glisse, Alex Williamson, Oliver OHalloran

On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
> Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
> on ppc64 (maybe via an arch hook as it might depend on the processor
> family). Server powerpc cannot do cacheable accesses on IO memory
> (unless it's special OpenCAPI or NVLink, but not on PCIe).

I think you are right on this - even on x86 we must not create
cacheable mappings of PCI BARs - there is no way that works the way
anyone would expect.

I think this series doesn't have a problem here only because it never
touches the BAR pages with the CPU.

BAR memory should be mapped into the CPU as WC at best on all arches.

Jason
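
(For reference, the standard way a driver gets a WC rather than a
cacheable CPU mapping of a BAR today; these are existing kernel APIs,
shown only to illustrate the point:)

  void __iomem *p2pmem;

  p2pmem = ioremap_wc(pci_resource_start(pdev, bar),
                      pci_resource_len(pdev, bar));
  if (!p2pmem)
          return -ENOMEM;

  /* or, equivalently: p2pmem = pci_iomap_wc(pdev, bar, 0); */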

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 20:29         ` Benjamin Herrenschmidt
@ 2018-03-01 20:55           ` Jerome Glisse
  -1 siblings, 0 replies; 504+ messages in thread
From: Jerome Glisse @ 2018-03-01 20:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Fri, Mar 02, 2018 at 07:29:55AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> > 
> > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > > The problem is that according to him (I didn't double check the latest
> > > > patches) you effectively hotplug the PCIe memory into the system when
> > > > creating struct pages.
> > > > 
> > > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > > cacheable. (Note that doing so is a bad idea if you are behind a PLX
> > > > switch anyway, since you'd have to manage cache coherency in SW).
> > > 
> > > Note: I think the above means it won't work behind a switch on x86
> > > either, will it?
> > 
> > This works perfectly fine on x86 behind a switch and we've tested it on 
> > multiple machines. We've never had an issue of running out of virtual 
> > space despite our PCI BARs typically being located with an offset of 
> > 56TB or more. The arch code on x86 also somehow figures out not to map 
> > the memory as cacheable so that's not an issue (though, at this point, 
> > the CPU never accesses the memory so even if it were, it wouldn't affect 
> > anything).
> 
> Oliver, can you look into this? You said the memory was effectively
> hotplugged into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cacheable, maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (which will definitely be cacheable). Unless there's something in
> there you didn't see that prevents it.
> 
> > We also had this working on ARM64 a while back but it required some out 
> > of tree ZONE_DEVICE patches and some truly horrid hacks to its arch 
> > code to ioremap the memory into the page map.
> > 
> > You didn't mention what architecture you were trying this on.
> 
> ppc64.
> 
> > It may make sense at this point to make this feature dependent on x86 
> > until more work is done to make it properly portable. Something like 
> > arch functions that allow adding IO memory pages with a specific 
> > cache setting. Though, if an arch has such restrictive limits on the map 
> > size it would probably need to address that too somehow.
> 
> Not a fan of that approach.
> 
> So there are two issues to consider here:
> 
>  - Our MMIO space is very far away from memory (high bits set in the
> address) which causes problems with things like vmemmap, page_address,
> virt_to_page, etc. Do you have similar issues on arm64?

HMM private (HMM public is different) works around that by looking for
"holes" in the address space and using those for hotplug (i.e. page_to_pfn()
!= the physical pfn of the memory). This is OK for HMM because the memory
is never mapped by the CPU, and we can find the physical pfn with a little
bit of math (page_to_pfn() - page->pgmap->res->start + page->pgmap->dev->
physical_base_address).

To avoid anything going bad I actually do not populate the kernel linear
mapping for the range, hence definitely no CPU access at all through those
struct pages. The CPU can still access the PCIe BAR through the usual MMIO map.

> 
>  - We need to ensure that the mechanism (which I'm not familiar with)
> that you use to create the struct pages for the device doesn't end up
> turning those device pages into normal "general use" pages for the
> system. Oliver thinks it does, you say it doesn't, ... 
> 
> Jerome (Glisse), what's your take on this? Smells like something that
> could be covered by HMM...

Well, this is again a new user of struct page for device memory, just for
one use case. I wanted HMM to be more versatile so that it could be used
for this kind of thing too. I guess the message didn't go through. I
will take some cycles tomorrow to look into this patchset to ascertain
how struct page is used in this context.

Note that I also want peer-to-peer for HMM users, but with ACS and using
the IOMMU, i.e. having to populate the IOMMU page table of one device to
point to the BAR of another device. I need to test on how many platforms
this works; hardware engineers are unable/unwilling to commit on whether
this works or not.


> Logan, the only reason you need struct pages to begin with is for the
> DMA API, right? Or am I missing something here?

If it is only needed for that, this sounds like a waste of memory for
struct page. Though I understand this allows the new API to match the
previous one.

Cheers,
Jérôme
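
(Spelled out as code, the pfn math Jérôme describes above looks roughly
like this; "physical_base_address" is his placeholder rather than a real
field, and the dev_pagemap layout should be checked against
include/linux/memremap.h:)

  /* The struct pages sit at pfns taken from a hole in the address
   * space, so page_to_pfn() is not the device's physical pfn; recover
   * it with offset arithmetic: */
  static unsigned long device_physical_pfn(struct page *page)
  {
          unsigned long hole_base = PHYS_PFN(page->pgmap->res.start);
          unsigned long dev_base  = PHYS_PFN(physical_base_address); /* placeholder */

          return dev_base + (page_to_pfn(page) - hole_base);
  }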

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-03-01 20:55           ` Jerome Glisse
  0 siblings, 0 replies; 504+ messages in thread
From: Jerome Glisse @ 2018-03-01 20:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Logan Gunthorpe, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Stephen Bates, Christoph Hellwig,
	Jens Axboe, Keith Busch, Sagi Grimberg, Bjorn Helgaas,
	Jason Gunthorpe, Max Gurtovoy, Dan Williams, Alex Williamson,
	Oliver OHalloran

On Fri, Mar 02, 2018 at 07:29:55AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> > 
> > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > > The problem is that acccording to him (I didn't double check the latest
> > > > patches) you effectively hotplug the PCIe memory into the system when
> > > > creating struct pages.
> > > > 
> > > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > > switch anyway since you'd ahve to manage cache coherency in SW).
> > > 
> > > Note: I think the above means it won't work behind a switch on x86
> > > either, will it ?
> > 
> > This works perfectly fine on x86 behind a switch and we've tested it on 
> > multiple machines. We've never had an issue of running out of virtual 
> > space despite our PCI bars typically being located with an offset of 
> > 56TB or more. The arch code on x86 also somehow figures out not to map 
> > the memory as cachable so that's not an issue (though, at this point, 
> > the CPU never accesses the memory so even if it were, it wouldn't affect 
> > anything).
> 
> Oliver can you look into this ? You sais the memory was effectively
> hotplug'ed into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cachable, maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (that will definitely be cachable). Unless there's something in
> there you didn't see that prevents it.
> 
> > We also had this working on ARM64 a while back but it required some out 
> > of tree ZONE_DEVICE patches and some truly horrid hacks to it's arch 
> > code to ioremap the memory into the page map.
> > 
> > You didn't mention what architecture you were trying this on.
> 
> ppc64.
> 
> > It may make sense at this point to make this feature dependent on x86 
> > until more work is done to make it properly portable. Something like 
> > arch functions that allow adding IO memory pages to with a specific 
> > cache setting. Though, if an arch has such restrictive limits on the map 
> > size it would probably need to address that too somehow.
> 
> Not fan of that approach.
> 
> So there are two issues to consider here:
> 
>  - Our MMIO space is very far away from memory (high bits set in the
> address) which causes problem with things like vmmemmap, page_address,
> virt_to_page etc... Do you have similar issues on arm64 ?

HMM private (HMM public is different) works around that by looking for
"hole" in address space and using those for hotplug (ie page_to_pfn()
!= physical pfn of the memory). This is ok for HMM because the memory
is never map by the CPU and we can find the physical pfn with a little
bit of math (page_to_pfn() - page->pgmap->res->start + page->pgmap->dev->
physical_base_address).

To avoid anything going bad i actually do not populate the kernel linear
mapping for the range hence definitly no CPU access at all through those
struct page. CPU can still access PCIE bar through usual mmio map.

> 
>  - We need to ensure that the mechanism (which I'm not familiar with)
> that you use to create the struct page's for the device don't end up
> turning those device pages into normal "general use" pages for the
> system. Oliver thinks it does, you say it doesn't, ... 
> 
> Jerome (Glisse), what's your take on this ? Smells like something that
> could be covered by HMM...

Well this again a new user of struct page for device memory just for
one usecase. I wanted HMM to be more versatile so that it could be use
for this kind of thing too. I guess the message didn't go through. I
will take some cycles tomorrow to look into this patchset to ascertain
how struct page is use in this context.

Note that i also want peer to peer for HMM users but with ACS and using
IOMMU ie having to populate IOMMU page table of one device to point to
bar of another device. I need to test on how many platform this work,
hardware engineer are unable/unwilling to commit on wether this work or
not.


> Logan, the only reason you need struct page's to begin with is for the
> DMA API right ? Or am I missing something here ?

If it is only needed for that this sounds like a waste of memory for
struct page. Thought i understand this allow new API to match previous
one.

Cheers,
J�r�me

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
@ 2018-03-01 20:55           ` Jerome Glisse
  0 siblings, 0 replies; 504+ messages in thread
From: Jerome Glisse @ 2018-03-01 20:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Fri, Mar 02, 2018 at 07:29:55AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> > 
> > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > > The problem is that acccording to him (I didn't double check the latest
> > > > patches) you effectively hotplug the PCIe memory into the system when
> > > > creating struct pages.
> > > > 
> > > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > > switch anyway since you'd ahve to manage cache coherency in SW).
> > > 
> > > Note: I think the above means it won't work behind a switch on x86
> > > either, will it ?
> > 
> > This works perfectly fine on x86 behind a switch and we've tested it on 
> > multiple machines. We've never had an issue of running out of virtual 
> > space despite our PCI bars typically being located with an offset of 
> > 56TB or more. The arch code on x86 also somehow figures out not to map 
> > the memory as cachable so that's not an issue (though, at this point, 
> > the CPU never accesses the memory so even if it were, it wouldn't affect 
> > anything).
> 
> Oliver can you look into this ? You sais the memory was effectively
> hotplug'ed into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cachable, maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (that will definitely be cachable). Unless there's something in
> there you didn't see that prevents it.
> 
> > We also had this working on ARM64 a while back but it required some out 
> > of tree ZONE_DEVICE patches and some truly horrid hacks to it's arch 
> > code to ioremap the memory into the page map.
> > 
> > You didn't mention what architecture you were trying this on.
> 
> ppc64.
> 
> > It may make sense at this point to make this feature dependent on x86 
> > until more work is done to make it properly portable. Something like 
> > arch functions that allow adding IO memory pages to with a specific 
> > cache setting. Though, if an arch has such restrictive limits on the map 
> > size it would probably need to address that too somehow.
> 
> Not fan of that approach.
> 
> So there are two issues to consider here:
> 
>  - Our MMIO space is very far away from memory (high bits set in the
> address) which causes problem with things like vmmemmap, page_address,
> virt_to_page etc... Do you have similar issues on arm64 ?

HMM private (HMM public is different) works around that by looking for
"hole" in address space and using those for hotplug (ie page_to_pfn()
!= physical pfn of the memory). This is ok for HMM because the memory
is never map by the CPU and we can find the physical pfn with a little
bit of math (page_to_pfn() - page->pgmap->res->start + page->pgmap->dev->
physical_base_address).

To avoid anything going bad i actually do not populate the kernel linear
mapping for the range hence definitly no CPU access at all through those
struct page. CPU can still access PCIE bar through usual mmio map.

> 
>  - We need to ensure that the mechanism (which I'm not familiar with)
> that you use to create the struct page's for the device don't end up
> turning those device pages into normal "general use" pages for the
> system. Oliver thinks it does, you say it doesn't, ... 
> 
> Jerome (Glisse), what's your take on this ? Smells like something that
> could be covered by HMM...

Well, this is again a new user of struct page for device memory, just
for one use case. I wanted HMM to be more versatile so that it could be
used for this kind of thing too; I guess that message didn't go through.
I will take some cycles tomorrow to look into this patchset and
ascertain how struct page is used in this context.

Note that I also want peer-to-peer for HMM users, but with ACS enabled
and going through the IOMMU, i.e. populating the IOMMU page table of one
device to point at the BAR of another device. I need to test how many
platforms this works on; hardware engineers are unable or unwilling to
commit on whether it works or not.


> Logan, the only reason you need struct pages to begin with is for the
> DMA API, right ? Or am I missing something here ?

If it is only needed for that, this sounds like a waste of memory on
struct pages. Though I understand it allows the new API to match the
previous one.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 20:29         ` Benjamin Herrenschmidt
@ 2018-03-01 20:55           ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 20:55 UTC (permalink / raw)
  To: benh, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block
  Cc: Jens Axboe, Oliver OHalloran, Alex Williamson, Keith Busch,
	Jérôme Glisse, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 01/03/18 01:29 PM, Benjamin Herrenschmidt wrote:
> Oliver, can you look into this ? You said the memory was effectively
> hotplugged into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cacheable; maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (which will definitely be cacheable). Unless there's something in
> there you didn't see that prevents it.

Yes, we've been specifically prohibiting all cases where these pages get
passed to userspace. We don't want that. Although it works in limited
cases (e.g. x86), and we use it for some testing, there are dragons there.

>   - Our MMIO space is very far away from memory (high bits set in the
> address), which causes problems with things like vmemmap, page_address,
> virt_to_page etc... Do you have similar issues on arm64 ?

No similar issues on arm64. Any chance you could simply not map the PCI
BARs that way? What's the point of that? It may simply mean ppc64 can't
be supported until either that changes or the kernel infrastructure gets
more sophisticated.

> Logan, the only reason you need struct pages to begin with is for the
> DMA API, right ? Or am I missing something here ?

It's not so much the DMA map API as it is the entire kernel
infrastructure. Scatterlists (which are universally used to set up DMA
requests) require pages, and bios require pages, etc., etc. In fact, this
patch set, in its current form, routes around the DMA API entirely.
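
For instance, building even a single-entry scatterlist needs a struct
page to hand to sg_set_page() (a minimal sketch, nothing NVMe-specific;
p2p_page is assumed to be a page backing the BAR):

/* A scatterlist entry is (page, offset, len), so the P2P BAR memory
 * must have struct pages before any of this plumbing can see it. */
static void p2p_fill_sgl(struct scatterlist *sg, struct page *p2p_page,
                         unsigned int len)
{
    sg_init_table(sg, 1);
    sg_set_page(sg, p2p_page, len, 0);
}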

Myself[1] and others have done prototype work to migrate away from
struct pages and to use pfn_t instead, but this work doesn't seem to be
getting very far in the community.

Logan


[1] https://marc.info/?l=linux-kernel&m=149566222124326&w=2

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 20:53             ` Jason Gunthorpe
@ 2018-03-01 20:57               ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 20:57 UTC (permalink / raw)
  To: Jason Gunthorpe, Benjamin Herrenschmidt
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 01/03/18 01:53 PM, Jason Gunthorpe wrote:
> On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
>> Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
>> on ppc64 (maybe via an arch hook, as it might depend on the processor
>> family). Server powerpc cannot do cacheable accesses on IO memory
>> (unless it's special OpenCAPI or NVLink, but not on PCIe).
> 
> I think you are right on this - even on x86 we must not create
> cacheable mappings of PCI BARs - there is no way that works the way
> anyone would expect.

On x86, even if I try to make a cacheable mapping of a PCI BAR, it always
ends up being uncached. The x86 arch code always does the right thing
here... Other arches, not so much.
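
For example (a sketch of the behaviour described above; pdev is an
assumed struct pci_dev *):

/* Even explicitly asking for a cacheable mapping of BAR 0 comes back
 * uncached on x86; the arch code quietly does the right thing. */
void __iomem *bar = ioremap_cache(pci_resource_start(pdev, 0),
                                  pci_resource_len(pdev, 0));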

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 20:55           ` Jerome Glisse
@ 2018-03-01 21:03             ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 21:03 UTC (permalink / raw)
  To: Jerome Glisse, Benjamin Herrenschmidt
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Alex Williamson, Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig



On 01/03/18 01:55 PM, Jerome Glisse wrote:
> Well, this is again a new user of struct page for device memory, just
> for one use case. I wanted HMM to be more versatile so that it could be
> used for this kind of thing too; I guess that message didn't go through.
> I will take some cycles tomorrow to look into this patchset and
> ascertain how struct page is used in this context.

We looked at it but didn't see how any of it was applicable to our needs.

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 19:21       ` Dan Williams
@ 2018-03-01 21:03         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 21:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
> 
> 
> The devm_memremap_pages() infrastructure allows placing the memmap in
> "System-RAM" even if the hotplugged range is in PCI space. So, even if
> it is an issue on some configurations, it's just a simple adjustment
> to where the memmap is placed.

Actually can you explain a bit more here ?

devm_memremap_pages() doesn't take any specific argument about what to
do with the memory.

It does create the vmemmap sections etc... but does so by calling
arch_add_memory(). So __add_memory() isn't called, which means the
pages aren't added to the linear mapping. Then you manually add them to
ZONE_DEVICE.

Am I correct ?

In that case, they indeed can't be used as normal memory pages, which
is good, and if they are indeed not in the linear mapping, then there
are no caching issues.

However, what happens if anything calls page_address() on them ? Some
DMA ops do that for example, or some devices might ...

This is all quite convoluted with no documentation I can find that
explains the various expectations.

So the question is: are those pages landing in the linear mapping, and
if yes, by what code path ?

The next question is, if we ever want that to work on ppc64, we need a
way to make this fit in our linear mapping and map it non-cacheable,
which will require some wrangling in how we handle that mapping.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:03             ` Logan Gunthorpe
@ 2018-03-01 21:10               ` Jerome Glisse
  -1 siblings, 0 replies; 504+ messages in thread
From: Jerome Glisse @ 2018-03-01 21:10 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran,
	Benjamin Herrenschmidt, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 01, 2018 at 02:03:26PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 01:55 PM, Jerome Glisse wrote:
> > Well this again a new user of struct page for device memory just for
> > one usecase. I wanted HMM to be more versatile so that it could be use
> > for this kind of thing too. I guess the message didn't go through. I
> > will take some cycles tomorrow to look into this patchset to ascertain
> > how struct page is use in this context.
> 
> We looked at it but didn't see how any of it was applicable to our needs.
> 

It seems people misunderstand HMM :( You do not have to use all of
its features. If all you care about is having struct pages, then for
your case you would only use these three functions:

hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove()
for cleanup.

You can set the fault callback to an empty stub that always returns
VM_FAULT_SIGBUS, or patch HMM to allow a NULL callback.

You don't have to use the free callback if you don't care, and if there
is something that doesn't quite match what you want, HMM can always be
adjusted to address it.

The intention is for HMM to be useful for all device memory that wants
struct pages, for whatever reason.
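
Concretely, the minimal use would look roughly like this (a sketch;
callback signatures approximate the current hmm.h and should be treated
as illustrative):

/* Stub fault handler: CPU access to these pages is simply refused. */
static int p2p_devmem_fault(struct hmm_devmem *devmem,
                            struct vm_area_struct *vma, unsigned long addr,
                            const struct page *page, unsigned int flags,
                            pmd_t *pmdp)
{
    return VM_FAULT_SIGBUS;
}

/* Free callback: nothing to do, the pages never reach userspace. */
static void p2p_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
}

static const struct hmm_devmem_ops p2p_devmem_ops = {
    .fault = p2p_devmem_fault,
    .free  = p2p_devmem_free,
};

/* devmem = hmm_devmem_add(&p2p_devmem_ops, &pdev->dev, bar_size); */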

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:03         ` Benjamin Herrenschmidt
@ 2018-03-01 21:11           ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 21:11 UTC (permalink / raw)
  To: benh, Dan Williams
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:
> However, what happens if anything calls page_address() on them ? Some
> DMA ops do that for example, or some devices might ...

Although we could probably work around it with some pain, we rely on
page_address() and virt_to_phys(), etc., working on these pages. So on
x86, yes, it makes it into the linear mapping.
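
That reliance is just this (a sketch; p2p_page is a page from the BAR's
ZONE_DEVICE range):

/* Both of these only work because, on x86, the ZONE_DEVICE pages for
 * the BAR land in the linear mapping. */
void *vaddr = page_address(p2p_page);    /* linear-map virtual address */
phys_addr_t paddr = virt_to_phys(vaddr); /* and back to a physical address */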

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:10               ` Jerome Glisse
@ 2018-03-01 21:15                 ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 21:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran,
	Benjamin Herrenschmidt, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 02:10 PM, Jerome Glisse wrote:
> It seems people misunderstand HMM :( You do not have to use all of
> its features. If all you care about is having struct pages, then for
> your case you would only use these three functions:
> 
> hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove()
> for cleanup.

To what benefit over just using devm_memremap_pages()? If I'm using the
HMM interface and disabling all its features, I don't see the point.
We've also cleaned up the devm_memremap_pages() interface to be more
usefully generic, in such a way that I'd hope HMM starts using it too and
gets rid of the code duplication.
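
For comparison, getting struct pages that way is just this (a sketch of
the current interface; dev, bar_res and ref are assumed to exist):

/* Register the BAR range with devm_memremap_pages() to get ZONE_DEVICE
 * struct pages, with no HMM layer involved. */
struct dev_pagemap *pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);

pgmap->res = *bar_res;  /* struct resource covering the P2P BAR */
pgmap->ref = ref;       /* percpu_ref used for teardown */
void *addr = devm_memremap_pages(dev, pgmap);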

Logan


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:11           ` Logan Gunthorpe
@ 2018-03-01 21:18             ` Jerome Glisse
  -1 siblings, 0 replies; 504+ messages in thread
From: Jerome Glisse @ 2018-03-01 21:18 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, benh, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	linux-block, Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Thu, Mar 01, 2018 at 02:11:34PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:
> > However, what happens if anything calls page_address() on them ? Some
> > DMA ops do that for example, or some devices might ...
> 
> Although we could probably work around it with some pain, we rely on
> page_address() and virt_to_phys(), etc., working on these pages. So on
> x86, yes, it makes it into the linear mapping.

This is pretty easy to do with HMM:

unsigned long hmm_page_to_phys_pfn(struct page *page)
{
    struct hmm_devmem *devmem;
    unsigned long ppfn;

    /* Sanity check; maybe BUG_ON() */
    if (!is_device_private_page(page))
        return -1UL;

    devmem = page->pgmap->data;
    ppfn = page_to_pfn(page) - devmem->pfn_first;
    return ppfn + devmem->device_phys_base_pfn;
}

Note that the last field does not exist in today's HMM because I did not
need such a helper so far, but it can be added.

Cheers,
Jérôme
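
For illustration, a hypothetical caller fragment for the helper above,
assuming it lands in HMM as sketched:

    unsigned long pfn = hmm_page_to_phys_pfn(page);
    phys_addr_t phys;

    if (pfn == -1UL)
        return;        /* not a device-private page */
    phys = (phys_addr_t)pfn << PAGE_SHIFT;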

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 18:54       ` Stephen  Bates
@ 2018-03-01 21:21         ` Alex Williamson
  -1 siblings, 0 replies; 504+ messages in thread
From: Alex Williamson @ 2018-03-01 21:21 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, 1 Mar 2018 18:54:01 +0000
"Stephen  Bates" <sbates@raithlin.com> wrote:

> Thanks for the detailed review Bjorn!
> 
> >>  
> >> +	  Enabling this option will also disable ACS on all ports behind
> >> +	  any PCIe switch. This effectively puts all devices behind any
> >> +	  switch into the same IOMMU group.  
> 
> >
> >  Does this really mean "all devices behind the same Root Port"?  
> 
> Not necessarily. You might have a cascade of switches (i.e. switches below a switch) to achieve a very large fan-out (in an NVMe SSD array for example) and we will only disable ACS on the ports below the relevant switch.
> 
> > What does this mean in terms of device security?  I assume it means,
> > at least, that individual devices can't be assigned to separate VMs.  
> 
> This was discussed during v1 [1]. Disabling ACS on all downstream ports of the switch means that all the EPs below it have to be part of the same IOMMU grouping. However it was also agreed that as long as the ACS disable occurred at boot time (which it does in v2) then the virtualization layer will be aware of it and will perform the IOMMU group formation correctly.

This is still a pretty terrible solution though, your kernel provider
needs to decide whether they favor device assignment or p2p, because we
can't do both, unless there's a patch I haven't seen yet that allows
boot time rather than compile time configuration.  There are absolutely
supported device assignment cases of switches providing isolation between
devices allowing the downstream EPs to be used independently.  I think
this is a non-starter for distribution support without boot time or
dynamic configuration.  I could imagine dynamic configuration through
sysfs that might trigger a soft remove and rescan of the affected
devices in order to rebuild the IOMMU group.  The hard part might be
determining which points to allow that to guarantee correctness.  For
instance, upstream switch ports don't actually support ACS, but they'd
otherwise be an obvious concentration point to trigger a
reconfiguration.  Thanks,

Alex
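
For reference, clearing the ACS redirect bits on a single downstream
port could look roughly like this (the helper name is hypothetical;
PCI_ACS_RR and PCI_ACS_CR are the Request/Completion Redirect bits
that would otherwise force P2P TLPs up through the root complex):

static void p2pdma_disable_acs_redir(struct pci_dev *pdev)
{
    int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
    u16 ctrl;

    if (!pos)
        return;    /* no ACS capability on this port */

    pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
    ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
    pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
}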

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:18             ` Jerome Glisse
@ 2018-03-01 21:22               ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 21:22 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, benh, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	linux-block, Alex Williamson, Jason Gunthorpe, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 01/03/18 02:18 PM, Jerome Glisse wrote:
> This is pretty easy to do with HMM:
> 
> unsigned long hmm_page_to_phys_pfn(struct page *page)
This is not useful unless you want to go through all the kernel paths we 
are using and replace page_to_phys() and friends with something else 
that calls an HMM function when appropriate...

The problem isn't getting the physical address from a page, it's that we 
are passing these pages through various kernel interfaces which expect 
pages that work in the usual manner. (Look at the code: we quite simply 
provide a way to get the PCI bus address from a page when necessary).

Logan
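
Concretely, a sketch of such a page-to-bus-address helper (all
structure, field and function names here are hypothetical, assuming
the PCI bus offset is kept alongside the device's pagemap):

struct p2pdma_pagemap {
    struct dev_pagemap pgmap;
    u64 bus_offset;    /* bus address minus CPU physical address */
};

static u64 p2pdma_page_to_bus(struct page *page)
{
    struct p2pdma_pagemap *p2p =
        container_of(page->pgmap, struct p2pdma_pagemap, pgmap);

    /* normal kernel interfaces keep using page_to_phys(); only the
     * DMA mapping step applies the bus offset */
    return page_to_phys(page) + p2p->bus_offset;
}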

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:15                 ` Logan Gunthorpe
@ 2018-03-01 21:25                   ` Jerome Glisse
  -1 siblings, 0 replies; 504+ messages in thread
From: Jerome Glisse @ 2018-03-01 21:25 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran,
	Benjamin Herrenschmidt, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 01, 2018 at 02:15:01PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 02:10 PM, Jerome Glisse wrote:
> > It seems people misunderstand HMM :( you do not have to use all of
> > its features. If all you care about is having struct page, then just
> > use that; in your case, for instance, only use the following 3 functions:
> > 
> > hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove()
> > for cleanup.
> 
> To what benefit over just using devm_memremap_pages()? If I'm using the hmm
> interface and disabling all the features, I don't see the point. We've also
> cleaned up the devm_memremap_pages() interface to be more usefully generic
> in such a way that I'd hope HMM starts using it too and gets rid of the code
> duplication.
> 

The first HMM variant finds a hole and does not require a resource as an
input parameter. Besides that, internally, for PCIe device memory,
devm_memremap_pages() does not do the right thing: last time I checked,
it always creates a linear mapping of the range, i.e. HMM calls
add_pages() while devm_memremap_pages() calls arch_add_memory().

When I upstreamed HMM, Dan didn't want me to touch devm_memremap_pages()
to match my needs. I am more than happy to modify devm_memremap_pages()
to also handle HMM's needs.

Note that the intention of HMM is to be a middle layer between low-level
infrastructure and device drivers. The idea is that such an impedance
layer should make it easier down the road to change how things are
handled down below without having to touch many device drivers.

Cheers,
Jérôme
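
To make the distinction concrete, the two paths differ roughly like
this (a conceptual sketch using the 4.16-era signatures; the wrapper
itself is hypothetical):

static int devmem_hotplug(int nid, u64 start, u64 size,
                          struct vmem_altmap *altmap, bool linear_map)
{
    if (linear_map)
        /* devm_memremap_pages() path: the range also enters the
         * kernel's linear mapping, so page_address() and
         * virt_to_phys() work on the resulting pages */
        return arch_add_memory(nid, start, size, altmap, false);

    /* HMM path: struct pages are created but the range is never
     * linearly mapped, so it must not be touched via page_address() */
    return add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
                     altmap, false);
}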

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 21:21         ` Alex Williamson
@ 2018-03-01 21:26           ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 21:26 UTC (permalink / raw)
  To: Alex Williamson, Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 02:21 PM, Alex Williamson wrote:
> This is still a pretty terrible solution though, your kernel provider
> needs to decide whether they favor device assignment or p2p, because we
> can't do both, unless there's a patch I haven't seen yet that allows
> boot time rather than compile time configuration.  There are absolutely
> supported device assignment cases of switches proving isolation between
> devices allowing the downstream EPs to be used independently.  I think
> this is a non-starter for distribution support without boot time or
> dynamic configuration.  I could imagine dynamic configuration through
> sysfs that might trigger a soft remove and rescan of the affected
> devices in order to rebuild the IOMMU group.  The hard part might be
> determining which points to allow that to guarantee correctness.  For
> instance, upstream switch ports don't actually support ACS, but they'd
> otherwise be an obvious concentration point to trigger a
> reconfiguration.  Thanks,

At this point, I don't expect this to be enabled in any distribution. 
The use case is for custom hardware intended to do P2P, which means its 
users can run a custom kernel with P2P enabled. The default for 
CONFIG_PCI_P2P is 'no' and I expect it to stay that way for a long time.

Another boot option which has to be set for this to work isn't something 
we'd like to see.

Logan
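
In code terms, anything built on top can simply evaporate when the
option is off, e.g. (function name hypothetical, using the option name
as written above):

static bool p2p_enabled(void)
{
    /* with the default 'n' this is constant-false, so the compiler
     * drops the P2P code paths entirely */
    return IS_ENABLED(CONFIG_PCI_P2P);
}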

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 21:21         ` Alex Williamson
@ 2018-03-01 21:32           ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 21:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

> your kernel provider needs to decide whether they favor device assignment or p2p

Thanks Alex! The hardware requirements for P2P (switch, high performance EPs) are such that we really only expect CONFIG_P2P_DMA to be enabled in specific instances and in those instances the users have made a decision to favor P2P over IOMMU isolation. Or they have set up their PCIe topology in a way that gives them IOMMU isolation where they want it and P2P where they want it.

Stephen

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 21:32           ` Stephen  Bates
@ 2018-03-01 21:35             ` Jerome Glisse
  -1 siblings, 0 replies; 504+ messages in thread
From: Jerome Glisse @ 2018-03-01 21:35 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 01, 2018 at 09:32:20PM +0000, Stephen  Bates wrote:
> > your kernel provider needs to decide whether they favor device assignment or p2p
> 
> Thanks Alex! The hardware requirements for P2P (switch, high performance EPs) are such that we really only expect CONFIG_P2P_DMA to be enabled in specific instances and in those instances the users have made a decision to favor P2P over IOMMU isolation. Or they have setup their PCIe topology in a way that gives them IOMMU isolation where they want it and P2P where they want it.
> 
> 

Note that there are use cases for P2P where IOMMU isolation matters and
the traffic through the root complex isn't seen as an issue. For
instance, for GPUs the idea is that you want to allow the RDMA device to
directly read or write GPU memory, to avoid having to migrate it to
system memory. This isn't so much for performance as for ease of use.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:10               ` Jerome Glisse
@ 2018-03-01 21:37                 ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 21:37 UTC (permalink / raw)
  To: Jerome Glisse, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran,
	Benjamin Herrenschmidt, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

> The intention of HMM is to be useful for all device memory that wish
> to have struct page for various reasons.

Hi Jerome and thanks for your input! Understood. We have looked at HMM in the past, and long term I definitely would like to consider how we can add P2P functionality to HMM for both DEVICE_PRIVATE and DEVICE_PUBLIC so we can pass addressable and non-addressable blocks of data between devices. However, that is well beyond the intentions of this series ;-).

Stephen

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 20:34         ` Benjamin Herrenschmidt
@ 2018-03-01 21:37           ` Dan Williams
  -1 siblings, 0 replies; 504+ messages in thread
From: Dan Williams @ 2018-03-01 21:37 UTC (permalink / raw)
  To: benh
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 1, 2018 at 12:34 PM, Benjamin Herrenschmidt
<benh@au1.ibm.com> wrote:
> On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
>> On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
>> <benh@au1.ibm.com> wrote:
>> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> > > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
>> > > > Hi Everyone,
>> > >
>> > >
>> > > So Oliver (CC) was having issues getting any of that to work for us.
>> > >
>> > > The problem is that according to him (I didn't double check the latest
>> > > patches) you effectively hotplug the PCIe memory into the system when
>> > > creating struct pages.
>> > >
>> > > This cannot possibly work for us. First we cannot map PCIe memory as
>> > > cacheable. (Note that doing so is a bad idea if you are behind a PLX
>> > > switch anyway since you'd have to manage cache coherency in SW).
>> >
>> > Note: I think the above means it won't work behind a switch on x86
>> > either, will it ?
>>
>> The devm_memremap_pages() infrastructure allows placing the memmap in
>> "System-RAM" even if the hotplugged range is in PCI space. So, even if
>> it is an issue on some configurations, it's just a simple adjustment
>> to where the memmap is placed.
>
> But what happens with that PCI memory? Is it effectively turned into
> normal memory (i.e., usable for normal allocations, potentially used to
> populate user pages etc...) or is it kept aside?
>
> Also on ppc64, the physical addresses of PCIe are so far apart
> that there's no way we can map them into the linear mapping at the
> normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
> page_address or virt_to_page cannot work as-is on PCIe addresses.

Ah ok, I'd need to look at the details. I had been assuming that
sparse-vmemmap could handle such a situation, but that could indeed be
a broken assumption.
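
For reference, a minimal sketch of the hotplug step under discussion,
written against the later dev_pagemap-based form of devm_memremap_pages();
the exact fields and the MEMORY_DEVICE_PCI_P2PDMA type vary by kernel
version, and the helper name is made up, so treat this as illustrative
rather than the series' actual code:

#include <linux/memremap.h>
#include <linux/pci.h>

/* Create struct pages for a PCI BAR. The memmap backing them is
 * allocated from ordinary System-RAM, which is the point above. */
static void *p2p_map_bar(struct pci_dev *pdev, int bar)
{
	struct dev_pagemap *pgmap;

	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
	if (!pgmap)
		return ERR_PTR(-ENOMEM);

	pgmap->range.start = pci_resource_start(pdev, bar);
	pgmap->range.end = pci_resource_end(pdev, bar);
	pgmap->nr_range = 1;
	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;

	return devm_memremap_pages(&pdev->dev, pgmap);
}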

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 21:35             ` Jerome Glisse
@ 2018-03-01 21:37               ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 21:37 UTC (permalink / raw)
  To: Jerome Glisse, Stephen Bates
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 02:35 PM, Jerome Glisse wrote:
> Note that there are use cases for P2P where IOMMU isolation matters and
> the traffic through the root complex isn't seen as an issue.

Well, we can worry about that once we have a solution to the problem of 
knowing whether a root complex supports P2P at all. I'm not sure how 
people are going to solve that one.

For the time being, this work is for cards behind a switch only.

Logan
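
As a sketch of what "clearing the ACS P2P flags" in this patch's subject
amounts to (not the series' actual code; the function name is made up),
PCI_ACS_RR and PCI_ACS_CR are the request and completion redirect bits
that would otherwise force peer TLPs up through the root complex:

#include <linux/pci.h>

/* Clear request/completion redirect on a downstream port so
 * peer-to-peer TLPs are routed directly through the switch. */
static void p2p_clear_acs_redirect(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return;

	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
}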

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:37           ` Dan Williams
@ 2018-03-01 21:45             ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 21:45 UTC (permalink / raw)
  To: Dan Williams, benh
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 02:37 PM, Dan Williams wrote:
> Ah ok, I'd need to look at the details. I had been assuming that
> sparse-vmemmap could handle such a situation, but that could indeed be
> a broken assumption.

It handles it fine for many situations. But when you try to map 
something that is at the end of the physical address space then the 
sparse vmemmap needs virtual address space that's the size of the 
physical address space divided by PAGE_SIZE which may be a little bit 
too large...

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:45             ` Logan Gunthorpe
@ 2018-03-01 21:57               ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 21:57 UTC (permalink / raw)
  To: Dan Williams, benh
  Cc: Jens Axboe, linux-block, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig



On 01/03/18 02:45 PM, Logan Gunthorpe wrote:
> It handles it fine for many situations. But when you try to map 
> something that is at the end of the physical address space then the 
> sparse vmemmap needs virtual address space that's the size of the 
> physical address space divided by PAGE_SIZE which may be a little bit 
> too large...

Though, considering this more, maybe this shouldn't be a problem...

Let's say you have 56 bits of address space. That's 64PB. If you need 
a sparse vmemmap for the entire space it will take 16TB, which leaves you 
with 63.98PB of address space left. (Similar calculations for other 
numbers of address bits.)
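
Checking the arithmetic: the vmemmap cost depends on both PAGE_SIZE and
sizeof(struct page), so the exact figure varies; with 4 KiB pages and a
64-byte struct page (both assumptions here), 2^56 bytes of space needs
2^44 * 64 = 2^50 bytes of vmemmap, about 1 PiB - still a small fraction
of the whole. A quick userspace check of those numbers:

#include <stdio.h>

int main(void)
{
	unsigned long long phys = 1ULL << 56;    /* 64 PiB of space  */
	unsigned long long pages = phys >> 12;   /* 4 KiB pages      */
	unsigned long long vmemmap = pages * 64; /* 64 B struct page */

	printf("vmemmap: %llu TiB\n", vmemmap >> 40); /* prints 1024 */
	return 0;
}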

So I'm not sure what the problem with this is.

We still have to ensure all the arches map the memory with the right 
cache bits but that should be relatively easy to solve.

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 20:53             ` Jason Gunthorpe
@ 2018-03-01 22:06               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 22:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jérôme Glisse, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Thu, 2018-03-01 at 13:53 -0700, Jason Gunthorpe wrote:
> On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
> > Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
> > on ppc64 (maybe via an arch hook as it might depend on the processor
> > family). Server powerpc cannot do cacheable accesses on IO memory
> > (unless it's special OpenCAPI or NVLink, but not on PCIe).
> 
> I think you are right on this - even on x86 we must not create
> cacheable mappings of PCI BARs - there is no way that works the way
> anyone would expect.
> 
> I think this series doesn't have a problem here only because it never
> touches the BAR pages with the CPU.
> 
> BAR memory should be mapped into the CPU as WC at best on all arches..

Could be that x86 has the smarts to do the right thing, still trying to
untangle the code :-)

Cheers,
Ben.
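
For reference, a one-line sketch of the WC mapping Jason describes above;
ioremap_wc() is the stock way to map a BAR write-combined rather than
cacheable (the wrapper name here is made up):

#include <linux/io.h>
#include <linux/pci.h>

/* Map a whole BAR write-combined rather than cacheable. */
static void __iomem *map_bar_wc(struct pci_dev *pdev, int bar)
{
	return ioremap_wc(pci_resource_start(pdev, bar),
			  pci_resource_len(pdev, bar));
}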


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 22:06               ` Benjamin Herrenschmidt
@ 2018-03-01 22:31                 ` Linus Torvalds
  -1 siblings, 0 replies; 504+ messages in thread
From: Linus Torvalds @ 2018-03-01 22:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt <benh@au1.ibm.com> wrote:
>
> Could be that x86 has the smarts to do the right thing, still trying to
> untangle the code :-)

Afaik, x86 will not cache PCI unless the system is misconfigured, and
even then it's more likely to just raise a machine check exception
than cache things.

The last-level cache is going to do fills and spills directly to the
memory controller, not to the PCIe side of things.

(I guess you *can* do things differently, and I wouldn't be surprised
if some people inside Intel did try to do things differently with
trying nvram over PCIe, but in general I think the above is true)

You won't find it in the kernel code either. It's in hardware with
firmware configuration of what addresses are mapped to the memory
controllers (and _how_ they are mapped) and which are not.

You _might_ find it in the BIOS, assuming you understood the tables
and had the BIOS writer's guide to unravel the magic registers.

But you might not even find it there. Some of the memory unit timing
programming is done very early, and by code that Intel doesn't even
release to the BIOS writers except as a magic encrypted blob, afaik.
Some of the magic might even be in microcode.

The page table settings for cacheability are more like a hint, and
only _part_ of the whole picture. The memory type range registers are
another part. And magic low-level uarch, northbridge and memory unit
specific magic is yet another part.

So you can disable caching for memory, but I'm pretty sure you can't
enable caching for PCIe at least in the common case. At best you can
affect how the store buffer works for PCIe.

              Linus

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 22:31                 ` Linus Torvalds
@ 2018-03-01 22:34                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran, Alex Williamson,
	linux-nvdimm, linux-rdma, linux-pci, Linux Kernel Mailing List,
	linux-nvme, linux-block, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt <benh@au1.ibm.com> wrote:
> > 
> > Could be that x86 has the smarts to do the right thing, still trying to
> > untangle the code :-)
> 
> Afaik, x86 will not cache PCI unless the system is misconfigured, and
> even then it's more likely to just raise a machine check exception
> than cache things.
> 
> The last-level cache is going to do fills and spills directly to the
> memory controller, not to the PCIe side of things.
> 
> (I guess you *can* do things differently, and I wouldn't be surprised
> if some people inside Intel did try to do things differently with
> trying nvram over PCIe, but in general I think the above is true)
> 
> You won't find it in the kernel code either. It's in hardware with
> firmware configuration of what addresses are mapped to the memory
> controllers (and _how_ they are mapped) and which are not.

Ah, thanks! That explains it. We can fix that on ppc64 in our linear
mapping code by checking the address vs. memblocks to choose the right
page table attributes.

So the main problem on our side is to figure out the problem of too big
PFNs. I need to look at this with Aneesh, we might be able to make
things fit with a bit of wrangling.
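
A hedged sketch of that linear-map check - memblock_is_memory() is a
real helper, but the surrounding function is illustrative only:

#include <linux/memblock.h>
#include <asm/pgtable.h>

/* Pick linear-mapping page protections by address: cacheable for
 * real memblock RAM, non-cached for anything else (e.g. P2P BARs). */
static pgprot_t linear_map_prot(phys_addr_t pa)
{
	if (memblock_is_memory(pa))
		return PAGE_KERNEL;
	return pgprot_noncached(PAGE_KERNEL);
}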

> You _might_ find it in the BIOS, assuming you understood the tables
> and had the BIOS writer's guide to unravel the magic registers.
> 
> But you might not even find it there. Some of the memory unit timing
> programming is done very early, and by code that Intel doesn't even
> release to the BIOS writers except as a magic encrypted blob, afaik.
> Some of the magic might even be in microcode.
> 
> The page table settings for cacheability are more like a hint, and
> only _part_ of the whole picture. The memory type range registers are
> another part. And magic low-level uarch, northbridge and memory unit
> specific magic is yet another part.
> 
> So you can disable caching for memory, but I'm pretty sure you can't
> enable caching for PCIe at least in the common case. At best you can
> affect how the store buffer works for PCIe.

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 19:27             ` Logan Gunthorpe
@ 2018-03-01 22:45               ` Jason Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Jason Gunthorpe @ 2018-03-01 22:45 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Sagi Grimberg, linux-kernel, linux-pci, linux-nvme, linux-rdma,
	linux-nvdimm, linux-block, Stephen Bates, Christoph Hellwig,
	Jens Axboe, Keith Busch, Bjorn Helgaas, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson, Steve Wise

On Thu, Mar 01, 2018 at 12:27:03PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 11:42 AM, Jason Gunthorpe wrote:
> >On Thu, Mar 01, 2018 at 08:35:55PM +0200, Sagi Grimberg wrote:
> >This is also why I don't entirely understand why this series has a
> >generic allocator for p2p mem, it makes little sense to me.
> 
> >Why wouldn't the nvme driver just claim the entire CMB of its local
> >device for its own use? Why involve p2p core code in this?
> 
> We'd prefer to have a generic way to get p2pmem instead of restricting
> ourselves to only using CMBs. We did work in the past where the P2P memory
> was part of an IB adapter and not the NVMe card. So this won't work if it's
> an NVMe-only interface.

It just seems like it is making it too complicated.

In 99% of cases locality is important and you don't want to accidentally
use a buffer in a 3rd device that has no relation to the two doing the
transfer just because it happened to be returned by the generic
allocator.

I can appreciate you might have some special use case for that, but it
absolutely should require special configuration and not just magically
happen.

You bring up IB adapter memory - if we put that into the allocator
then what is to stop the NVMe driver just using it instead of the CMB
buffer? That would be totally wrong in almost all cases.

Seems like a very subtle and hard to debug performance trap to leave
for the users, and pretty much the only reason to use P2P is
performance... So why have such a dangerous interface?

Jason

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 22:45               ` Jason Gunthorpe
  (?)
  (?)
@ 2018-03-01 22:56                 ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 22:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Steve Wise, linux-kernel, linux-nvme,
	linux-block, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 03:45 PM, Jason Gunthorpe wrote:
> I can appreciate you might have some special use case for that, but it
> absolutely should require special configuration and not just magically
> happen.

Well, if a driver doesn't want someone doing P2P transfers with the memory, 
it shouldn't publish it to be used for exactly that purpose.

> You bring up IB adaptor memory - if we put that into the allocator
> then what is to stop the NMVe driver just using it instead of the CMB
> buffer? That would be totally wrong in almost all cases.

If you mean for SQEs in the NVMe driver, look at the code, it 
specifically allocates it from its own device. If you mean the 
nvmet-rdma then it's using that memory exactly as it was meant to. 
Again, if the IB driver doesn't want someone to use that memory for P2P 
transfers it shouldn't publish it as such.

> Seems like a very subtle and hard to debug performance trap to leave
> for the users, and pretty much the only reason to use P2P is
> performance... So why have such a dangerous interface?

It's not at all dangerous, the code specifically only uses P2P memory 
that's local enough. And the majority of the code is there to make sure 
it will all work in all cases.

Honestly, though, I'd love to go back to the case where the user selects 
which p2pmem device to use, but that was very unpopular last year. It 
would simplify a bunch of things though.

Also, no, the reason to use P2P is not performance. Only with very
specific hardware can you get a performance bump, and it isn't all that
significant. The reason to use P2P is so you can design performant
systems with small CPUs, less or slower DRAM, low lane counts to the
CPU, etc.

Logan
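
[Editor's note: for readers new to the interface under discussion, here
is a minimal sketch of the publish/allocate flow proposed in this
series. The function names (pci_p2pdma_add_resource(),
pci_p2pmem_publish(), pci_alloc_p2pmem()) follow the patchset; the BAR
number, the helper names, and the error handling are illustrative
assumptions, not code from the series.]

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/* Provider side: a hypothetical driver donates BAR 4 to the p2pdma
 * allocator and publishes it for use by client drivers. */
static int example_publish_p2pmem(struct pci_dev *pdev)
{
	int rc;

	rc = pci_p2pdma_add_resource(pdev, 4,
				     pci_resource_len(pdev, 4), 0);
	if (rc)
		return rc;

	pci_p2pmem_publish(pdev, true);
	return 0;
}

/* Client side: allocate a buffer from whichever p2pmem device was
 * chosen; returns NULL if no published memory is available. */
static void *example_client_alloc(struct pci_dev *p2p_dev, size_t len)
{
	return pci_alloc_p2pmem(p2p_dev, len);
}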

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:57               ` Logan Gunthorpe
  (?)
  (?)
@ 2018-03-01 23:00                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 23:00 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams
  Cc: Jens Axboe, linux-block, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Thu, 2018-03-01 at 14:57 -0700, Logan Gunthorpe wrote:
> 
> On 01/03/18 02:45 PM, Logan Gunthorpe wrote:
> > It handles it fine for many situations. But when you try to map
> > something that is at the end of the physical address space, the
> > sparse vmemmap needs virtual address space that's the size of the
> > physical address space divided by PAGE_SIZE, which may be a little
> > bit too large...
> 
> Though, considering this more, maybe this shouldn't be a problem...
> 
> Let's say you have 56 bits of address space.

We use only 52 in practice but yes.

>  That's 64PB. If you need a sparse vmemmap for the entire space, it
> will take 16TB, which leaves you with 63.98PB of address space.
> (Similar calculations apply for other numbers of address bits.)

We only have 52 bits of virtual space for the kernel with the radix
MMU.

> So I'm not sure what the problem with this is.
> 
> We still have to ensure all the arches map the memory with the right 
> cache bits but that should be relatively easy to solve.
> 
> Logan
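
[Editor's note: checking the arithmetic above under two assumptions not
stated in the mail, namely 4 KiB pages and a 64-byte struct page:

\[
  \frac{2^{56}\,\mathrm{B}}{2^{12}\,\mathrm{B/page}} = 2^{44}\ \text{pages}
  \approx 16\,\mathrm{Ti\ entries},
  \qquad
  2^{44} \times 64\,\mathrm{B} = 2^{50}\,\mathrm{B} = 1\,\mathrm{PiB}.
\]

The "16T" figure matches the page count; the virtual space consumed is
about 1 PiB under these assumptions. That is still a small fraction of
a 64 PiB physical space, but it is already a quarter of a 52-bit
(4 PiB) virtual space, which may be why the 52-bit limit mentioned here
is tight.]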


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
  2018-03-01 18:55         ` Logan Gunthorpe
  (?)
  (?)
@ 2018-03-01 23:00           ` Bjorn Helgaas
  -1 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 23:00 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 01, 2018 at 11:55:51AM -0700, Logan Gunthorpe wrote:
> Hi Bjorn,
> 
> Thanks for the review. I'll correct all the nits for the next version.
> 
> On 01/03/18 10:37 AM, Bjorn Helgaas wrote:
> > On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
> > > Some PCI devices may have memory mapped in a BAR space that's
> > > intended for use in Peer-to-Peer transactions. In order to enable
> > > such transactions the memory must be registered with ZONE_DEVICE pages
> > > so it can be used by DMA interfaces in existing drivers.
> 
> > Is there anything about this memory that makes it specifically
> > intended for peer-to-peer transactions?  I assume the device can't
> > really tell whether a transaction is from a CPU or a peer.
> 
> No, there's nothing special about the memory, and it can still be accessed by
> the CPU. This is just the intended purpose. You could use this PCI memory as
> regular DMA buffers or regular memory, but I'm not sure why you would; it
> would probably be pretty bad performance-wise.
> 
> 
> > BTW, maybe there could be some kind of guide for device driver writers
> > in Documentation/PCI/?
> Makes sense; we can look at writing something for the next iteration.
> 
> > I think it would be clearer and sufficient to simply say that we have
> > no way to know whether peer-to-peer routing between PCIe Root Ports is
> > supported (PCIe r4.0, sec 1.3.1).
> 
> Fair enough.
> 
> > The fact that you use the PCIe term "switch" suggests that a PCIe
> > Switch is required, but isn't it sufficient for the peers to be below
> > the same "PCI bridge", which would include PCIe Root Ports, PCIe
> > Switch Downstream Ports, and conventional PCI bridges?
> > The comments at get_upstream_bridge_port() suggest that this isn't
> > enough, and the peers actually do have to be below the same PCIe
> > Switch, but I don't know why.
> 
> I do mean Switch, as we need to keep the traffic off the root complex
> seeing as, per the above, we don't know whether it actually supports P2P
> (while we can be certain any PCIe switch does). So we specifically want to
> exclude PCIe Root Ports. I'm not sure about the support of PCI bridges, but
> I can't imagine anyone wanting to do P2P around them, so I'd rather be safe
> than sorry and exclude them.

I don't think this is correct.  A Root Port defines a hierarchy domain
(I'm looking at PCIe r4.0, sec 1.3.1).  The capability to route
peer-to-peer transactions *between* hierarchy domains is optional.  I
think this means a Root Complex is not required to route transactions
from one Root Port to another Root Port.

This doesn't say anything about routing between two different devices
below a Root Port.  Those would be in the same hierarchy domain and
should follow the conventional PCI routing rules.  Of course, since a
Root Port has one link that leads to one device, they would probably
be different functions of a single multi-function device, so I don't
know how practical it would be to test this.

> > This whole thing is confusing to me.  Why do you want to reject peers
> > directly connected to the same root port?  Why do you require the same
> > Switch Upstream Port?  You don't exclude conventional PCI, but it
> > looks like you would require peers to share *two* upstream PCI-to-PCI
> > bridges?  I would think a single shared upstream bridge (conventional,
> > PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?
> 
> Hmm, yes, this may just be laziness on my part. Finding the shared upstream
> bridge is a bit trickier than just showing that the devices are on the same
> switch. So, as coded, a fabric of switches with peers on different legs of
> the fabric is not supported. But yes, maybe they just need to be two
> devices with a single shared upstream bridge that is not the root port.
> Again, we need to reject the root port because we can't know whether the
> root complex can support P2P traffic.

This sounds like the same issue as above, so we just need to resolve
that.

> > Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
> > sort of weird that callers supply a non-PCI device and then we look up
> > a PCI device here.  I assume you have some reason for this; if you
> > added a writeup in Documentation/PCI, that would be a good place to
> > elaborate on that, maybe with a one-line clue here.
> 
> Well, yes, but this is much more convenient for callers, which don't need to
> care whether the device they are attempting to add (which in the NVMe target
> case could be a random block device) is a PCI device or not, especially
> since find_parent_pci_dev() is non-trivial.

OK.  I accept that it might be convenient, but I still think it leads
to a weird API.  Maybe that's OK; I don't know enough about the
scenario in the caller to do anything more than say "hmm, strange".

> > > +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
> > > +{
> > > +	void *ret;
> > > +
> > > +	if (unlikely(!pdev->p2pdma))
> > 
> > Is this a hot path?  I'm not sure it's worth cluttering
> > non-performance paths with likely/unlikely.
> 
> I'd say it can be pretty hot, given that we need to allocate and free buffers
> at multiple GB/s for the NVMe target case. I don't exactly have benchmarks
> or anything to show this, though...

OK, I'll take your word for that.  I'm fine with this as long as it is
performance-sensitive.

Bjorn
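
[Editor's note: a minimal sketch of how one *might* test Bjorn's
"single shared upstream bridge that is not a Root Port" condition.
pci_upstream_bridge(), pci_is_pcie() and pci_pcie_type() are real
kernel helpers; the walk itself is an illustrative simplification (no
locking, no domain checks) and is not the patchset's
get_upstream_bridge_port() logic.]

#include <linux/pci.h>

static bool sketch_share_non_root_bridge(struct pci_dev *a,
					 struct pci_dev *b)
{
	struct pci_dev *up_a, *up_b;

	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a)) {
		/* Stop before crossing a Root Port: P2P through the
		 * Root Complex is not known to be supported. */
		if (pci_is_pcie(up_a) &&
		    pci_pcie_type(up_a) == PCI_EXP_TYPE_ROOT_PORT)
			break;

		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b))
			if (up_b == up_a)
				return true; /* shared non-root bridge */
	}

	return false;
}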

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 22:45               ` Jason Gunthorpe
                                   ` (2 preceding siblings ...)
  (?)
@ 2018-03-01 23:00                 ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 23:00 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Steve Wise, linux-kernel, linux-nvme,
	linux-block, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

>> We'd prefer to have a generic way to get p2pmem instead of restricting
>> ourselves to only using CMBs. We did work in the past where the P2P memory
>> was part of an IB adapter and not the NVMe card. So this won't work if it's
>> an NVMe-only interface.

> It just seems like it is making it too complicated.

I disagree. Having a common allocator (instead of some separate allocator per driver) makes things simpler.

> Seems like a very subtle and hard to debug performance trap to leave
> for the users, and pretty much the only reason to use P2P is
> performance... So why have such a dangerous interface?

P2P is about offloading the memory and PCI subsystems of the host CPU, and that is achieved no matter which p2p_dev is used.

Stephen

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
  2018-03-01 23:00           ` Bjorn Helgaas
  (?)
  (?)
@ 2018-03-01 23:06             ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 23:06 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



> I don't think this is correct.  A Root Port defines a hierarchy domain
> (I'm looking at PCIe r4.0, sec 1.3.1).  The capability to route
> peer-to-peer transactions *between* hierarchy domains is optional.  I
> think this means a Root Complex is not required to route transactions
> from one Root Port to another Root Port.
> 
> This doesn't say anything about routing between two different devices
> below a Root Port.  Those would be in the same hierarchy domain and
> should follow the conventional PCI routing rules.  Of course, since a
> Root Port has one link that leads to one device, they would probably
> be different functions of a single multi-function device, so I don't
> know how practical it would be to test this.

Yes, given that there's only one device below a root port, it will either
be a switch or a multi-function device. In the multi-function device
case, I'm pretty sure the spec disallows routing-to-self, so doing a P2P
transaction in that sense isn't going to work unless the device
specifically supports it and intercepts the traffic before it gets to
the port.

But if we're talking about multi-function devices, it should be able to
do everything within its own driver, so it's not exactly peer-to-peer.
Still, if someone has such hardware, I think it's up to them to add
support for this odd situation.

Logan



^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
  2018-03-01 23:06             ` Logan Gunthorpe
                                 ` (2 preceding siblings ...)
  (?)
@ 2018-03-01 23:14               ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 23:14 UTC (permalink / raw)
  To: Logan Gunthorpe, Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

> I'm pretty sure the spec disallows routing-to-self, so doing a P2P
> transaction in that sense isn't going to work unless the device
> specifically supports it and intercepts the traffic before it gets to
> the port.

This is correct. Unless the device intercepts the TLP before it hits the root port, this would be considered a "route to self" violation and an error event would occur. The same holds for the downstream port on a PCI switch (unless route-to-self violations are disabled, which violates the spec but which I have seen done in certain applications).
    
Stephen

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 18:54       ` Stephen  Bates
@ 2018-03-01 23:15         ` Bjorn Helgaas
  -1 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 23:15 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 01, 2018 at 06:54:01PM +0000, Stephen  Bates wrote:
> Thanks for the detailed review Bjorn!
> 
> >> +	  Enabling this option will also disable ACS on all ports behind
> >> +	  any PCIe switch. This effectively puts all devices behind any
> >> +	  switch into the same IOMMU group.
> 
> >  Does this really mean "all devices behind the same Root Port"?
> 
> Not necessarily. You might have a cascade of switches (i.e. switches
> below a switch) to achieve a very large fan-out (in an NVMe SSD
> array for example) and we will only disable ACS on the ports below
> the relevant switch.

The question is what the relevant switch is.  We call pci_enable_acs()
on every PCI device, including Root Ports.  It looks like this relies
on get_upstream_bridge_port() to filter out some things.  I don't
think get_upstream_bridge_port() is doing the right thing, so we need
to sort that out first, I guess.

> > What does this mean in terms of device security?  I assume it means,
> > at least, that individual devices can't be assigned to separate VMs.
> 
> This was discussed during v1 [1]. Disabling ACS on all downstream
> ports of the switch means that all the EPs below it have to be part of
> the same IOMMU grouping. However, it was also agreed that as long as
> the ACS disable occurred at boot time (which it does in v2) then the
> virtualization layer will be aware of it and will perform the IOMMU
> group formation correctly.
>     
> > I don't mind admitting that this patch makes me pretty nervous, and I
> > don't have a clear idea of what the implications of this are, or how
> > to communicate those to end users.  "The same IOMMU group" is a pretty
> > abstract idea.
>     
> Alex gave a good overview of the implications in [1].
> 
> [1] https://marc.info/?l=linux-pci&m=151512320031739&w=2

This might be a good start, but whatever the implications are, they
need to be distilled and made clear directly in the changelog and the
Kconfig help text.
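
For reference, "disabling ACS" on a downstream port boils down to
clearing the redirect/egress bits in its ACS capability. A minimal
sketch of that operation (assuming the standard ACS extended
capability; this is not the code from the patch):

static void sketch_disable_acs_p2p(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return;

	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
	/* Clear Request Redirect, Completion Redirect and Egress
	 * Control so P2P TLPs are routed directly by the switch
	 * instead of being redirected upstream to the root complex. */
	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC);
	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
}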

Bjorn

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 23:00                 ` Benjamin Herrenschmidt
@ 2018-03-01 23:19                   ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 23:19 UTC (permalink / raw)
  To: benh, Dan Williams
  Cc: Jens Axboe, linux-block, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig



On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> We use only 52 in practice but yes.
> 
>>   That's 64PB. If you need
>> a sparse vmemmap for the entire space it will take 16TB which leaves you
>> with 63.98PB of address space left. (Similar calculations for other
>> numbers of address bits.)
> 
> We only have 52 bits of virtual space for the kernel with the radix
> MMU.

Ok, assuming you only have 52 bits of physical address space: the sparse 
vmemmap takes 1TB and you're left with 3.9PB of address space for other 
things. So, again, why doesn't that work? Is my math wrong?

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:00                 ` Stephen  Bates
@ 2018-03-01 23:20                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Jason Gunthorpe @ 2018-03-01 23:20 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Logan Gunthorpe, Sagi Grimberg, linux-kernel, linux-pci,
	linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Keith Busch, Bjorn Helgaas,
	Max Gurtovoy, Dan Williams, Jérôme Glisse,
	Benjamin Herrenschmidt, Alex Williamson, Steve Wise

On Thu, Mar 01, 2018 at 11:00:51PM +0000, Stephen  Bates wrote:

> > Seems like a very subtle and hard to debug performance trap to leave
> > for the users, and pretty much the only reason to use P2P is
> > performance... So why have such a dangerous interface?
> 
> P2P is about offloading the memory and PCI subsystem of the host CPU
> and this is achieved no matter which p2p_dev is used.

No, locality matters. If you have a bunch of NICs and a bunch of drives
and the allocator chooses to put all P2P memory on a single drive, your
performance will suck horribly even if all the traffic is offloaded.

Performance will suck if you have speed differences between the PCIe
devices. E.g. a bunch of Gen 3 x8 NVMe cards paired with a Gen 4 x16 NIC
will not reach full performance unless the p2p buffers are properly
balanced between all cards.

This is why I think putting things in one big pool and pretending that
any P2P mem is interchangeable with any other makes zero sense to me;
it is not how the system actually works. Proper locality matters a
lot.
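
(For a rough sense of scale, using the standard per-lane PCIe data
rates of ~985 MB/s for Gen 3 and ~1969 MB/s for Gen 4: a Gen 3 x8
endpoint tops out near 7.9 GB/s while a Gen 4 x16 NIC can move about
31.5 GB/s, so steering all buffers to one slow endpoint caps the
transfer at roughly a quarter of the NIC's line rate.)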

Jason

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 23:19                   ` Logan Gunthorpe
@ 2018-03-01 23:25                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 23:25 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams
  Cc: Jens Axboe, linux-block, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote:
> 
> On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> > We use only 52 in practice but yes.
> > 
> > >   That's 64PB. If you use need
> > > a sparse vmemmap for the entire space it will take 16TB which leaves you
> > > with 63.98PB of address space left. (Similar calculations for other
> > > numbers of address bits.)
> > 
> > We only have 52 bits of virtual space for the kernel with the radix
> > MMU.
> 
> Ok, assuming you only have 52 bits of physical address space: the sparse 
> vmemmap takes 1TB and you're left with 3.9PB of address space for other 
> things. So, again, why doesn't that work? Is my math wrong?

The big problem is not the vmemmap, it's the linear mapping.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 23:19                   ` Logan Gunthorpe
@ 2018-03-01 23:26                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-01 23:26 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams
  Cc: Jens Axboe, linux-block, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote:

(Switching back to my non-IBM address ...)

> On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> > We use only 52 in practice but yes.
> > 
> > >   That's 64PB. If you use need
> > > a sparse vmemmap for the entire space it will take 16TB which leaves you
> > > with 63.98PB of address space left. (Similar calculations for other
> > > numbers of address bits.)
> > 
> > We only have 52 bits of virtual space for the kernel with the radix
> > MMU.
> 
> Ok, assuming you only have 52 bits of physical address space: the sparse 
> vmemmap takes 1TB and you're left with 3.9PB of address space for other 
> things. So, again, why doesn't that work? Is my math wrong?

The big problem is not the vmemmap, it's the linear mapping.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:20                   ` Jason Gunthorpe
@ 2018-03-01 23:29                     ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 23:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Stephen Bates
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Steve Wise, linux-kernel, linux-nvme,
	linux-block, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 04:20 PM, Jason Gunthorpe wrote:
> On Thu, Mar 01, 2018 at 11:00:51PM +0000, Stephen  Bates wrote:

> No, locality matters. If you have a bunch of NICs and a bunch of drives
> and the allocator chooses to put all P2P memory on a single drive, your
> performance will suck horribly even if all the traffic is offloaded.
> 
> Performance will suck if you have speed differences between the PCIe
> devices. E.g. a bunch of Gen 3 x8 NVMe cards paired with a Gen 4 x16 NIC
> will not reach full performance unless the p2p buffers are properly
> balanced between all cards.

This would be solved better by choosing the closest devices in the
hierarchy in the p2pmem_find function (i.e. preferring devices involved
in the transaction). Also, based on the current code, it's a bit of a
moot point, seeing as it requires all devices to be on the same switch. So
unless you have a giant switch hidden somewhere, you're not going to get
too many NIC/NVMe pairs.

In any case, we are looking at changing both of those things so that distance
in the topology is preferred and the ability to span multiple switches is
allowed. We're also discussing changing it to "user picks", which
simplifies all of this.
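
A sketch of what "preferring devices involved in the transaction"
could look like; pci_p2pdma_distance() is a hypothetical helper (an
assumed name, not an API in this series) returning the number of
bridge hops between a provider and a client, or a negative value if
the pair cannot do P2P at all:

static struct pci_dev *sketch_pick_p2p_dev(struct pci_dev **providers,
					   int n, struct pci_dev *client)
{
	struct pci_dev *best = NULL;
	int i, d, best_d = INT_MAX;

	for (i = 0; i < n; i++) {
		d = pci_p2pdma_distance(providers[i], client);
		if (d >= 0 && d < best_d) {
			best_d = d;
			best = providers[i];
		}
	}

	/* NULL here means no provider can reach the client. */
	return best;
}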

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:20                   ` Jason Gunthorpe
@ 2018-03-01 23:32                     ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 23:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Steve Wise, linux-kernel, linux-nvme,
	linux-block, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

>   No, locality matters. If you have a bunch of NICs and a bunch of drives
>   and the allocator chooses to put all P2P memory on a single drive, your
>   performance will suck horribly even if all the traffic is offloaded.
    
Sagi brought this up earlier in his comments about the _find_ function. We are planning to do something about this in the next version. This might be a randomization or a "user-pick", and may include a rule around using the p2p_dev on the EP if that EP is part of the transaction.

Stephen


    

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
  2018-03-01 23:14               ` Stephen  Bates
@ 2018-03-01 23:45                 ` Bjorn Helgaas
  -1 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-01 23:45 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 01, 2018 at 11:14:46PM +0000, Stephen  Bates wrote:
> > I'm pretty sure the spec disallows routing-to-self so doing a P2P 
> > transaction in that sense isn't going to work unless the device 
> > specifically supports it and intercepts the traffic before it gets to 
> > the port.
> 
> This is correct. Unless the device intercepts the TLP before it hits
> the root-port then this would be considered a "route to self"
> violation and an error event would occur. The same holds for the
> downstream port on a PCI switch (unless route-to-self violations are
> disabled which violates the spec but which I have seen done in
> certain applications).

I agree that a function doing DMA to a sibling within the same
multi-function device would probably not generate a TLP for it (I
would be curious to read about this in the spec if you have a
pointer).

More fundamentally, is there some multi-function-specific restriction
on peer-to-peer DMA?  In conventional PCI, single-function devices on
the same bus can DMA to each other.  The transactions will appear on
the bus, but the upstream bridge will ignore them because the address
is inside the bridge's memory window.  As far as I know, the same
should happen on PCIe.

I don't know what happens with functions of a multi-function device,
either in conventional PCI or PCIe.  I don't remember a restriction on
whether they can DMA to each other, but maybe there is.
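
For illustration, sibling functions of one multi-function device can
be recognized from the standard devfn encoding (a sketch, not code
from the patch set):

static bool same_multifunction_device(struct pci_dev *a, struct pci_dev *b)
{
	/* Same bus and device number but different function number:
	 * a DMA between them may be claimed internally by the device
	 * and never appear as a TLP on the link at all. */
	return a->bus == b->bus &&
	       PCI_SLOT(a->devfn) == PCI_SLOT(b->devfn) &&
	       PCI_FUNC(a->devfn) != PCI_FUNC(b->devfn);
}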

Bjorn

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:00                 ` Stephen  Bates
@ 2018-03-01 23:49                   ` Keith Busch
  -1 siblings, 0 replies; 504+ messages in thread
From: Keith Busch @ 2018-03-01 23:49 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Alex Williamson, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, linux-block,
	Jason Gunthorpe, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 01, 2018 at 11:00:51PM +0000, Stephen  Bates wrote:
> 
> P2P is about offloading the memory and PCI subsystem of the host CPU
> and this is achieved no matter which p2p_dev is used.

Even within a device, memory attributes for its various regions may not be
the same. There's a meaningful difference between writing to an NVMe CMB
vs PMR, and a single device could have both. We don't want to lump these
all together without knowing which region you're allocating from, right?

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:49                   ` Keith Busch
  (?)
  (?)
@ 2018-03-01 23:52                     ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 23:52 UTC (permalink / raw)
  To: Keith Busch, Stephen Bates
  Cc: Jens Axboe, Alex Williamson, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, linux-block,
	Jason Gunthorpe, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 04:49 PM, Keith Busch wrote:
> On Thu, Mar 01, 2018 at 11:00:51PM +0000, Stephen  Bates wrote:
>>
>> P2P is about offloading the memory and PCI subsystem of the host CPU
>> and this is achieved no matter which p2p_dev is used.
> 
> Even within a device, memory attributes for its various regions may not be
> the same. There's a meaningful difference between writing to an NVMe CMB
> vs PMR, and a single device could have both. We don't want to lump these
> all together without knowing which region you're allocating from, right?

Yes, PMRs are a whole other beast. I have no idea what anyone is going 
to actually do with those yet. But no one is proposing to publish them 
as p2pmem with this code.

Logan


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:49                   ` Keith Busch
                                       ` (2 preceding siblings ...)
  (?)
@ 2018-03-01 23:53                     ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 23:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Alex Williamson, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, linux-block,
	Jason Gunthorpe, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

> There's a meaningful difference between writing to an NVMe CMB vs PMR

When the PMR spec becomes public we can discuss how best to integrate it into the P2P framework (if at all) ;-).

Stephen


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 23:26                     ` Benjamin Herrenschmidt
  (?)
  (?)
@ 2018-03-01 23:54                       ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 23:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Dan Williams
  Cc: Jens Axboe, linux-block, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig



On 01/03/18 04:26 PM, Benjamin Herrenschmidt wrote:
> The big problem is not the vmemmap, it's the linear mapping.

Ah, yes, ok.

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:49                   ` Keith Busch
                                       ` (2 preceding siblings ...)
  (?)
@ 2018-03-01 23:57                     ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-01 23:57 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Alex Williamson, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, linux-block,
	Jason Gunthorpe, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

> We don't want to lump these all together without knowing which region you're allocating from, right?

In all seriousness I do agree with you on this, Keith, in the long term. We would consider adding property flags for the memory as it is added to the p2p core and then the allocator could evolve to intelligently dish it out. Attributes like endurance, latency and special write commit requirements could all become attributes in time. Perhaps one more reason for a central entity for p2p memory allocation so this code does not end up having to go into many different drivers?
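
A minimal sketch of what such property flags might look like when memory
is added to the p2p core (all names here are hypothetical, not part of
the posted series):

/* Hypothetical attributes a publisher could declare for its memory. */
enum p2pmem_attrs {
	P2PMEM_ATTR_LOW_LATENCY		= 1 << 0,
	P2PMEM_ATTR_HIGH_ENDURANCE	= 1 << 1,
	P2PMEM_ATTR_WRITE_COMMIT	= 1 << 2, /* special commit rules */
};

/* Hypothetical variant of the registration call carrying the flags,
 * so a central allocator could dish the memory out intelligently. */
int pci_p2pdma_add_resource_attrs(struct pci_dev *pdev, int bar,
				  size_t size, u64 offset,
				  unsigned long attrs);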

Stephen

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 23:15         ` Bjorn Helgaas
  (?)
  (?)
@ 2018-03-01 23:59           ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-01 23:59 UTC (permalink / raw)
  To: Bjorn Helgaas, Stephen Bates
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 04:15 PM, Bjorn Helgaas wrote:
> The question is what the relevant switch is.  We call pci_enable_acs()
> on every PCI device, including Root Ports.  It looks like this relies
> on get_upstream_bridge_port() to filter out some things.  I don't
> think get_upstream_bridge_port() is doing the right thing, so we need
> to sort that out first, I guess.

Yes, well, if a port has an upstream bridge port, and that port in turn
has an upstream bridge port, we are almost certainly looking at a port
in a switch. If you have a better suggestion for disabling the ACS bits
only on downstream switch ports, I'd be happy to consider it.
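
For illustration, a minimal sketch of the heuristic being described,
using the existing pci_upstream_bridge() helper (the function name and
placement here are hypothetical, not part of the posted series):

/*
 * Sketch: treat a device as sitting below a switch downstream port
 * when it has an upstream bridge that itself has an upstream bridge,
 * i.e. it is not directly below the root complex.
 */
static bool p2pdma_behind_switch(struct pci_dev *pdev)
{
	struct pci_dev *up = pci_upstream_bridge(pdev);

	return up && pci_upstream_bridge(up);
}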


> This might be a good start, but whatever the implications are, they
> need to be distilled and made clear directly in the changelog and the
> Kconfig help text.

Well, I tried. I'll expand on it for v3.

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:57                     ` Stephen  Bates
  (?)
  (?)
@ 2018-03-02  0:03                       ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-02  0:03 UTC (permalink / raw)
  To: Stephen Bates, Keith Busch
  Cc: Jens Axboe, Alex Williamson, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, linux-block,
	Jason Gunthorpe, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 04:57 PM, Stephen  Bates wrote:
>> We don't want to lump these all together without knowing which region you're allocating from, right?
> 
> In all seriousness I do agree with you on this, Keith, in the long term. We would consider adding property flags for the memory as it is added to the p2p core and then the allocator could evolve to intelligently dish it out. Attributes like endurance, latency and special write commit requirements could all become attributes in time. Perhaps one more reason for a central entity for p2p memory allocation so this code does not end up having to go into many different drivers?

Ugh, I don't know... An allocator for PMR is going to be quite a 
different animal. It would probably need to look more like a filesystem, 
given you'd need to access the same regions across reboots. I don't 
think it fits the p2pmem paradigm at all. I think it's something 
entirely new altogether, and who knows exactly what it will look like.

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 02/10] PCI/P2PDMA: Add sysfs group to display p2pmem stats
  2018-03-01 17:44     ` Bjorn Helgaas
  (?)
  (?)
@ 2018-03-02  0:15       ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-02  0:15 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 01/03/18 10:44 AM, Bjorn Helgaas wrote:
> I think these two statements are out of order, since the attributes
> dereference pdev->p2pdma.  And it looks like you set "error"
> unnecessarily, since you return immediately looking at it.

Per the previous series, sysfs_create_group() is __must_check for some 
reason. I had a printk there, but you didn't think it was necessary, so 
assigning the result to 'error' is the only way to squash the warning.
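
A minimal sketch of the situation being described (p2pmem_group is a
placeholder for the real attribute group):

static void pci_p2pdma_sysfs_setup(struct pci_dev *pdev)
{
	int error;

	/*
	 * sysfs_create_group() is marked __must_check; with the printk
	 * dropped, assigning the result is what silences the warning,
	 * even though the error is not otherwise acted upon.
	 */
	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
}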

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 02/10] PCI/P2PDMA: Add sysfs group to display p2pmem stats
  2018-03-02  0:15       ` Logan Gunthorpe
  (?)
  (?)
@ 2018-03-02  0:36         ` Dan Williams
  -1 siblings, 0 replies; 504+ messages in thread
From: Dan Williams @ 2018-03-02  0:36 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	linux-block, Jérôme Glisse, Jason Gunthorpe,
	Bjorn Helgaas, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Thu, Mar 1, 2018 at 4:15 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 01/03/18 10:44 AM, Bjorn Helgaas wrote:
>>
>> I think these two statements are out of order, since the attributes
>> dereference pdev->p2pdma.  And it looks like you set "error"
>> unnecessarily, since you return immediately looking at it.
>
>
> Per the previous series, sysfs_create_group is must_check for some reason. I
> had a printk there but you didn't think it was necessary. So assigning it to
> error is the only way to squash the warning.

Why not fail the setup if sysfs_create_group() fails? Sure, it may
not be strictly required for userspace to have access to these
attributes, but it seems hostile that userspace can't make assumptions
about the presence of the "p2pmem" directory relative to the
capability being set up.
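
In other words, something like this sketch (placeholder names
throughout; the init/teardown details depend on the rest of the series):

static int pci_p2pdma_setup(struct pci_dev *pdev)
{
	int error;

	/* install pdev->p2pdma first: the attributes dereference it */
	error = pci_p2pdma_init(pdev);
	if (error)
		return error;

	/* fail the whole setup if the sysfs group cannot be created,
	 * so the "p2pmem" directory reliably tracks the capability */
	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
	if (error) {
		pci_p2pdma_teardown(pdev);
		return error;
	}

	return 0;
}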

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 02/10] PCI/P2PDMA: Add sysfs group to display p2pmem stats
  2018-03-02  0:36         ` Dan Williams
  (?)
  (?)
@ 2018-03-02  0:37           ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-02  0:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	linux-block, Jérôme Glisse, Jason Gunthorpe,
	Bjorn Helgaas, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 01/03/18 05:36 PM, Dan Williams wrote:
> On Thu, Mar 1, 2018 at 4:15 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>
>> On 01/03/18 10:44 AM, Bjorn Helgaas wrote:
>>>
>>> I think these two statements are out of order, since the attributes
>>> dereference pdev->p2pdma.  And it looks like you set "error"
>>> unnecessarily, since you return immediately looking at it.
>>
>>
>> Per the previous series, sysfs_create_group is must_check for some reason. I
>> had a printk there but you didn't think it was necessary. So assigning it to
>> error is the only way to squash the warning.
> 
> Why not fail the setup if the sysfs_create_group() fails? Sure, it may
> not be strictly required for userspace to have access to these
> attributes, but it seems hostile that userspace can't make assumptions
> about the presence of the "p2pmem" directory relative to the
> capability being setup.

Works for me. I'll do that.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:53                     ` Stephen  Bates
  (?)
  (?)
@ 2018-03-02 15:53                       ` Christoph Hellwig
  -1 siblings, 0 replies; 504+ messages in thread
From: Christoph Hellwig @ 2018-03-02 15:53 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, linux-block, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch,
	Jason Gunthorpe, Alex Williamson, Jérôme Glisse,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Thu, Mar 01, 2018 at 11:53:16PM +0000, Stephen  Bates wrote:
> > There's a meaningful difference between writing to an NVMe CMB vs PMR
> 
> When the PMR spec becomes public we can discuss how best to integrate it into the P2P framework (if at all) ;-).

http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-01 23:57                     ` Stephen  Bates
  (?)
@ 2018-03-02 16:18                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Jason Gunthorpe @ 2018-03-02 16:18 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Keith Busch, Logan Gunthorpe, Sagi Grimberg, linux-kernel,
	linux-pci, linux-nvme, linux-rdma, linux-nvdimm, linux-block,
	Christoph Hellwig, Jens Axboe, Bjorn Helgaas, Max Gurtovoy,
	Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt,
	Alex Williamson, Steve Wise

On Thu, Mar 01, 2018 at 11:57:43PM +0000, Stephen  Bates wrote:
> > We don't want to lump these all together without knowing which region you're allocating from, right?
> 
> In all seriousness I do agree with you on this, Keith, in the long
> term. We would consider adding property flags for the memory as it
> is added to the p2p core and then the allocator could evolve to
> intelligently dish it out. Attributes like endurance, latency and
> special write commit requirements could all become attributes in
> time. Perhaps one more reason for a central entity for p2p memory
> allocation so this code does not end up having to go into many
> different drivers?

I fear we will find that every kind of P2P memory has special
allocator requirements and dumping it all into one giant pool is not
going to be helpful.

This allocator already seems not to be useful for the P2P target memory
on a Mellanox NIC, due to the way it has a special allocation flow
(windowing) and special usage requirements.

Nor can it be useful for the doorbell memory in the NIC.

Both of these are existing use cases for P2P with out-of-tree patches.

The allocator seems able to solve only the CMB problem, and because of
that I think it is better to put this allocator in the NVMe driver and
not the core code, at least until we find a second user that needs the
same allocation scheme.

Jason

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 22:34                   ` Benjamin Herrenschmidt
                                       ` (2 preceding siblings ...)
  (?)
@ 2018-03-02 16:22                     ` Kani, Toshi
  -1 siblings, 0 replies; 504+ messages in thread
From: Kani, Toshi @ 2018-03-02 16:22 UTC (permalink / raw)
  To: torvalds, benh
  Cc: axboe, keith.busch, oliveroh, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, linux-block, jgg,
	alex.williamson, jglisse, bhelgaas, maxg, hch

On Fri, 2018-03-02 at 09:34 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> > On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt <benh@au1.ibm.com> wrote:
> > > 
> > > Could be that x86 has the smarts to do the right thing, still trying to
> > > untangle the code :-)
> > 
> > Afaik, x86 will not cache PCI unless the system is misconfigured, and
> > even then it's more likely to just raise a machine check exception
> > than cache things.
> > 
> > The last-level cache is going to do fills and spills directly to the
> > memory controller, not to the PCIe side of things.
> > 
> > (I guess you *can* do things differently, and I wouldn't be surprised
> > if some people inside Intel did try to do things differently with
> > trying nvram over PCIe, but in general I think the above is true)
> > 
> > You won't find it in the kernel code either. It's in hardware with
> > firmware configuration of what addresses are mapped to the memory
> > controllers (and _how_ they are mapped) and which are not.
> 
> Ah thanks! That explains it. We can fix that on ppc64 in our linear
> mapping code by checking the address vs. memblocks to choose the right
> page table attributes.

FWIW, this thing is called MTRRs on x86, which are initialized by the
BIOS.  These registers effectively override the page table setup.  The
Intel SDM defines the effect as follows, where 'PAT Entry Value' is the
page table setup.

MTRR Memory Type  PAT Entry Value  Effective Memory Type
--------------------------------------------------------
UC                UC               UC
UC                WC               WC
UC                WT               UC
UC                WB               UC
UC                WP               UC 

On my system, BIOS sets MTRRs to cover the entire MMIO ranges with UC.
Other BIOSes may simply set the MTRR default type to UC, i.e. uncovered
ranges become UC.
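
As a minimal sketch of the UC row of that table (hypothetical enum
names; the kernel's real PAT/MTRR interaction code in arch/x86/mm is
more involved than this):

enum mem_type { MT_UC, MT_WC, MT_WT, MT_WB, MT_WP };

/* With an MTRR range type of UC, only a WC PAT entry survives;
 * every other PAT value is forced down to UC, per the SDM table
 * above. */
static enum mem_type effective_type_uc_mtrr(enum mem_type pat)
{
	return (pat == MT_WC) ? MT_WC : MT_UC;
}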

# cat /proc/mtrr
 :
reg01: base=0xc0000000000 (12582912MB), size=2097152MB, count=1: uncachable
 :

# cat /proc/iomem | grep 'PCI Bus'
 :
c0000000000-c3fffffffff : PCI Bus 0000:00
c4000000000-c7fffffffff : PCI Bus 0000:11
c8000000000-cbfffffffff : PCI Bus 0000:36
cc000000000-cffffffffff : PCI Bus 0000:5b
d0000000000-d3fffffffff : PCI Bus 0000:80
d4000000000-d7fffffffff : PCI Bus 0000:85
d8000000000-dbfffffffff : PCI Bus 0000:ae
dc000000000-dffffffffff : PCI Bus 0000:d7

-Toshi

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-02 16:22                     ` Kani, Toshi
  (?)
  (?)
@ 2018-03-02 16:57                       ` Linus Torvalds
  -1 siblings, 0 replies; 504+ messages in thread
From: Linus Torvalds @ 2018-03-02 16:57 UTC (permalink / raw)
  To: Kani, Toshi
  Cc: axboe, keith.busch, oliveroh, linux-nvdimm, linux-rdma, benh,
	linux-kernel, linux-nvme, linux-block, jgg, alex.williamson,
	jglisse, linux-pci, bhelgaas, maxg, hch

On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi <toshi.kani@hpe.com> wrote:
>
> FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.

No.

Or rather, that's simply just another (small) part of it all - and an
architected and documented one at that.

Like the page table caching entries, the memory type range registers
are really just "secondary information". They don't actually select
between PCIe and RAM, they just affect the behavior on top of that.

The really nitty-gritty stuff is not architected, and generally not
documented outside (possibly) the BIOS writer's guide that is not made
public.

Those magical registers contain details like how the DRAM is
interleaved (if it is), what the timings are, which memory
controller handles which memory range, and what goes to PCIe etc.

Basically all the actual *steering* information is very much hidden
away from the kernel (and often from the BIOS too). The parts we see
at a higher level are just tuning and tweaks.

Note: the details differ _enormously_ between different chips. The
setup can be very different, with things like Knights Landing having
the external cache that can also act as local memory that isn't a
cache but maps at a different physical address instead etc. That's the
kind of steering I'm talking about - at a low level how physical
addresses get mapped to different cache partitions, memory
controllers, or to the IO system etc.

                  Linus

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-02 16:18                       ` Jason Gunthorpe
  (?)
  (?)
@ 2018-03-02 17:10                         ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-02 17:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Stephen Bates
  Cc: Jens Axboe, linux-block, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, Steve Wise, linux-kernel, linux-nvme,
	Keith Busch, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 02/03/18 09:18 AM, Jason Gunthorpe wrote:

> This allocator already seems not useful for the P2P target memory
> on a Mellanox NIC due to the way it has a special allocation flow
> (windowing) and special usage requirements..
> 
> Nor can it be useful for the doorbell memory in the NIC.

No one says every P2P use has to use P2P memory. Doorbells are obviously
not P2P memory. But it's the p2pmem interface that's important, and the
interface absolutely does not belong in the NVMe driver. Once you have
the P2P memory interface you need an allocator behind it, and the obvious
place is in the P2P code to handle the common case where you're just
mapping a BAR. We don't need to implement a genalloc in every driver
that has P2P memory attached to it. If some hardware wants to expose
memory that requires complicated allocation, it's up to them to solve
that problem, but that isn't enough justification, to me, to push common
code into every driver.
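
As a rough sketch of that common case (illustrative only: the names
here are made up, and the actual series wraps this behind the p2pdma
API rather than exposing a raw gen_pool), carving a BAR into the
kernel's genalloc looks something like:

#include <linux/genalloc.h>
#include <linux/pci.h>

static struct gen_pool *p2p_pool;	/* hypothetical per-device pool */

static int p2p_pool_init(struct pci_dev *pdev, int bar)
{
	unsigned long base = pci_resource_start(pdev, bar);
	size_t size = pci_resource_len(pdev, bar);

	/* PAGE_SIZE allocation granularity on the device's NUMA node. */
	p2p_pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
	if (!p2p_pool)
		return -ENOMEM;

	/* Hand the whole BAR to the pool; users then carve chunks out
	 * with gen_pool_alloc() and return them with gen_pool_free(). */
	return gen_pool_add(p2p_pool, base, size, dev_to_node(&pdev->dev));
}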

> Both of these are existing use cases for P2P with out of tree patches..

And we have out of tree code that uses the generic allocator part of 
p2pmem.

> The allocator seems to only be able to solve the CMB problem, and I
> think due to that it is better to put this allocator in the NVMe
> driver and not the core code.. At least until we find a 2nd user that
> needs the same allocation scheme...

See the first P2PMEM RFC. We used Chelsio's NIC instead of the CMB with
a very similar allocation scheme. We'd still be enabling that NIC in the
same way if we hadn't run into hardware issues with it. A simple BAR
with memory behind it is always going to be the most common case.

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-02 16:57                       ` Linus Torvalds
  (?)
  (?)
@ 2018-03-02 17:34                         ` Linus Torvalds
  -1 siblings, 0 replies; 504+ messages in thread
From: Linus Torvalds @ 2018-03-02 17:34 UTC (permalink / raw)
  To: Kani, Toshi
  Cc: axboe, keith.busch, oliveroh, linux-nvdimm, linux-rdma, benh,
	linux-kernel, linux-nvme, linux-block, jgg, alex.williamson,
	jglisse, linux-pci, bhelgaas, maxg, hch

On Fri, Mar 2, 2018 at 8:57 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Like the page table caching entries, the memory type range registers
> are really just "secondary information". They don't actually select
> between PCIe and RAM, they just affect the behavior on top of that.

Side note: historically the two may have been almost the same, since
the CPU only had one single unified bus for "memory" (whether that was
memory-mapped PCI or actual RAM). The steering was external.

But even back then you had extended bits to specify things like how
the 640k-1M region got remapped - which could depend on not just the
address, but on whether you read or wrote to it.  The "lost" 384kB of
RAM could either be remapped at a different address, or could be used
for shadowing the (slow) ROM contents, or whatever.

              Linus

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-02 16:57                       ` Linus Torvalds
                                           ` (2 preceding siblings ...)
  (?)
@ 2018-03-02 17:38                         ` Kani, Toshi
  -1 siblings, 0 replies; 504+ messages in thread
From: Kani, Toshi @ 2018-03-02 17:38 UTC (permalink / raw)
  To: torvalds
  Cc: axboe, keith.busch, oliveroh, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, linux-block, jgg,
	alex.williamson, jglisse, benh, bhelgaas, maxg, hch

On Fri, 2018-03-02 at 08:57 -0800, Linus Torvalds wrote:
> On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi <toshi.kani@hpe.com> wrote:
> > 
> > FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.
> 
> No.
> 
> Or rather, that's simply just another (small) part of it all - and an
> architected and documented one at that.
> 
> Like the page table caching entries, the memory type range registers
> are really just "secondary information". They don't actually select
> between PCIe and RAM, they just affect the behavior on top of that.
> 
> The really nitty-gritty stuff is not architected, and generally not
> documented outside (possibly) the BIOS writer's guide that is not made
> public.
> 
> Those magical registers contain details like how the DRAM is
> interleaved (if it is), what the timings are, which memory
> controller handles which memory range, and what goes to PCIe etc.
> 
> Basically all the actual *steering* information is very much hidden
> away from the kernel (and often from the BIOS too). The parts we see
> at a higher level are just tuning and tweaks.
> 
> Note: the details differ _enormously_ between different chips. The
> setup can be very different, with things like Knights Landing having
> the external cache that can also act as local memory that isn't a
> cache but maps at a different physical address instead etc. That's the
> kind of steering I'm talking about - at a low level how physical
> addresses get mapped to different cache partitions, memory
> controllers, or to the IO system etc.

Right, the MRC code is not documented publicly, and it is very much CPU
dependent.  It programs the address decoders and maps DRAMs to physical
addresses as you described.  MTRRs have nothing to do with this memory
controller setup; rather, MTRRs specify the CPU's memory access type,
such as UC and WB.

Thanks,
-Toshi

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
  2018-03-02 15:53                       ` Christoph Hellwig
  (?)
  (?)
@ 2018-03-02 20:51                         ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-02 20:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch,
	Jason Gunthorpe, Alex Williamson, Jérôme Glisse,
	Benjamin Herrenschmidt, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

>    http://nvmexpress.org/wp-content/uploads/NVM-Express-1.3-Ratified-TPs.zip

@Keith - my apologies.

@Christoph - thanks for the link

So my understanding of when the technical content surrounding new NVMe Technical Proposals (TPs) could be discussed was wrong. I thought the TP content could only be discussed once disclosed in the public standard. I have now learnt that once the TPs are ratified they are publicly available!

However, as Logan pointed out, PMRs are not relevant to this series, so let's defer discussion on how to support them to a later date!

Stephen

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 21:10               ` Jerome Glisse
                                   ` (2 preceding siblings ...)
  (?)
@ 2018-03-02 21:38                 ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-02 21:38 UTC (permalink / raw)
  To: Jerome Glisse, Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran,
	Benjamin Herrenschmidt, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

> It seems people misunderstand HMM :( 

Hi Jerome

Your unhappy face emoticon made me sad, so I went off to (re)read up on HMM. Along the way I came up with a couple of things.

While hmm.txt is really nice to read, it makes no mention of DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication of when one might choose to use one over the other. Would it be possible to update hmm.txt to include some discussion of this? I understand that DEVICE_PUBLIC creates a mapping in the kernel's linear address space for the device memory and DEVICE_PRIVATE does not. However, like I said, I am not sure when you would use one or the other, or the pros and cons of each. I actually ended up finding some useful information in memremap.h, but I don't think it is fair to expect people to dig *that* deep to find this information ;-).
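
For reference, the distinction I pieced together reads roughly like this (paraphrased; not a verbatim quote of memremap.h):

/*
 * MEMORY_DEVICE_PRIVATE: device memory the CPU cannot address
 * directly; struct pages exist so the mm can migrate data to and
 * from it, but it is never mapped into the kernel's linear space.
 *
 * MEMORY_DEVICE_PUBLIC: device memory that is cache-coherently
 * addressable by the CPU, so it can be mapped and used much like
 * ordinary system memory.
 */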

A quick grep shows no drivers using the HMM API in the upstream code today. Is this correct? Are there any examples of out-of-tree drivers that use HMM you can point me to? As a driver developer, what resources exist to help me write an HMM-aware driver?

The (very nice) hmm.txt document is not referenced in the MAINTAINERS file. You might want to fix that when you have a moment.

Stephen



^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-01 23:25                     ` Benjamin Herrenschmidt
  (?)
  (?)
@ 2018-03-02 21:44                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 504+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-02 21:44 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams
  Cc: Jens Axboe, linux-block, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On Fri, 2018-03-02 at 10:25 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote:
> > 
> > On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> > > We use only 52 in practice but yes.
> > > 
> > > >   That's 64PB. If you need
> > > > a sparse vmemmap for the entire space it will take 16TB which leaves you
> > > > with 63.98PB of address space left. (Similar calculations for other
> > > > numbers of address bits.)
> > > 
> > > We only have 52 bits of virtual space for the kernel with the radix
> > > MMU.
> > 
> > Ok, assuming you only have 52 bits of physical address space: the sparse 
> > vmemmap takes 1TB and you're left with 3.9PB of address space for other 
> > things. So, again, why doesn't that work? Is my math wrong?
> 
> The big problem is not the vmemmap, it's the linear mapping

Alright, so, I think I have a plan to fix this, but it will take a
little bit of time.

Basically the idea is to have firmware pass to Linux a region that's
known to be empty, which Linux can use for the vmalloc space rather
than having Linux arbitrarily cut the address space in half.

I'm pretty sure I can always find large enough "holes" in the physical
address space that are outside of both RAM/OpenCAPI/Nvlink and
PCIe/MMIO space. If anything, unused chip IDs. But I don't want Linux
to have to know about the intimate HW details, so I'll pass it from FW.

It will take some time to adjust Linux and get updated FW around,
though.

Once that's done, I'll be able to have the linear mapping go through
the entire 52-bit space (minus that hole). Of course the hole needs to
be large enough to hold a vmemmap for a 52-bit space, so that's about
4TB. So I probably need a hole that's at least 8TB.
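
(Back-of-envelope, assuming 64K pages and a 64-byte struct page, which
is what I have in mind here:

    2^52 bytes / 2^16 bytes per page       = 2^36 pages
    2^36 pages * 2^6 bytes per struct page = 2^42 bytes = 4TB

hence wanting at least 8TB of headroom for the hole.)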

As for the mapping attributes, it should be easy for my linear mapping
code to ensure anything that isn't actual RAM is mapped NC.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-02 21:38                 ` Stephen  Bates
                                     ` (2 preceding siblings ...)
  (?)
@ 2018-03-02 22:09                   ` Jerome Glisse
  -1 siblings, 0 replies; 504+ messages in thread
From: Jerome Glisse @ 2018-03-02 22:09 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran,
	Benjamin Herrenschmidt, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Fri, Mar 02, 2018 at 09:38:43PM +0000, Stephen  Bates wrote:
> > It seems people misunderstand HMM :( 
> 
> Hi Jerome
> 
> Your unhappy face emoticon made me sad so I went off to (re)read up
> on HMM. Along the way I came up with a couple of things.
> 
> While hmm.txt is really nice to read, it makes no mention of
> DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication of
> when one might choose to use one over the other. Would it be possible
> to update hmm.txt to include some discussion of this? I understand
> that DEVICE_PUBLIC creates a mapping in the kernel's linear address
> space for the device memory and DEVICE_PRIVATE does not. However,
> like I said, I am not sure when you would use either one, or the pros
> and cons of doing so. I actually ended up finding some useful
> information in memremap.h, but I don't think it is fair to expect
> people to dig *that* deep to find this information ;-).

Yes, I need to document that some more in hmm.txt. PRIVATE is for
devices whose memory does not fit the expectations of regular memory,
i.e. it is not cacheable, so PCIe device memory falls under that
category. If all you need is struct page for such memory then this is
a perfect fit. On top of that you can use more HMM features, like
using this memory transparently inside a process address space.

PUBLIC is for memory that belongs to a device but can still be
accessed by the CPU in a cache-coherent way (CAPI, CCIX, ...). Again,
if you have such memory and just want struct page you can use that,
and again, if you want to use it inside a process address space HMM
provides more helpers to do so.
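
To make the distinction concrete, here is a minimal sketch of the
registration side. Illustrative only: exact fields and helpers differ
between kernel versions, and DEVICE_PRIVATE additionally needs page
fault/free callbacks that I omit here.

	#include <linux/device.h>
	#include <linux/err.h>
	#include <linux/memremap.h>

	static int register_device_memory(struct device *dev,
					  struct resource *res,
					  bool cpu_coherent)
	{
		struct dev_pagemap *pgmap;

		pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
		if (!pgmap)
			return -ENOMEM;

		pgmap->res = *res;
		/*
		 * PUBLIC: the CPU can access the memory cache-coherently
		 * (CAPI, CCIX, ...) and it gets a kernel linear mapping.
		 * PRIVATE: the CPU cannot access it coherently (e.g. PCIe
		 * BARs); you only get struct pages for it.
		 */
		pgmap->type = cpu_coherent ? MEMORY_DEVICE_PUBLIC
					   : MEMORY_DEVICE_PRIVATE;

		return PTR_ERR_OR_ZERO(devm_memremap_pages(dev, pgmap));
	}

The hmm_devmem_add() helpers wrap the DEVICE_PRIVATE side of this for
you, so an HMM driver rarely calls devm_memremap_pages() directly.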


> A quick grep shows no drivers using the HMM API in the upstream code
> today. Is this correct? Are there any examples of out-of-tree drivers
> that use HMM you can point me to? As a driver developer, what
> resources exist to help me write an HMM-aware driver?

I am about to send an RFC for nouveau; I am still working out some
bugs. I was hoping to be done today but I am still fighting with the
hardware. There are other drivers being worked on with HMM. I do not
know exactly when they will be made public (I expect in the coming
months).

How you use HMM is under the control of the device driver, as is how
you expose it to userspace; drivers use it how they want to use it.
There is no pattern or requirement imposed by HMM. All the drivers
being worked on so far are GPU-like hardware, i.e. a big chunk of
on-board memory (several gigabytes), and they want to use that memory
inside a process address space in a fashion that is transparent to the
program and CPU.

Each has its own API exposed to userspace, and while there is a lot of
similarity among them, many details of the userspace API are hardware
specific. In the GPU world most of the driver lives in userspace;
applications target high-level APIs such as OpenGL, Vulkan, OpenCL or
CUDA. Those APIs then have a hardware-specific userspace driver that
talks to hardware-specific IOCTLs. So this is not like a network or
block device.


> The (very nice) hmm.txt document is not referenced in the MAINTAINERS
> file. You might want to fix that when you have a moment.

I have a couple of small fixes/typo patches that I need to clean up
and send; I will fix MAINTAINERS as part of those.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-02 21:44                       ` Benjamin Herrenschmidt
  (?)
  (?)
@ 2018-03-02 22:24                         ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-02 22:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Dan Williams
  Cc: Jens Axboe, linux-block, Oliver OHalloran, linux-nvdimm,
	linux-rdma, linux-pci, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig



On 02/03/18 02:44 PM, Benjamin Herrenschmidt wrote:
> Alright, so, I think I have a plan to fix this, but it will take a
> little bit of time.
> 
> Basically the idea is to have firmware pass to Linux a region that's
> known to be empty, which Linux can use for the vmalloc space rather
> than having Linux arbitrarily cut the address space in half.
> 
> I'm pretty sure I can always find large enough "holes" in the physical
> address space that are outside of both RAM/OpenCAPI/Nvlink and
> PCIe/MMIO space. If anything, unused chip IDs. But I don't want Linux
> to have to know about the intimate HW details, so I'll pass it from FW.
> 
> It will take some time to adjust Linux and get updated FW around,
> though.
> 
> Once that's done, I'll be able to have the linear mapping go through
> the entire 52-bit space (minus that hole). Of course the hole needs to
> be large enough to hold a vmemmap for a 52-bit space, so that's about
> 4TB. So I probably need a hole that's at least 8TB.
> 
> As for the mapping attributes, it should be easy for my linear mapping
> code to ensure anything that isn't actual RAM is mapped NC.

Very cool. I'm glad to hear you found a way to fix this.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-02-28 23:40   ` Logan Gunthorpe
  (?)
  (?)
@ 2018-03-05  1:33     ` Oliver
  -1 siblings, 0 replies; 504+ messages in thread
From: Oliver @ 2018-03-05  1:33 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Register the CMB buffer as p2pmem and use the appropriate allocation
> functions to create and destroy the IO SQ.
>
> If the CMB supports WDS and RDS, publish it for use as p2p memory
> by other devices.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++----------------------
>  1 file changed, 41 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 73036d2fbbd5..56ca79be8476 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -29,6 +29,7 @@
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/sed-opal.h>
> +#include <linux/pci-p2pdma.h>
>
>  #include "nvme.h"
>
> @@ -91,9 +92,8 @@ struct nvme_dev {
>         struct work_struct remove_work;
>         struct mutex shutdown_lock;
>         bool subsystem;
> -       void __iomem *cmb;
> -       pci_bus_addr_t cmb_bus_addr;
>         u64 cmb_size;
> +       bool cmb_use_sqes;
>         u32 cmbsz;
>         u32 cmbloc;
>         struct nvme_ctrl ctrl;
> @@ -148,7 +148,7 @@ struct nvme_queue {
>         struct nvme_dev *dev;
>         spinlock_t q_lock;
>         struct nvme_command *sq_cmds;
> -       struct nvme_command __iomem *sq_cmds_io;
> +       bool sq_cmds_is_io;
>         volatile struct nvme_completion *cqes;
>         struct blk_mq_tags **tags;
>         dma_addr_t sq_dma_addr;
> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>  {
>         u16 tail = nvmeq->sq_tail;

> -       if (nvmeq->sq_cmds_io)
> -               memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> -       else
> -               memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> +       memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
the _toio() variant enforces alignment, does the copy with 4-byte
stores, and has a full barrier after the copy. In comparison our
regular memcpy() does none of those things and may use unaligned and
vector load/stores. For normal (cacheable) memory that is perfectly
fine, but they can cause alignment faults when targeted at MMIO
(cache-inhibited) memory.

I think in this particular case it might be ok since we know SQEs are
aligned to 64-byte boundaries and the copy is too small to use our
vectorised memcpy(). I'll assume we don't need explicit ordering
between writes of SQEs since the existing code doesn't seem to care
unless the doorbell is being rung, so you're probably fine there too.
That said, I still think this is a little bit sketchy and at the very
least you should add a comment explaining what's going on when the CMB
is being used. If someone more familiar with the NVMe driver could
chime in I would appreciate it.
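
Something like the following would keep the two paths explicit and
give that comment a natural home (a hypothetical helper sketched
against this patch's sq_cmds_is_io flag, not something the patch
contains):

	static void nvme_sq_copy_cmd(struct nvme_queue *nvmeq, u16 tail,
				     struct nvme_command *cmd)
	{
		if (nvmeq->sq_cmds_is_io)
			/*
			 * The CMB is MMIO: memcpy_toio() gives us the
			 * aligned stores and trailing barrier we need on
			 * e.g. powerpc. The cast is only needed because
			 * the patch drops the __iomem annotation from
			 * sq_cmds.
			 */
			memcpy_toio((void __iomem *)&nvmeq->sq_cmds[tail],
				    cmd, sizeof(*cmd));
		else
			memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
	}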

>         if (++tail == nvmeq->q_depth)
>                 tail = 0;
> @@ -1286,9 +1283,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
>  {
>         dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
>                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
> -       if (nvmeq->sq_cmds)
> -               dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
> -                                       nvmeq->sq_cmds, nvmeq->sq_dma_addr);
> +
> +       if (nvmeq->sq_cmds) {
> +               if (nvmeq->sq_cmds_is_io)
> +                       pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
> +                                       nvmeq->sq_cmds,
> +                                       SQ_SIZE(nvmeq->q_depth));
> +               else
> +                       dma_free_coherent(nvmeq->q_dmadev,
> +                                         SQ_SIZE(nvmeq->q_depth),
> +                                         nvmeq->sq_cmds,
> +                                         nvmeq->sq_dma_addr);
> +       }
>  }
>
>  static void nvme_free_queues(struct nvme_dev *dev, int lowest)
> @@ -1368,12 +1374,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
>  static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
>                                 int qid, int depth)
>  {
> -       /* CMB SQEs will be mapped before creation */
> -       if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
> -               return 0;
> +       struct pci_dev *pdev = to_pci_dev(dev->dev);
> +
> +       if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> +               nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
> +               nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
> +                                               nvmeq->sq_cmds);
> +               nvmeq->sq_cmds_is_io = true;
> +       }
> +
> +       if (!nvmeq->sq_cmds) {
> +               nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
> +                                       &nvmeq->sq_dma_addr, GFP_KERNEL);
> +               nvmeq->sq_cmds_is_io = false;
> +       }
>
> -       nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
> -                                           &nvmeq->sq_dma_addr, GFP_KERNEL);
>         if (!nvmeq->sq_cmds)
>                 return -ENOMEM;
>         return 0;
> @@ -1449,13 +1464,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>         struct nvme_dev *dev = nvmeq->dev;
>         int result;
>
> -       if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> -               unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> -                                                     dev->ctrl.page_size);
> -               nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
> -               nvmeq->sq_cmds_io = dev->cmb + offset;
> -       }
> -
>         nvmeq->cq_vector = qid - 1;
>         result = adapter_alloc_cq(dev, qid, nvmeq);
>         if (result < 0)
> @@ -1685,9 +1693,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
>                 return;
>         dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
>
> -       if (!use_cmb_sqes)
> -               return;
> -
>         size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
>         offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
>         bar = NVME_CMB_BIR(dev->cmbloc);
> @@ -1704,11 +1709,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
>         if (size > bar_size - offset)
>                 size = bar_size - offset;
>
> -       dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
> -       if (!dev->cmb)
> +       if (pci_p2pdma_add_resource(pdev, bar, size, offset))
>                 return;
> -       dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
> +
>         dev->cmb_size = size;
> +       dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
> +
> +       if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
> +                       (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
> +               pci_p2pmem_publish(pdev, true);
>
>         if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
>                                     &dev_attr_cmb.attr, NULL))
> @@ -1718,12 +1727,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
>
>  static inline void nvme_release_cmb(struct nvme_dev *dev)
>  {
> -       if (dev->cmb) {
> -               iounmap(dev->cmb);
> -               dev->cmb = NULL;
> +       if (dev->cmb_size) {
>                 sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
>                                              &dev_attr_cmb.attr, NULL);
> -               dev->cmbsz = 0;
> +               dev->cmb_size = 0;
>         }
>  }
>
> @@ -1918,13 +1925,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>         if (nr_io_queues == 0)
>                 return 0;
>
> -       if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> +       if (dev->cmb_use_sqes) {
>                 result = nvme_cmb_qdepth(dev, nr_io_queues,
>                                 sizeof(struct nvme_command));
>                 if (result > 0)
>                         dev->q_depth = result;
>                 else
> -                       nvme_release_cmb(dev);
> +                       dev->cmb_use_sqes = false;
>         }
>
>         do {
> --
> 2.11.0
>

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
@ 2018-03-05  1:33     ` Oliver
  0 siblings, 0 replies; 504+ messages in thread
From: Oliver @ 2018-03-05  1:33 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-pci, linux-nvme, linux-rdma, linux-nvdimm,
	linux-block, Jens Axboe, Benjamin Herrenschmidt, Alex Williamson,
	Keith Busch, Jérôme Glisse, Jason Gunthorpe,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Register the CMB buffer as p2pmem and use the appropriate allocation
> functions to create and destroy the IO SQ.
>
> If the CMB supports WDS and RDS, publish it for use as p2p memory
> by other devices.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++----------------------
>  1 file changed, 41 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 73036d2fbbd5..56ca79be8476 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -29,6 +29,7 @@
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/sed-opal.h>
> +#include <linux/pci-p2pdma.h>
>
>  #include "nvme.h"
>
> @@ -91,9 +92,8 @@ struct nvme_dev {
>         struct work_struct remove_work;
>         struct mutex shutdown_lock;
>         bool subsystem;
> -       void __iomem *cmb;
> -       pci_bus_addr_t cmb_bus_addr;
>         u64 cmb_size;
> +       bool cmb_use_sqes;
>         u32 cmbsz;
>         u32 cmbloc;
>         struct nvme_ctrl ctrl;
> @@ -148,7 +148,7 @@ struct nvme_queue {
>         struct nvme_dev *dev;
>         spinlock_t q_lock;
>         struct nvme_command *sq_cmds;
> -       struct nvme_command __iomem *sq_cmds_io;
> +       bool sq_cmds_is_io;
>         volatile struct nvme_completion *cqes;
>         struct blk_mq_tags **tags;
>         dma_addr_t sq_dma_addr;
> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>  {
>         u16 tail = nvmeq->sq_tail;

> -       if (nvmeq->sq_cmds_io)
> -               memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> -       else
> -               memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> +       memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
the _toio() variant enforces alignment, does the copy with 4 byte
stores, and has a full barrier after the copy. In comparison our
regular memcpy() does none of those things and may use unaligned and
vector load/stores. For normal (cacheable) memory that is perfectly
fine, but they can cause alignment faults when targeted at MMIO
(cache-inhibited) memory.

I think in this particular case it might be ok since we know SEQs are
aligned to 64 byte boundaries and the copy is too small to use our
vectorised memcpy(). I'll assume we don't need explicit ordering
between writes of SEQs since the existing code doesn't seem to care
unless the doorbell is being rung, so you're probably fine there too.
That said, I still think this is a little bit sketchy and at the very
least you should add a comment explaining what's going on when the CMB
is being used. If someone more familiar with the NVMe driver could
chime in I would appreciate it.
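
For illustration, a minimal sketch of the kind of conservative copy the
_toio() variant performs (a hypothetical helper, not the actual powerpc
implementation; __raw_writel() and mb() are real kernel primitives):

static void mmio_safe_copy(void __iomem *dst, const void *src, size_t n)
{
	u32 __iomem *d = dst;
	const u32 *s = src;

	/* Fixed-size, naturally aligned stores only; assumes dst is
	 * 4-byte aligned and n is a multiple of 4, as for NVMe SQEs. */
	while (n >= 4) {
		__raw_writel(*s++, d++);
		n -= 4;
	}

	/* Full barrier so the copy is ordered before any later MMIO,
	 * e.g. a doorbell write. */
	mb();
}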

>         if (++tail == nvmeq->q_depth)
>                 tail = 0;
> @@ -1286,9 +1283,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
>  {
>         dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
>                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
> -       if (nvmeq->sq_cmds)
> -               dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
> -                                       nvmeq->sq_cmds, nvmeq->sq_dma_addr);
> +
> +       if (nvmeq->sq_cmds) {
> +               if (nvmeq->sq_cmds_is_io)
> +                       pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
> +                                       nvmeq->sq_cmds,
> +                                       SQ_SIZE(nvmeq->q_depth));
> +               else
> +                       dma_free_coherent(nvmeq->q_dmadev,
> +                                         SQ_SIZE(nvmeq->q_depth),
> +                                         nvmeq->sq_cmds,
> +                                         nvmeq->sq_dma_addr);
> +       }
>  }
>
>  static void nvme_free_queues(struct nvme_dev *dev, int lowest)
> @@ -1368,12 +1374,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
>  static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
>                                 int qid, int depth)
>  {
> -       /* CMB SQEs will be mapped before creation */
> -       if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
> -               return 0;
> +       struct pci_dev *pdev = to_pci_dev(dev->dev);
> +
> +       if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> +               nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
> +               nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
> +                                               nvmeq->sq_cmds);
> +               nvmeq->sq_cmds_is_io = true;
> +       }
> +
> +       if (!nvmeq->sq_cmds) {
> +               nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
> +                                       &nvmeq->sq_dma_addr, GFP_KERNEL);
> +               nvmeq->sq_cmds_is_io = false;
> +       }
>
> -       nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
> -                                           &nvmeq->sq_dma_addr, GFP_KERNEL);
>         if (!nvmeq->sq_cmds)
>                 return -ENOMEM;
>         return 0;
> @@ -1449,13 +1464,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>         struct nvme_dev *dev = nvmeq->dev;
>         int result;
>
> -       if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> -               unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> -                                                     dev->ctrl.page_size);
> -               nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset;
> -               nvmeq->sq_cmds_io = dev->cmb + offset;
> -       }
> -
>         nvmeq->cq_vector = qid - 1;
>         result = adapter_alloc_cq(dev, qid, nvmeq);
>         if (result < 0)
> @@ -1685,9 +1693,6 @@ static void nvme_map_cmb(struct nvme_dev *dev)
>                 return;
>         dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
>
> -       if (!use_cmb_sqes)
> -               return;
> -
>         size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
>         offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
>         bar = NVME_CMB_BIR(dev->cmbloc);
> @@ -1704,11 +1709,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
>         if (size > bar_size - offset)
>                 size = bar_size - offset;
>
> -       dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
> -       if (!dev->cmb)
> +       if (pci_p2pdma_add_resource(pdev, bar, size, offset))
>                 return;
> -       dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;
> +
>         dev->cmb_size = size;
> +       dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
> +
> +       if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
> +                       (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
> +               pci_p2pmem_publish(pdev, true);
>
>         if (sysfs_add_file_to_group(&dev->ctrl.device->kobj,
>                                     &dev_attr_cmb.attr, NULL))
> @@ -1718,12 +1727,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
>
>  static inline void nvme_release_cmb(struct nvme_dev *dev)
>  {
> -       if (dev->cmb) {
> -               iounmap(dev->cmb);
> -               dev->cmb = NULL;
> +       if (dev->cmb_size) {
>                 sysfs_remove_file_from_group(&dev->ctrl.device->kobj,
>                                              &dev_attr_cmb.attr, NULL);
> -               dev->cmbsz = 0;
> +               dev->cmb_size = 0;
>         }
>  }
>
> @@ -1918,13 +1925,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>         if (nr_io_queues == 0)
>                 return 0;
>
> -       if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> +       if (dev->cmb_use_sqes) {
>                 result = nvme_cmb_qdepth(dev, nr_io_queues,
>                                 sizeof(struct nvme_command));
>                 if (result > 0)
>                         dev->q_depth = result;
>                 else
> -                       nvme_release_cmb(dev);
> +                       dev->cmb_use_sqes = false;
>         }
>
>         do {
> --
> 2.11.0
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05  1:33     ` Oliver
@ 2018-03-05 16:00       ` Keith Busch
  0 siblings, 0 replies; 504+ messages in thread
From: Keith Busch @ 2018-03-05 16:00 UTC (permalink / raw)
  To: Oliver
  Cc: Jens Axboe, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme, linux-block, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
> > @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> >  {
> >         u16 tail = nvmeq->sq_tail;
> 
> > -       if (nvmeq->sq_cmds_io)
> > -               memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> > -       else
> > -               memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> > +       memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> 
> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
> the _toio() variant enforces alignment, does the copy with 4 byte
> stores, and has a full barrier after the copy. In comparison our
> regular memcpy() does none of those things and may use unaligned and
> vector load/stores. For normal (cacheable) memory that is perfectly
> fine, but they can cause alignment faults when targeted at MMIO
> (cache-inhibited) memory.
> 
> I think in this particular case it might be ok since we know SQEs are
> aligned to 64 byte boundaries and the copy is too small to use our
> vectorised memcpy(). I'll assume we don't need explicit ordering
> between writes of SQEs since the existing code doesn't seem to care
> unless the doorbell is being rung, so you're probably fine there too.
> That said, I still think this is a little bit sketchy and at the very
> least you should add a comment explaining what's going on when the CMB
> is being used. If someone more familiar with the NVMe driver could
> chime in I would appreciate it.

I may not be understanding the concern, but I'll give it a shot.

You're right, the start of any SQE is always 64-byte aligned, so that
should satisfy alignment requirements.

The order when writing multiple/successive SQEs in a submission queue
does matter, and this is currently serialized through the q_lock.

The order in which the bytes of a single SQE are written doesn't really
matter as long as the entire SQE is written into the CMB prior to writing
that SQ's doorbell register.

The doorbell register is written immediately after copying a command
entry into the submission queue (ignore "shadow buffer" features),
so the ratio of doorbell writes to submitted commands is 1:1.

If a CMB SQE and DB order is not enforced with the memcpy, then we do
need a barrier after the SQE's memcpy and before the doorbell's writel.
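
As a sketch of that ordering (simplified from the actual submission
path; memcpy(), wmb() and writel() are the real primitives, the
surrounding code is illustrative):

	/* in __nvme_submit_cmd(), under q_lock */
	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd)); /* SQE into CMB */
	if (++tail == nvmeq->q_depth)
		tail = 0;
	wmb();	/* ensure the whole SQE is visible before the doorbell */
	writel(tail, nvmeq->q_db);	/* ring the SQ doorbell */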

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 16:00       ` Keith Busch
@ 2018-03-05 17:10         ` Logan Gunthorpe
  0 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-05 17:10 UTC (permalink / raw)
  To: Keith Busch, Oliver
  Cc: Jens Axboe, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme, linux-block, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 05/03/18 09:00 AM, Keith Busch wrote:
> On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
>> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>>>   {
>>>          u16 tail = nvmeq->sq_tail;
>>
>>> -       if (nvmeq->sq_cmds_io)
>>> -               memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>>> -       else
>>> -               memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>> +       memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>
>> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
>> the _toio() variant enforces alignment, does the copy with 4 byte
>> stores, and has a full barrier after the copy. In comparison our
>> regular memcpy() does none of those things and may use unaligned and
>> vector load/stores. For normal (cacheable) memory that is perfectly
>> fine, but they can cause alignment faults when targeted at MMIO
>> (cache-inhibited) memory.
>>
>> I think in this particular case it might be ok since we know SQEs are
>> aligned to 64 byte boundaries and the copy is too small to use our
>> vectorised memcpy(). I'll assume we don't need explicit ordering
>> between writes of SQEs since the existing code doesn't seem to care
>> unless the doorbell is being rung, so you're probably fine there too.
>> That said, I still think this is a little bit sketchy and at the very
>> least you should add a comment explaining what's going on when the CMB
>> is being used. If someone more familiar with the NVMe driver could
>> chime in I would appreciate it.
> 
> I may not be understanding the concern, but I'll give it a shot.
> 
> You're right, the start of any SQE is always 64-byte aligned, so that
> should satisfy alignment requirements.
> 
> The order when writing multiple/successive SQEs in a submission queue
> does matter, and this is currently serialized through the q_lock.
> 
> The order in which the bytes of a single SQE is written doesn't really
> matter as long as the entire SQE is written into the CMB prior to writing
> that SQ's doorbell register.
> 
> The doorbell register is written immediately after copying a command
> entry into the submission queue (ignore "shadow buffer" features),
> so the doorbells written to commands submitted is 1:1.
> 
> If a CMB SQE and DB order is not enforced with the memcpy, then we do
> need a barrier after the SQE's memcpy and before the doorbell's writel.


Thanks for the information, Keith.

Adding to this: regular memcpy() generally also enforces alignment, as
unaligned access to regular memory is typically bad in some way on most
arches. The generic memcpy_toio() also does not have any barrier, as it
is just a call to memcpy(). Arm64 also does not appear to have a barrier
in its implementation, and in the short survey I did I could not find
any implementation with a barrier. I also did not find a ppc
implementation in the tree, but it would be weird for it to add a
barrier when other arches do not appear to need it.
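
For reference, the generic fallback being described, abridged from
include/asm-generic/io.h (this is how kernels of this era define it
when an arch does not provide its own):

static inline void memcpy_toio(volatile void __iomem *addr,
			       const void *buffer, size_t size)
{
	/* A plain memcpy() to the __iomem address: no barrier and no
	 * guaranteed access size. */
	memcpy(__io_virt(addr), buffer, size);
}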

We've been operating on the assumption that memory mapped by
devm_memremap_pages() can be treated as regular memory. This is
emphasized by the fact that it does not return an __iomem pointer. If
this assumption does not hold for an arch then we cannot support P2P DMA
without an overhaul of many kernel interfaces or creating other backend
interfaces into the drivers which take different data types (ie. we'd
have to bypass the entire block layer when trying to write data in
p2pmem to an nvme device). This is very undesirable.
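
For reference, the interface underpinning that assumption, as of the
kernel this series targets (note the plain void * return rather than
void __iomem *):

void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);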

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 17:10         ` Logan Gunthorpe
@ 2018-03-05 18:02           ` Sinan Kaya
  0 siblings, 0 replies; 504+ messages in thread
From: Sinan Kaya @ 2018-03-05 18:02 UTC (permalink / raw)
  To: Logan Gunthorpe, Keith Busch, Oliver
  Cc: Jens Axboe, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme, linux-block, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On 3/5/2018 12:10 PM, Logan Gunthorpe wrote:
> 
> 
> On 05/03/18 09:00 AM, Keith Busch wrote:
>> On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
>>> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>>> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>>>>   {
>>>>          u16 tail = nvmeq->sq_tail;
>>>
>>>> -       if (nvmeq->sq_cmds_io)
>>>> -               memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>>>> -       else
>>>> -               memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>> +       memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>
>>> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
>>> the _toio() variant enforces alignment, does the copy with 4 byte
>>> stores, and has a full barrier after the copy. In comparison our
>>> regular memcpy() does none of those things and may use unaligned and
>>> vector load/stores. For normal (cacheable) memory that is perfectly
>>> fine, but they can cause alignment faults when targeted at MMIO
>>> (cache-inhibited) memory.
>>>
>>> I think in this particular case it might be ok since we know SQEs are
>>> aligned to 64 byte boundaries and the copy is too small to use our
>>> vectorised memcpy(). I'll assume we don't need explicit ordering
>>> between writes of SQEs since the existing code doesn't seem to care
>>> unless the doorbell is being rung, so you're probably fine there too.
>>> That said, I still think this is a little bit sketchy and at the very
>>> least you should add a comment explaining what's going on when the CMB
>>> is being used. If someone more familiar with the NVMe driver could
>>> chime in I would appreciate it.
>>
>> I may not be understanding the concern, but I'll give it a shot.
>>
>> You're right, the start of any SQE is always 64-byte aligned, so that
>> should satisfy alignment requirements.
>>
>> The order when writing multiple/successive SQEs in a submission queue
>> does matter, and this is currently serialized through the q_lock.
>>
>> The order in which the bytes of a single SQE is written doesn't really
>> matter as long as the entire SQE is written into the CMB prior to writing
>> that SQ's doorbell register.
>>
>> The doorbell register is written immediately after copying a command
>> entry into the submission queue (ignore "shadow buffer" features),
>> so the doorbells written to commands submitted is 1:1.
>>
>> If a CMB SQE and DB order is not enforced with the memcpy, then we do
>> need a barrier after the SQE's memcpy and before the doorbell's writel.
> 
> 
> Thanks for the information Keith.
> 
> Adding to this: regular memcpy generally also enforces alignment as unaligned access to regular memory is typically bad in some way on most arches. The generic memcpy_toio also does not have any barrier as it is just a call to memcpy. Arm64 also does not appear to have a barrier in its implementation and in the short survey I did I could not find any implementation with a barrier. I also did not find a ppc implementation in the tree but it would be weird for it to add a barrier when other arches do not appear to need it.
> 
> We've been operating on the assumption that memory mapped by devm_memremap_pages() can be treated as regular memory. This is emphasized by the fact that it does not return an __iomem pointer. If this assumption does not hold for an arch then we cannot support P2P DMA without an overhaul of many kernel interfaces or creating other backend interfaces into the drivers which take different data types (ie. we'd have to bypass the entire block layer when trying to write data in p2pmem to an nvme device). This is very undesirable.
> 

writel has a barrier inside on ARM64.

https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/io.h#L143

Why do you need another barrier?
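
For reference, the definitions behind that link, abridged from
arch/arm64/include/asm/io.h (the barrier macro expands to wmb(), so the
MMIO store is ordered after prior normal-memory writes):

#define __iowmb()	wmb()
#define writel(v,c)	({ __iowmb(); writel_relaxed((v),(c)); })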


From Documentation/memory-barriers.txt:

ACCESSING DEVICES
-----------------

Many devices can be memory mapped, and so appear to the CPU as if they're just
a set of memory locations.  To control such a device, the driver usually has to
make the right memory accesses in exactly the right order.

However, having a clever CPU or a clever compiler creates a potential problem
in that the carefully sequenced accesses in the driver code won't reach the
device in the requisite order if the CPU or the compiler thinks it is more
efficient to reorder, combine or merge accesses - something that would cause
the device to malfunction.

Inside of the Linux kernel, I/O should be done through the appropriate accessor
routines - such as inb() or writel() - which know how to make such accesses
appropriately sequential. 
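
As a minimal illustration of that rule (the base address and register
layout below are made up for the example):

--
void __iomem *regs = ioremap(DEV_BASE, 0x1000);	/* hypothetical device */

writel(lower_32_bits(addr), regs + 0x00);	/* made-up offsets */
writel(upper_32_bits(addr), regs + 0x04);
writel(1, regs + 0x08);		/* "go" bit written last, in order */
--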


> Logan
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 18:02           ` Sinan Kaya
@ 2018-03-05 18:09             ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-05 18:09 UTC (permalink / raw)
  To: Sinan Kaya, Keith Busch, Oliver
  Cc: Jens Axboe, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme, linux-block, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 05/03/18 11:02 AM, Sinan Kaya wrote:
> writel has a barrier inside on ARM64.
> 
> https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/io.h#L143

Yes, and there is no barrier inside memcpy_toio as it uses the
__raw_write*() accessors. This should be sufficient as we are only
accessing addresses that look like memory and have no side effects
(those enabling doorbell accesses may need to worry about this though).
Typically, what would happen in this case is that the CPU issues writes
to the BAR normally and, the next time it programs the DMA engine,
everything is flushed via the barrier in writel.

> Why do you need another barrier?

We don't.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 16:00       ` Keith Busch
@ 2018-03-05 19:57         ` Sagi Grimberg
  -1 siblings, 0 replies; 504+ messages in thread
From: Sagi Grimberg @ 2018-03-05 19:57 UTC (permalink / raw)
  To: Keith Busch, Oliver
  Cc: Jens Axboe, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme, linux-block, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig


>>> -       if (nvmeq->sq_cmds_io)
>>> -               memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>>> -       else
>>> -               memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>> +       memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>
>> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
>> the _toio() variant enforces alignment, does the copy with 4 byte
>> stores, and has a full barrier after the copy. In comparison our
>> regular memcpy() does none of those things and may use unaligned and
>> vector load/stores. For normal (cacheable) memory that is perfectly
>> fine, but they can cause alignment faults when targeted at MMIO
>> (cache-inhibited) memory.
>>
>> I think in this particular case it might be ok since we know SEQs are
>> aligned to 64 byte boundaries and the copy is too small to use our
>> vectorised memcpy(). I'll assume we don't need explicit ordering
>> between writes of SEQs since the existing code doesn't seem to care
>> unless the doorbell is being rung, so you're probably fine there too.
>> That said, I still think this is a little bit sketchy and at the very
>> least you should add a comment explaining what's going on when the CMB
>> is being used. If someone more familiar with the NVMe driver could
>> chime in I would appreciate it.
> 
> I may not be understanding the concern, but I'll give it a shot.
> 
> You're right, the start of any SQE is always 64-byte aligned, so that
> should satisfy alignment requirements.
> 
> The order when writing multiple/successive SQEs in a submission queue
> does matter, and this is currently serialized through the q_lock.
> 
> The order in which the bytes of a single SQE is written doesn't really
> matter as long as the entire SQE is written into the CMB prior to writing
> that SQ's doorbell register.
> 
> The doorbell register is written immediately after copying a command
> entry into the submission queue (ignore "shadow buffer" features),
> so the doorbells written to commands submitted is 1:1.
> 
> If a CMB SQE and DB order is not enforced with the memcpy, then we do
> need a barrier after the SQE's memcpy and before the doorbell's writel.

Keith, while we're on this, regardless of cmb, is SQE memcpy and DB
update ordering always guaranteed?

If you look at mlx4 (rdma device driver) that works exactly the same as
nvme you will find:
--
                 qp->sq.head += nreq;

                 /*
                  * Make sure that descriptors are written before
                  * doorbell record.
                  */
                 wmb();

                 writel(qp->doorbell_qpn,
                        to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);

                 /*
                  * Make sure doorbells don't leak out of SQ spinlock
                  * and reach the HCA out of order.
                  */
                 mmiowb();
--

That seems to explicitly make sure to place a barrier before updating
the doorbell. So as I see it, either ordering is guaranteed and the
above code is redundant, or nvme needs to do the same.

Thoughts?

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 19:57         ` Sagi Grimberg
@ 2018-03-05 20:10           ` Jason Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Jason Gunthorpe @ 2018-03-05 20:10 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, linux-block, Benjamin Herrenschmidt, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, Keith Busch,
	Alex Williamson, Jérôme Glisse, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Mon, Mar 05, 2018 at 09:57:27PM +0200, Sagi Grimberg wrote:

> Keith, while we're on this, regardless of cmb, is SQE memcopy and DB update
> ordering always guaranteed?
> 
> If you look at mlx4 (rdma device driver) that works exactly the same as
> nvme you will find:
> --
>                 qp->sq.head += nreq;
> 
>                 /*
>                  * Make sure that descriptors are written before
>                  * doorbell record.
>                  */
>                 wmb();
> 
>                 writel(qp->doorbell_qpn,
>                        to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);
> 
>                 /*
>                  * Make sure doorbells don't leak out of SQ spinlock
>                  * and reach the HCA out of order.
>                  */
>                 mmiowb();
> --
> 
> That seems to explicitly make sure to place a barrier before updating
> the doorbell. So as I see it, either ordering is guaranteed and the
> above code is redundant, or nvme needs to do the same.

A wmb() is always required before operations that can trigger DMA.

The reason ARM has a barrier in writel() is not to make it ordered
with respect to CPU stores to cachable memory, but to make it ordered
with respect to *other* writels.

Eg Linux defines this:

writel(A, mem);
writel(B, mem);

To always produce two TLPs on PCI-E when mem is UC BAR memory.

And ARM cannot guarantee that without the extra barrier.

So now we see stuff like this:

writel_relaxed(A, mem);
writel_relaxed(B, mem+4);

Which says the TLPs to A and B can be issued in any order.

So when reading the above mlx code, we see the first wmb() being used
to ensure that CPU stores to cachable memory are visible to the DMA
triggered by the doorbell ring.

The mmiowb() is used to ensure that DB writes are not combined and not
issued in any order other than implied by the lock that encloses the
whole thing. This is needed because uar_map is WC memory.

We don't have ordering with respect to two writel's here, so if ARM
performance was a concern the writel could be switched to
writel_relaxed().
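
Spelled out against the quoted mlx4 code, that option would look
roughly like this (a sketch following the reasoning above, not a
proposed patch):

--
	qp->sq.head += nreq;

	wmb();		/* descriptors visible before the doorbell DMA */

	writel_relaxed(qp->doorbell_qpn,
		       to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);

	mmiowb();	/* doorbells stay ordered w.r.t. the SQ spinlock */
--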

Presumably nvme has similar requirements, although I guess the DB
register is mapped UC not WC?

Jason

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 19:57         ` Sagi Grimberg
@ 2018-03-05 20:13           ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-05 20:13 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch, Oliver
  Cc: Jens Axboe, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme, linux-block, Alex Williamson, Jason Gunthorpe,
	Jérôme Glisse, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 05/03/18 12:57 PM, Sagi Grimberg wrote:
> Keith, while we're on this, regardless of cmb, is SQE memcopy and DB 
> update ordering always guaranteed?
> 
> If you look at mlx4 (rdma device driver) that works exactly the same as
> nvme you will find:
> -- 
>                  qp->sq.head += nreq;
> 
>                  /*
>                   * Make sure that descriptors are written before
>                   * doorbell record.
>                   */
>                  wmb();
> 
>                  writel(qp->doorbell_qpn,
>                         to_mdev(ibqp->device)->uar_map + 
> MLX4_SEND_DOORBELL);
> 
>                  /*
>                   * Make sure doorbells don't leak out of SQ spinlock
>                   * and reach the HCA out of order.
>                   */
>                  mmiowb();
> -- 

To me, it looks like the wmb() is redundant as writel should guarantee 
the order. (Indeed, per Sinan's comment, writel on arm64 starts with a 
wmb() which means, on that platform, there are two wmb() calls in a row.)

The mmiowb() call, on the other hand, looks correct per my understanding 
of its purpose with respect to the spinlock.

Logan



^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 20:10           ` Jason Gunthorpe
@ 2018-03-05 20:16             ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-05 20:16 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg
  Cc: Jens Axboe, Keith Busch, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jérôme Glisse, Benjamin Herrenschmidt, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig



On 05/03/18 01:10 PM, Jason Gunthorpe wrote:
> So when reading the above mlx code, we see the first wmb() being used
> to ensure that CPU stores to cachable memory are visible to the DMA
> triggered by the doorbell ring.

Oh, yes, that makes sense. Disregard my previous email as I was wrong.

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
  2018-03-02 22:09                   ` Jerome Glisse
@ 2018-03-05 20:36                     ` Stephen  Bates
  -1 siblings, 0 replies; 504+ messages in thread
From: Stephen  Bates @ 2018-03-05 20:36 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, Keith Busch, Oliver OHalloran,
	Benjamin Herrenschmidt, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, linux-block, Alex Williamson,
	Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

>    Yes i need to document that some more in hmm.txt...

Hi Jerome, thanks for the explanation. Can I suggest you update hmm.txt with what you sent out?

>  I am about to send RFC for nouveau, i am still working out some bugs.

Great. I will keep an eye out for it. An example user of hmm will be very helpful.

> i will fix the MAINTAINERS as part of those.
    
Awesome, thanks.

Stephen
    


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 20:10           ` Jason Gunthorpe
@ 2018-03-05 20:42             ` Keith Busch
  -1 siblings, 0 replies; 504+ messages in thread
From: Keith Busch @ 2018-03-05 20:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jens Axboe, Benjamin Herrenschmidt, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, linux-block,
	Alex Williamson, Jérôme Glisse, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote:
> So when reading the above mlx code, we see the first wmb() being used
> to ensure that CPU stores to cachable memory are visible to the DMA
> triggered by the doorbell ring.

IIUC, we don't need a similar barrier for NVMe to ensure memory is
visible to DMA since the SQE memory is allocated DMA coherent when the
SQ is not within a CMB.
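
For reference, the host-memory SQ comes from a coherent DMA allocation,
along the lines of the sketch below (SQ_SIZE() and the field names
follow the driver, but treat the snippet as illustrative):

--
	/* Host-memory SQ: coherent allocation, so CPU stores are visible
	 * to the device without an extra barrier before the doorbell. */
	nvmeq->sq_cmds = dma_alloc_coherent(dev, SQ_SIZE(depth),
					    &nvmeq->sq_dma_addr, GFP_KERNEL);
--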
 
> The mmiowb() is used to ensure that DB writes are not combined and not
> issued in any order other than implied by the lock that encloses the
> whole thing. This is needed because uar_map is WC memory.
> 
> We don't have ordering with respect to two writel's here, so if ARM
> performance was a concern the writel could be switched to
> writel_relaxed().
> 
> Presumably nvme has similar requirements, although I guess the DB
> register is mapped UC not WC?

Yep, the NVMe DB register is required by the spec to be mapped UC.

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 20:42             ` Keith Busch
  (?)
@ 2018-03-05 20:50               ` Jason Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Jason Gunthorpe @ 2018-03-05 20:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Benjamin Herrenschmidt, linux-nvdimm, linux-rdma,
	linux-pci, linux-kernel, linux-nvme, linux-block,
	Alex Williamson, Jérôme Glisse, Bjorn Helgaas,
	Max Gurtovoy, Christoph Hellwig

On Mon, Mar 05, 2018 at 01:42:12PM -0700, Keith Busch wrote:
> On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote:
> > So when reading the above mlx code, we see the first wmb() being used
> > to ensure that CPU stores to cachable memory are visible to the DMA
> > triggered by the doorbell ring.
> 
> IIUC, we don't need a similar barrier for NVMe to ensure memory is
> visibile to DMA since the SQE memory is allocated DMA coherent when the
> SQ is not within a CMB.

You still need it.

DMA coherent just means you don't need to call the DMA API after
writing, it says nothing about CPU ordering.

e.g. on x86 DMA coherent is just normal system memory, and you do need
the SFENCE between system memory stores and DMA triggering MMIO,
apparently.
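
Concretely, the pattern under discussion looks like this (hypothetical
function and field names; the explicit wmb() is redundant wherever
writel() itself already implies the barrier, which is the subtlety at
issue here):

static void submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
{
	u16 tail = nvmeq->sq_tail;

	/* SQE store to DMA-coherent system memory */
	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

	/* make the SQE visible before the doorbell MMIO write
	 * triggers the device's DMA read of it (SFENCE on x86) */
	wmb();

	writel(tail, nvmeq->q_db);
}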

Jason

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-01 19:13       ` Logan Gunthorpe
  (?)
  (?)
@ 2018-03-05 22:28         ` Bjorn Helgaas
  -1 siblings, 0 replies; 504+ messages in thread
From: Bjorn Helgaas @ 2018-03-05 22:28 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Thu, Mar 01, 2018 at 12:13:10PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 11:02 AM, Bjorn Helgaas wrote:
> > >   void pci_enable_acs(struct pci_dev *dev)
> > >   {
> > > +	if (pci_p2pdma_disable_acs(dev))
> > > +		return;
> > 
> > This doesn't read naturally to me.  I do see that when
> > CONFIG_PCI_P2PDMA is not set, pci_p2pdma_disable_acs() does nothing
> > and returns 0, so we'll go ahead and try to enable ACS as before.
> > 
> > But I think it would be clearer to have an #ifdef CONFIG_PCI_P2PDMA
> > right here so it's more obvious that we only disable ACS when it's
> > selected.
> 
> I could do this... however, I wrote it this way because I've read Linus
> dislikes using #ifdef's inside function bodies and I personally agree with
> that sentiment.

I try to avoid #ifdefs too, but in this case the plain reading of the
code makes no sense (we're in a function called "enable_acs()", and
the first thing we do is call a function to "disable_acs()").

Disabling ACS is scary for all the security reasons mentioned
elsewhere, so a reader who knows what ACS does and cares about
virtualization and security *has* to go look up the definition of
pci_p2pdma_disable_acs().

If you put the #ifdef right here, then it's easier to read because we
can see that "oh, this is a special and uncommon case that I can
probably ignore".
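
Something like this, roughly (a sketch of the suggested shape, not a
literal patch):

void pci_enable_acs(struct pci_dev *dev)
{
#ifdef CONFIG_PCI_P2PDMA
	/* P2PDMA requires ACS redirection to stay off below switches */
	if (pci_p2pdma_disable_acs(dev))
		return;
#endif
	/* normal ACS enabling continues here */
}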

Bjorn

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  2018-03-05 22:28         ` Bjorn Helgaas
  (?)
  (?)
@ 2018-03-05 23:01           ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-05 23:01 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Keith Busch, Alex Williamson, linux-nvdimm,
	linux-rdma, linux-pci, linux-kernel, linux-nvme, linux-block,
	Jérôme Glisse, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 05/03/18 03:28 PM, Bjorn Helgaas wrote:
> If you put the #ifdef right here, then it's easier to read because we
> can see that "oh, this is a special and uncommon case that I can
> probably ignore".

Makes sense. I'll do that.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-05 17:10         ` Logan Gunthorpe
  (?)
  (?)
@ 2018-03-06  0:49           ` Oliver
  -1 siblings, 0 replies; 504+ messages in thread
From: Oliver @ 2018-03-06  0:49 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, linux-block, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Keith Busch, Alex Williamson,
	Jason Gunthorpe, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Tue, Mar 6, 2018 at 4:10 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 05/03/18 09:00 AM, Keith Busch wrote:
>>
>> On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
>>>
>>> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe <logang@deltatee.com>
>>> wrote:
>>>>
>>>> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue
>>>> *nvmeq,
>>>>   {
>>>>          u16 tail = nvmeq->sq_tail;
>>>
>>>
>>>> -       if (nvmeq->sq_cmds_io)
>>>> -               memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd,
>>>> sizeof(*cmd));
>>>> -       else
>>>> -               memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>> +       memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>
>>>
>>> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
>>> the _toio() variant enforces alignment, does the copy with 4 byte
>>> stores, and has a full barrier after the copy. In comparison our
>>> regular memcpy() does none of those things and may use unaligned and
>>> vector load/stores. For normal (cacheable) memory that is perfectly
>>> fine, but they can cause alignment faults when targeted at MMIO
>>> (cache-inhibited) memory.
>>>
>>> I think in this particular case it might be ok since we know SQEs are
>>> aligned to 64 byte boundaries and the copy is too small to use our
>>> vectorised memcpy(). I'll assume we don't need explicit ordering
>>> between writes of SQEs since the existing code doesn't seem to care
>>> unless the doorbell is being rung, so you're probably fine there too.
>>> That said, I still think this is a little bit sketchy and at the very
>>> least you should add a comment explaining what's going on when the CMB
>>> is being used. If someone more familiar with the NVMe driver could
>>> chime in I would appreciate it.
>>
>>
>> I may not be understanding the concern, but I'll give it a shot.
>>
>> You're right, the start of any SQE is always 64-byte aligned, so that
>> should satisfy alignment requirements.
>>
>> The order when writing multiple/successive SQEs in a submission queue
>> does matter, and this is currently serialized through the q_lock.
>>
>> The order in which the bytes of a single SQE are written doesn't really
>> matter as long as the entire SQE is written into the CMB prior to writing
>> that SQ's doorbell register.
>>
>> The doorbell register is written immediately after copying a command
>> entry into the submission queue (ignore "shadow buffer" features),
>> so the doorbells written to commands submitted is 1:1.
>>
>> If a CMB SQE and DB order is not enforced with the memcpy, then we do
>> need a barrier after the SQE's memcpy and before the doorbell's writel.
>
>
>
> Thanks for the information Keith.
>
> Adding to this: regular memcpy generally also enforces alignment as
> unaligned access to regular memory is typically bad in some way on most
> arches. The generic memcpy_toio also does not have any barrier as it is just
> a call to memcpy. Arm64 also does not appear to have a barrier in its
> implementation and in the short survey I did I could not find any
> implementation with a barrier. I also did not find a ppc implementation in
> the tree but it would be weird for it to add a barrier when other arches do
> not appear to need it.

It's in arch/powerpc/kernel/io.c as _memcpy_toio() and it has two full barriers!

Awesome!

Our io.h indicates that our iomem accessors are designed to provide x86ish
strong ordering of accesses to MMIO space. The git log indicates
arch/powerpc/kernel/io.c has barely been touched in the last decade so
odds are most of that code was written in the elder days when people
were less aware of ordering issues. It might just be overly conservative
by today's standards, but maybe not (see below).
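
For reference, the rough shape of that routine (a simplified
paraphrase of arch/powerpc/kernel/io.c that drops the unaligned
head/tail handling, not the literal source):

void _memcpy_toio(volatile void __iomem *dest, const void *src,
		  unsigned long n)
{
	u32 __iomem *d = (u32 __iomem *)dest;
	const u32 *s = src;

	mb();				/* first full barrier */
	while (n >= 4) {
		__raw_writel(*s++, d++);	/* aligned 4-byte stores */
		n -= 4;
	}
	mb();				/* second full barrier */
}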

> We've been operating on the assumption that memory mapped by
> devm_memremap_pages() can be treated as regular memory. This is emphasized
> by the fact that it does not return an __iomem pointer.

I think the lack of an __iomem annotation is a bit of a red herring. The comment
header for devm_memremap_pages() outlines the conditions for when using
it is valid, the important one being:

> * 4/ res is expected to be a host memory range that could feasibly be
> *    treated as a "System RAM" range, i.e. not a device mmio range, but
> *    this is not enforced.

Granted, the CMB is an MMIO buffer rather than a register range, but from
my perspective it's still a device MMIO range rather than something we
could feasibly treat as system RAM. The two big reasons being:

a) System RAM is cacheable and coherent while the CMB is neither, and
b) On PPC MMIO accesses are in a separate ordering domain to cacheable
   memory accesses.

The latter isn't as terrifying as it sounds since a full barrier will
order everything with everything, but it still causes problems. For
performance reasons we want to minimise the number of cases where we
need to use a sync instruction (full barrier) in favour of a
lightweight sync (lwsync), which only orders cacheable accesses. To
implement that, our MMIO accessors set a percpu flag that
arch_spin_unlock() checks to see if any MMIO accesses have occurred
which would require a full barrier. If we haven't done any MMIO
operations then it'll only do a lwsync. My concern here is that a
driver might do something like this:

void *buffer = devm_memremap_pages(<stuff corresponding to card memory buffer>);

spin_lock();
<do some stuff with system memory>
// buffer is in MMIO space, so the writes are uncached
memcpy(buffer, source_data, <size>);
<do more stuff with system memory>
spin_unlock(); // no MMIO access detected, so it does a lwsync

As far as I know the spinlock API guarantees that accesses occurring
inside the critical section will be ordered relative to what happens
outside the critical section. In this case we would be violating that
guarantee, and I don't see a clean way to keep the fix for it
contained to arch/powerpc/. Fundamentally we need to know when
something is accessing MMIO space.

(I'm not going to suggest ditching the lwsync trick. mpe is not going
to take that patch without a really good reason.)
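
In rough pseudo-C the trick looks like this (a paraphrase of the
powerpc io_sync mechanism, not the literal source):

static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
	/* every MMIO accessor sets the per-cpu io_sync flag; pay for
	 * a full sync only if MMIO happened in the critical section */
	if (unlikely(get_paca()->io_sync)) {
		mb();
		get_paca()->io_sync = 0;
	}
	__asm__ __volatile__(PPC_RELEASE_BARRIER : : : "memory");
	lock->slock = 0;
}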

> If this assumption
> does not hold for an arch then we cannot support P2P DMA without an overhaul
> of many kernel interfaces or creating other backend interfaces into the
> drivers which take different data types (i.e. we'd have to bypass the entire
> block layer when trying to write data in p2pmem to an nvme device). This is
> very undesirable.

I know. I'm not trying to ruin your day, but the separation between io
and normal memory is there for a reason. Maybe we can relax that
separation, but we need to be careful about how we do it.

Oliver

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-06  0:49           ` Oliver
  (?)
  (?)
@ 2018-03-06  1:14             ` Logan Gunthorpe
  -1 siblings, 0 replies; 504+ messages in thread
From: Logan Gunthorpe @ 2018-03-06  1:14 UTC (permalink / raw)
  To: Oliver
  Cc: Jens Axboe, linux-block, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Keith Busch, Alex Williamson,
	Jason Gunthorpe, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig



On 05/03/18 05:49 PM, Oliver wrote:
> It's in arch/powerpc/kernel/io.c as _memcpy_toio() and it has two full barriers!
> 
> Awesome!
> 
> Our io.h indicates that our iomem accessors are designed to provide x86ish
> strong ordering of accesses to MMIO space. The git log indicates
> arch/powerpc/kernel/io.c has barely been touched in the last decade so
> odds are most of that code was written in the elder days when people
> were less aware of ordering issues. It might just be overly conservative
> by today's standards, but maybe not (see below).

Yes, that seems overly conservative.

> (I'm not going to suggest ditching the lwsync trick. mpe is not going
> to take that patch
> without a really good reason)

Well, that's pretty gross. Is this not exactly the situation mmiowb() is 
meant to solve? See [1].
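
i.e. the documented pattern from [1] (illustrative lock and register
names):

spin_lock(&dev->lock);
writel(val, dev->regs + SOME_REG);	/* MMIO write in the section */
mmiowb();	/* order the MMIO write before the lock release is
		 * observed by the next CPU taking the lock */
spin_unlock(&dev->lock);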

Though, you're right in principle. Even if power was similar to other 
systems in this way, it's still a risk that if these pages get passed 
somewhere in the kernel that uses a spin lock like that without an 
mmiowb() call, then it's going to have a bug. For now, the risk is 
pretty low as we know exactly where all the p2pmem pages will be used 
but if it gets into other places, all bets are off. I did do some work 
trying to make a safe version of io-pages and also trying to change from 
pages to pfn_t in large areas but neither approach seemed likely to get 
any traction in the community, at least not in the near term.

Logan

[1] ACQUIRES VS I/O ACCESSES in 
https://www.kernel.org/doc/Documentation/memory-barriers.txt


^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  2018-03-06  1:14             ` Logan Gunthorpe
@ 2018-03-06 10:40               ` Oliver
  -1 siblings, 0 replies; 504+ messages in thread
From: Oliver @ 2018-03-06 10:40 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, linux-block, linux-nvdimm, linux-rdma, linux-pci,
	linux-kernel, linux-nvme, Keith Busch, Alex Williamson,
	Jason Gunthorpe, Jérôme Glisse, Benjamin Herrenschmidt,
	Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig

On Tue, Mar 6, 2018 at 12:14 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
> On 05/03/18 05:49 PM, Oliver wrote:
>>
>> It's in arch/powerpc/kernel/io.c as _memcpy_toio() and it has two full
>> barriers!
>>
>> Awesome!
>>
>> Our io.h indicates that our iomem accessors are designed to provide x86ish
>> strong ordering of accesses to MMIO space. The git log indicates
>> arch/powerpc/kernel/io.c has barely been touched in the last decade so
>> odds are most of that code was written in the elder days when people
>> were less aware of ordering issues. It might just be overly conservative
>> by today's standards, but maybe not (see below).
>
>
> Yes, that seems overly conservative.
>
>> (I'm not going to suggest ditching the lwsync trick. mpe is not going
>> to take that patch
>> without a really good reason)
>
>
> Well, that's pretty gross. Is this not exactly the situation mmiowb() is
> meant to solve? See [1].

Yep, mmiowb() is supposed to be used in this situation. According to BenH,
the author of that io_sync hack, we implement the stronger semantics so
that we don't break existing drivers that assume spin_unlock() does order
I/O even though it's not supposed to. At a guess, the x86 version of
spin_unlock() happens to do that, so the rest of us need to either live
with it or fix all the bugs :)
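
For anyone unfamiliar with it, the trick is roughly this (a paraphrase 
from memory, not the literal arch/powerpc source):

	/* MMIO accessors leave a per-CPU breadcrumb... */
	static inline void my_writel(u32 val, void __iomem *addr)
	{
		__raw_writel(val, addr);
		get_paca()->io_sync = 1;	/* MMIO happened on this CPU */
	}

	/* ...and unlock pays for a full sync only when it has to. */
	static inline void my_spin_unlock(arch_spinlock_t *lock)
	{
		if (unlikely(get_paca()->io_sync)) {
			mb();		/* heavyweight sync orders the MMIO */
			get_paca()->io_sync = 0;
		}
		/* normal release store of the lock word follows */
	}

So uncontended locks that never touch MMIO don't pay for the sync.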

> Though, you're right in principle. Even if Power were similar to other
> systems in this way, there's still a risk: if these pages get passed
> somewhere in the kernel that takes a spinlock like that without an mmiowb()
> call, it's going to have a bug. For now the risk is pretty low, as we
> know exactly where all the p2pmem pages will be used, but if they get
> into other places, all bets are off.

Yeah this was my main concern with the whole approach. For ioremap()ed
memory we have the __iomem annotation to help with tracking when we
need to be careful, but we'd lose that entirely here.
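
For example (a toy sketch; bar_start and bar_len stand in for a real BAR):

	void __iomem *regs = ioremap(bar_start, bar_len);
	u32 ok  = readl(regs + 0x10);		/* fine: proper accessor */
	u32 bad = *(u32 *)(regs + 0x10);	/* sparse warns: the cast
						 * strips __iomem */
	iounmap(regs);

With p2pmem handed around as plain struct pages, nothing equivalent 
flags the code paths that need the extra care.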

> I did do some work trying to make a safe
> version of io-pages and also trying to change from pages to pfn_t in large
> areas, but neither approach seemed likely to get any traction in the
> community, at least not in the near term.

It's a tricky problem. HMM with DEVICE_PRIVATE might be more
palatable than the pfn_t conversion since there would still be struct pages
backing the device memory. That said, there are probably other issues with
device private memory not being part of the linear mapping, but HMM
provides some assistance there.
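
Roughly, and hedging on the exact signatures since they have moved 
around between kernel versions, registering device memory with HMM 
looks something like this (the callbacks here are hypothetical):

	static const struct hmm_devmem_ops my_devmem_ops = {
		.free  = my_devmem_free,	/* last ref on a page dropped */
		.fault = my_devmem_fault,	/* CPU touched device memory:
						 * migrate it back to RAM */
	};

	devmem = hmm_devmem_add(&my_devmem_ops, dev, dev_mem_size);

The result is MEMORY_DEVICE_PRIVATE struct pages: most of the mm can 
track them, but the CPU never maps the memory directly, which sidesteps 
the linear-mapping question at the cost of a fault-and-migrate step.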

Oliver

^ permalink raw reply	[flat|nested] 504+ messages in thread

* Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests
  2018-03-01 15:58       ` Stephen  Bates
@ 2018-03-09  5:08         ` Bart Van Assche
  -1 siblings, 0 replies; 504+ messages in thread
From: Bart Van Assche @ 2018-03-09  5:08 UTC (permalink / raw)
  To: linux-kernel, linux-block, linux-rdma, sagi, sbates, logang,
	linux-nvdimm, linux-nvme, linux-pci
  Cc: axboe, benh, jglisse, keith.busch, alex.williamson, jgg,
	bhelgaas, maxg, hch

On Thu, 2018-03-01 at 15:58 +0000, Stephen  Bates wrote:
> > Any plans adding the capability to nvme-rdma? Should be
> > straight-forward... In theory, the use-case would be rdma backend
> > fabric behind. Shouldn't be hard to test either...
> 
> Nice idea Sagi. Yes we have been starting to look at that. Though again we
> would probably want to impose the "attached to the same PCIe switch" rule
> which might be less common to satisfy in initiator systems. 
> 
> Down the road I would also like to discuss the best way to use this P2P
> framework to facilitate copies between NVMe namespaces (on both PCIe and
> fabric attached namespaces) without having to expose the CMB up to user
> space. Wasn't something like that done in the SCSI world at some point
> Martin?

Are you perhaps referring to the following patch series: "Copy Offload"
(https://www.spinics.net/lists/linux-scsi/msg74680.html /
https://lwn.net/Articles/592094/)? I will contact Martin off-list, and if
he doesn't have the time to revive that patch series himself, I will free
up some time to work on this.

Bart.

^ permalink raw reply	[flat|nested] 504+ messages in thread

end of thread (newest message: 2018-03-09  5:08 UTC)

Thread overview: 504+ messages
2018-02-28 23:39 [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory Logan Gunthorpe
     [not found] ` <20180228234006.21093-1-logang@deltatee.com>
2018-02-28 23:39   ` [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory Logan Gunthorpe
2018-03-01 17:37     ` Bjorn Helgaas
2018-03-01 18:55       ` Logan Gunthorpe
2018-03-01 23:00         ` Bjorn Helgaas
2018-03-01 23:06           ` Logan Gunthorpe
2018-03-01 23:14             ` Stephen  Bates
2018-03-01 23:45               ` Bjorn Helgaas
2018-02-28 23:39   ` [PATCH v2 03/10] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset Logan Gunthorpe
2018-03-01 17:49     ` Bjorn Helgaas
2018-03-01 19:36       ` Logan Gunthorpe
2018-02-28 23:39 ` [PATCH v2 02/10] PCI/P2PDMA: Add sysfs group to display p2pmem stats Logan Gunthorpe
2018-03-01 17:44   ` Bjorn Helgaas
2018-03-02  0:15     ` Logan Gunthorpe
2018-03-02  0:36       ` Dan Williams
2018-03-02  0:37         ` Logan Gunthorpe
2018-02-28 23:40 ` [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches Logan Gunthorpe
2018-03-01 18:02   ` Bjorn Helgaas
2018-03-01 18:54     ` Stephen  Bates
2018-03-01 21:21       ` Alex Williamson
2018-03-01 21:26         ` Logan Gunthorpe
2018-03-01 21:32         ` Stephen  Bates
2018-03-01 21:35           ` Jerome Glisse
2018-03-01 21:37             ` Logan Gunthorpe
2018-03-01 23:15       ` Bjorn Helgaas
2018-03-01 23:59         ` Logan Gunthorpe
2018-03-01 19:13     ` Logan Gunthorpe
2018-03-05 22:28       ` Bjorn Helgaas
2018-03-05 23:01         ` Logan Gunthorpe
2018-02-28 23:40 ` [PATCH v2 05/10] block: Introduce PCI P2P flags for request and request queue Logan Gunthorpe
2018-03-01 11:08   ` Sagi Grimberg
2018-02-28 23:40 ` [PATCH v2 06/10] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]() Logan Gunthorpe
2018-03-01 10:32   ` Sagi Grimberg
2018-03-01 17:16     ` Logan Gunthorpe
2018-02-28 23:40 ` [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB Logan Gunthorpe
2018-03-05  1:33   ` Oliver
2018-03-05 16:00     ` Keith Busch
2018-03-05 17:10       ` Logan Gunthorpe
2018-03-05 18:02         ` Sinan Kaya
2018-03-05 18:09           ` Logan Gunthorpe
2018-03-06  0:49         ` Oliver
2018-03-06  1:14           ` Logan Gunthorpe
2018-03-06 10:40             ` Oliver
2018-03-05 19:57       ` Sagi Grimberg
2018-03-05 20:10         ` Jason Gunthorpe
2018-03-05 20:16           ` Logan Gunthorpe
2018-03-05 20:42           ` Keith Busch
2018-03-05 20:50             ` Jason Gunthorpe
2018-03-05 20:13         ` Logan Gunthorpe
2018-02-28 23:40 ` [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests Logan Gunthorpe
2018-03-01 11:07   ` Sagi Grimberg
2018-03-01 15:58     ` Stephen  Bates
2018-03-09  5:08       ` Bart Van Assche
2018-02-28 23:40 ` [PATCH v2 09/10] nvme-pci: Add a quirk for a pseudo CMB Logan Gunthorpe
2018-03-01 11:03   ` Sagi Grimberg
2018-02-28 23:40 ` [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory Logan Gunthorpe
2018-03-01 11:03   ` Sagi Grimberg
2018-03-01 16:15     ` Stephen  Bates
2018-03-01 17:40     ` Logan Gunthorpe
2018-03-01 18:35       ` Sagi Grimberg
2018-03-01 18:42         ` Jason Gunthorpe
2018-03-01 19:01           ` Stephen  Bates
2018-03-01 19:27           ` Logan Gunthorpe
2018-03-01 22:45             ` Jason Gunthorpe
2018-03-01 22:56               ` Logan Gunthorpe
2018-03-01 23:00               ` Stephen  Bates
2018-03-01 23:20                 ` Jason Gunthorpe
2018-03-01 23:29                   ` Logan Gunthorpe
2018-03-01 23:32                   ` Stephen  Bates
2018-03-01 23:49                 ` Keith Busch
2018-03-01 23:52                   ` Logan Gunthorpe
2018-03-01 23:53                   ` Stephen  Bates
2018-03-02 15:53                     ` Christoph Hellwig
2018-03-02 20:51                       ` Stephen  Bates
2018-03-01 23:57                   ` Stephen  Bates
2018-03-02  0:03                     ` Logan Gunthorpe
2018-03-02 16:18                     ` Jason Gunthorpe
2018-03-02 17:10                       ` Logan Gunthorpe
2018-03-01 19:10         ` Logan Gunthorpe
2018-03-01  3:54 ` [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory Benjamin Herrenschmidt
2018-03-01  3:56   ` Benjamin Herrenschmidt
2018-03-01 18:04     ` Logan Gunthorpe
2018-03-01 20:29       ` Benjamin Herrenschmidt
2018-03-01 20:55         ` Jerome Glisse
2018-03-01 21:03           ` Logan Gunthorpe
2018-03-01 21:10             ` Jerome Glisse
2018-03-01 21:15               ` Logan Gunthorpe
2018-03-01 21:25                 ` Jerome Glisse
2018-03-01 21:37               ` Stephen  Bates
2018-03-02 21:38               ` Stephen  Bates
2018-03-02 22:09                 ` Jerome Glisse
2018-03-05 20:36                   ` Stephen  Bates
2018-03-01 20:55         ` Logan Gunthorpe
2018-03-01 18:09     ` Stephen  Bates
2018-03-01 20:32       ` Benjamin Herrenschmidt
2018-03-01 19:21     ` Dan Williams
2018-03-01 19:30       ` Logan Gunthorpe
2018-03-01 20:34       ` Benjamin Herrenschmidt
2018-03-01 20:40         ` Benjamin Herrenschmidt
2018-03-01 20:53           ` Jason Gunthorpe
2018-03-01 20:57             ` Logan Gunthorpe
2018-03-01 22:06             ` Benjamin Herrenschmidt
2018-03-01 22:31               ` Linus Torvalds
2018-03-01 22:34                 ` Benjamin Herrenschmidt
2018-03-02 16:22                   ` Kani, Toshi
2018-03-02 16:57                     ` Linus Torvalds
2018-03-02 17:34                       ` Linus Torvalds
2018-03-02 17:38                       ` Kani, Toshi
2018-03-01 21:37         ` Dan Williams
2018-03-01 21:45           ` Logan Gunthorpe
2018-03-01 21:57             ` Logan Gunthorpe
2018-03-01 23:00               ` Benjamin Herrenschmidt
2018-03-01 23:19                 ` Logan Gunthorpe
2018-03-01 23:25                   ` Benjamin Herrenschmidt
2018-03-02 21:44                     ` Benjamin Herrenschmidt
2018-03-02 22:24                       ` Logan Gunthorpe
2018-03-01 23:26                   ` Benjamin Herrenschmidt
2018-03-01 23:54                     ` Logan Gunthorpe
2018-03-01 21:03       ` Benjamin Herrenschmidt
2018-03-01 21:11         ` Logan Gunthorpe
2018-03-01 21:18           ` Jerome Glisse
2018-03-01 21:22             ` Logan Gunthorpe
2018-03-01 10:31 ` Sagi Grimberg
2018-03-01 19:33   ` Logan Gunthorpe
