* [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
@ 2017-03-30 22:12 Logan Gunthorpe
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

Hello,

As discussed at LSF/MM we'd like to present our work to enable
copy offload support in NVMe fabrics RDMA targets. We'd appreciate
some review and feedback from the community on our direction.
This series is not intended to go upstream at this point.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device, avoiding system memory entirely. The upsides are better QoS
for memory-intensive applications running on the CPU and lower PCI
bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU). The present trade-off is a
reduction in overall throughput, largely due to hardware issues that
should improve in future devices.

Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This means many setups that would likely
work well will not be supported, but it lets us be more confident the
feature will work and places no responsibility on the user to
understand their topology. (We've chosen to go this route based on
feedback we received at LSF.)

In order to enable this functionality we introduce a new p2pmem device
which can be instantiated by PCI drivers. The device registers some
PCI memory as ZONE_DEVICE and provides a genalloc-based allocator for
users of these devices to get buffers. We give an example of enabling
p2p memory with the cxgb4 driver; however, these devices currently have
hardware issues that prevent their use, so we will likely drop this
patch in the future. Ideally, we'd want to enable this functionality
with NVMe CMB buffers, but we don't have any hardware with this feature
at this time.
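
As a rough illustration, a PCI driver might register one of its BARs
as p2pmem like the sketch below (the probe function, BAR number and
error handling here are hypothetical; patch 1 has the actual API):

  #include <linux/pci.h>
  #include <linux/p2pmem.h>

  static int example_probe(struct pci_dev *pdev,
                           const struct pci_device_id *id)
  {
          struct p2pmem_dev *p;
          int rc;

          rc = pcim_enable_device(pdev);
          if (rc)
                  return rc;

          /* Create a p2pmem device as a child of the PCI device. */
          p = p2pmem_create(&pdev->dev);
          if (IS_ERR(p))
                  return PTR_ERR(p);

          /* Expose all of (hypothetical) BAR 4 through the allocator. */
          rc = p2pmem_add_pci_region(p, pdev, 4);
          if (rc) {
                  p2pmem_unregister(p);
                  return rc;
          }

          pci_set_drvdata(pdev, p);
          return 0;
  }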

In nvmet-rdma, we attempt to find an appropriate p2pmem device at
queue creation time, and if a suitable one is found we use it for
all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs
attribute is also created, which must be set before any p2pmem use is
attempted.
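
Roughly, the selection at queue creation looks like the following
sketch (names simplified and hypothetical; the real logic is in the
nvmet-rdma patch):

  /* The NIC and NVMe device that will DMA to/from the buffers. */
  struct device *dma_devs[] = { &nic_dev, &nvme_dev, NULL };
  struct p2pmem_dev *p;

  /* Only succeeds if a p2pmem device is behind the same switch. */
  p = p2pmem_find_compat(dma_devs);
  if (p)
          queue->p2pmem = p;  /* reference held until queue teardown */
  /* otherwise fall back to regular system memory */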

This patchset also includes a more controversial patch which provides an
interface for userspace to obtain p2pmem buffers through an mmap call on
a cdev. This enables userspace to fairly easily use p2pmem with RDMA and
O_DIRECT interfaces. However, the user would be entirely responsible for
knowing what they're doing, inspecting sysfs to understand the PCI
topology, and only using it in sane situations.
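
For example, userspace might use it roughly as follows (the device node
name and mapping size are hypothetical):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 2 * 1024 * 1024;
          int fd = open("/dev/p2pmem0", O_RDWR);
          void *buf;

          if (fd < 0)
                  return 1;

          /* Map a p2pmem buffer usable as an RDMA MR or as an
           * O_DIRECT data buffer. */
          buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, 0);
          if (buf == MAP_FAILED)
                  return 1;

          /* ... register buf with ibv_reg_mr() or pass it to
           * pread()/pwrite() on an O_DIRECT file ... */

          munmap(buf, len);
          close(fd);
          return 0;
  }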

Thanks,

Logan


Logan Gunthorpe (6):
  Introduce Peer-to-Peer memory (p2pmem) device
  nvmet: Use p2pmem in nvme target
  scatterlist: Modify SG copy functions to support io memory.
  nvmet: Be careful about using iomem accesses when dealing with p2pmem
  p2pmem: Support device removal
  p2pmem: Added char device user interface

Steve Wise (2):
  cxgb4: setup pcie memory window 4 and create p2pmem region
  p2pmem: Add debugfs "stats" file

 drivers/memory/Kconfig                          |   5 +
 drivers/memory/Makefile                         |   2 +
 drivers/memory/p2pmem.c                         | 697 ++++++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  97 +++-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   5 +
 drivers/nvme/target/configfs.c                  |  31 ++
 drivers/nvme/target/core.c                      |  18 +-
 drivers/nvme/target/fabrics-cmd.c               |  28 +-
 drivers/nvme/target/nvmet.h                     |   2 +
 drivers/nvme/target/rdma.c                      | 183 +++++--
 drivers/scsi/scsi_debug.c                       |   7 +-
 include/linux/p2pmem.h                          | 120 ++++
 include/linux/scatterlist.h                     |   7 +-
 lib/scatterlist.c                               |  64 ++-
 15 files changed, 1189 insertions(+), 80 deletions(-)
 create mode 100644 drivers/memory/p2pmem.c
 create mode 100644 include/linux/p2pmem.h

--
2.1.4


* [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-03-30 22:12 Logan Gunthorpe
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

A p2pmem device is simply a PCI card with a BAR space that points to
regular memory. This may be an independent PCI card or part of another
completely unrelated device (like an IB card or an NVMe card). The
p2pmem device is designed such that other drivers may register p2pmem
memory for use by the system.

p2pmem devices then provide a kernel interface so that other subsystems
can allocate chunks of this memory as necessary to facilitate transfers
between two PCI peers. Depending on hardware, this may reduce the
bandwidth of the transfer but could significantly reduce pressure
on system memory. This may be desirable in many cases: for example, a
system could be designed with a small CPU connected to a PCI switch by a
small number of lanes, which would maximize the number of lanes available
to connect to NVMe devices.
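
A kernel consumer of the interface would look roughly like this sketch
(device names hypothetical):

  struct device *peers[] = { &nic_dev, &nvme_dev, NULL };
  struct p2pmem_dev *p = p2pmem_find_compat(peers);
  void *buf;

  if (p) {
          buf = p2pmem_alloc(p, SZ_64K);
          if (buf) {
                  /* ... program both peers to DMA via buf ... */
                  p2pmem_free(p, buf, SZ_64K);
          }
          p2pmem_put(p);
  }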

Since using p2p memory can often have negative effects, especially
with older PCI root complexes, the code is designed to only utilize the
p2pmem device if all the devices involved in a transfer are behind the
same PCI switch. Other cases may still work or be desirable for some
end users, but it was decided this would be the best course of action
to prevent users from enabling it and wondering why their performance
dropped.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/memory/Kconfig  |   5 +
 drivers/memory/Makefile |   2 +
 drivers/memory/p2pmem.c | 403 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/p2pmem.h  | 103 +++++++++++++
 4 files changed, 513 insertions(+)
 create mode 100644 drivers/memory/p2pmem.c
 create mode 100644 include/linux/p2pmem.h

diff --git a/drivers/memory/Kconfig b/drivers/memory/Kconfig
index ec80e35..4a02cd3 100644
--- a/drivers/memory/Kconfig
+++ b/drivers/memory/Kconfig
@@ -146,3 +146,8 @@ source "drivers/memory/samsung/Kconfig"
 source "drivers/memory/tegra/Kconfig"
 
 endif
+
+config P2PMEM
+	bool "Peer 2 Peer Memory Device Support"
+	help
+	  This driver is for peer 2 peer memory device managers.
diff --git a/drivers/memory/Makefile b/drivers/memory/Makefile
index e88097fb..260bfe9 100644
--- a/drivers/memory/Makefile
+++ b/drivers/memory/Makefile
@@ -21,3 +21,5 @@ obj-$(CONFIG_DA8XX_DDRCTL)	+= da8xx-ddrctl.o
 
 obj-$(CONFIG_SAMSUNG_MC)	+= samsung/
 obj-$(CONFIG_TEGRA_MC)		+= tegra/
+
+obj-$(CONFIG_P2PMEM)        += p2pmem.o
diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
new file mode 100644
index 0000000..c4ea311
--- /dev/null
+++ b/drivers/memory/p2pmem.c
@@ -0,0 +1,403 @@
+/*
+ * Peer 2 Peer Memory Device
+ * Copyright (c) 2016, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include <linux/p2pmem.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+
+MODULE_DESCRIPTION("Peer 2 Peer Memory Device");
+MODULE_VERSION("0.1");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Microsemi Corporation");
+
+static struct class *p2pmem_class;
+static DEFINE_IDA(p2pmem_ida);
+
+static struct p2pmem_dev *to_p2pmem(struct device *dev)
+{
+	return container_of(dev, struct p2pmem_dev, dev);
+}
+
+static void p2pmem_percpu_release(struct percpu_ref *ref)
+{
+	struct p2pmem_dev *p = container_of(ref, struct p2pmem_dev, ref);
+
+	complete_all(&p->cmp);
+}
+
+static void p2pmem_percpu_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+
+	percpu_ref_exit(ref);
+}
+
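+/*
+ * Block new references and wait for all outstanding percpu references
+ * to be released, so the ZONE_DEVICE pages are idle before the memory
+ * mapping is torn down.
+ */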
+static void p2pmem_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct p2pmem_dev *p = container_of(ref, struct p2pmem_dev, ref);
+
+	if (percpu_ref_is_dying(ref))
+		return;
+
+	percpu_ref_kill(ref);
+	wait_for_completion(&p->cmp);
+}
+
+static void p2pmem_release(struct device *dev)
+{
+	struct p2pmem_dev *p = to_p2pmem(dev);
+
+	if (p->pool)
+		gen_pool_destroy(p->pool);
+
+	kfree(p);
+}
+
+/**
+ * p2pmem_create() - create a new p2pmem device
+ * @parent: the parent device to create it under
+ *
+ * Return value is a pointer to the new device or an ERR_PTR
+ * on failure.
+ */
+struct p2pmem_dev *p2pmem_create(struct device *parent)
+{
+	struct p2pmem_dev *p;
+	int nid = dev_to_node(parent);
+	int rc;
+
+	p = kzalloc_node(sizeof(*p), GFP_KERNEL, nid);
+	if (!p)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&p->cmp);
+	device_initialize(&p->dev);
+	p->dev.class = p2pmem_class;
+	p->dev.parent = parent;
+	p->dev.release = p2pmem_release;
+
+	p->id = ida_simple_get(&p2pmem_ida, 0, 0, GFP_KERNEL);
+	if (p->id < 0) {
+		rc = p->id;
+		goto err_free;
+	}
+
+	dev_set_name(&p->dev, "p2pmem%d", p->id);
+
+	p->pool = gen_pool_create(PAGE_SHIFT, nid);
+	if (!p->pool) {
+		rc = -ENOMEM;
+		goto err_id;
+	}
+
+	rc = percpu_ref_init(&p->ref, p2pmem_percpu_release, 0,
+			     GFP_KERNEL);
+	if (rc)
+		goto err_id;
+
+	rc = devm_add_action_or_reset(&p->dev, p2pmem_percpu_exit, &p->ref);
+	if (rc)
+		goto err_id;
+
+	rc = device_add(&p->dev);
+	if (rc)
+		goto err_id;
+
+	dev_info(&p->dev, "registered");
+
+	return p;
+
+err_id:
+	ida_simple_remove(&p2pmem_ida, p->id);
+err_free:
+	put_device(&p->dev);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL(p2pmem_create);
+
+/**
+ * p2pmem_unregister() - unregister a p2pmem device
+ * @p: the device to unregister
+ *
+ * The device will remain until all users are done with it
+ */
+void p2pmem_unregister(struct p2pmem_dev *p)
+{
+	if (!p)
+		return;
+
+	dev_info(&p->dev, "unregistered");
+	device_del(&p->dev);
+	ida_simple_remove(&p2pmem_ida, p->id);
+	put_device(&p->dev);
+}
+EXPORT_SYMBOL(p2pmem_unregister);
+
+/**
+ * p2pmem_add_resource() - add memory for use as p2pmem to the device
+ * @p: the device to add the memory to
+ * @res: resource describing the memory
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any dma request.
+ */
+int p2pmem_add_resource(struct p2pmem_dev *p, struct resource *res)
+{
+	int rc;
+	void *addr;
+	int nid = dev_to_node(&p->dev);
+
+	addr = devm_memremap_pages(&p->dev, res, &p->ref, NULL);
+	if (IS_ERR(addr))
+		return PTR_ERR(addr);
+
+	rc = gen_pool_add_virt(p->pool, (unsigned long)addr,
+			       res->start, resource_size(res), nid);
+	if (rc)
+		return rc;
+
+	rc = devm_add_action_or_reset(&p->dev, p2pmem_percpu_kill, &p->ref);
+	if (rc)
+		return rc;
+
+	dev_info(&p->dev, "added %pR", res);
+
+	return 0;
+}
+EXPORT_SYMBOL(p2pmem_add_resource);
+
+struct pci_region {
+	struct pci_dev *pdev;
+	int bar;
+};
+
+static void p2pmem_release_pci_region(void *data)
+{
+	struct pci_region *r = data;
+
+	pci_release_region(r->pdev, r->bar);
+	kfree(r);
+}
+
+/**
+ * p2pmem_add_pci_region() - request and add an entire PCI region to the
+ *	specified p2pmem device
+ * @p: the device to add the memory to
+ * @pdev: pci device to register the bar from
+ * @bar: the bar number to add
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any dma request.
+ */
+int p2pmem_add_pci_region(struct p2pmem_dev *p, struct pci_dev *pdev, int bar)
+{
+	int rc;
+	struct pci_region *r;
+
+	r = kzalloc(sizeof(*r), GFP_KERNEL);
+	if (!r)
+		return -ENOMEM;
+
+	r->pdev = pdev;
+	r->bar = bar;
+
+	rc = pci_request_region(pdev, bar, dev_name(&p->dev));
+	if (rc < 0)
+		goto err_pci;
+
+	rc = p2pmem_add_resource(p, &pdev->resource[bar]);
+	if (rc < 0)
+		goto err_add;
+
+	rc = devm_add_action_or_reset(&p->dev, p2pmem_release_pci_region, r);
+	if (rc)
+		return rc;
+
+	return 0;
+
+err_add:
+	pci_release_region(pdev, bar);
+err_pci:
+	kfree(r);
+	return rc;
+}
+EXPORT_SYMBOL(p2pmem_add_pci_region);
+
+/**
+ * p2pmem_alloc() - allocate some p2p memory
+ * @p: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error
+ */
+void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
+{
+	return (void *)gen_pool_alloc(p->pool, size);
+}
+EXPORT_SYMBOL(p2pmem_alloc);
+
+/**
+ * p2pmem_free() - free allocated p2p memory
+ * @p: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size)
+{
+	gen_pool_free(p->pool, (unsigned long)addr, size);
+}
+EXPORT_SYMBOL(p2pmem_free);
+
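+/* Walk up the device hierarchy and return the closest ancestor that is
+ * a PCI device, or NULL if there is none. */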
+static struct device *find_parent_pci_dev(struct device *dev)
+{
+	while (dev) {
+		if (dev_is_pci(dev))
+			return dev;
+
+		dev = dev->parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge:
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_switch_port(struct device *dev)
+{
+	struct device *dpci;
+	struct pci_dev *pci;
+
+	dpci = find_parent_pci_dev(dev);
+	if (!dpci)
+		return NULL;
+
+	pci = pci_upstream_bridge(to_pci_dev(dpci));
+	if (!pci)
+		return NULL;
+
+	return pci_upstream_bridge(pci);
+}
+
+static int upstream_bridges_match(struct device *p2pmem,
+				  const void *data)
+{
+	struct device * const *dma_devices = data;
+	struct pci_dev *p2p_up;
+	struct pci_dev *dma_up;
+
+	p2p_up = get_upstream_switch_port(p2pmem);
+	if (!p2p_up) {
+		dev_warn(p2pmem, "p2pmem is not behind a pci switch");
+		return false;
+	}
+
+	while (*dma_devices) {
+		dma_up = get_upstream_switch_port(*dma_devices);
+
+		if (!dma_up) {
+			dev_dbg(p2pmem, "%s is not a pci device behind a switch",
+				dev_name(*dma_devices));
+			return false;
+		}
+
+		if (p2p_up != dma_up) {
+			dev_dbg(p2pmem,
+				"%s does not reside on the same upstream bridge",
+				dev_name(*dma_devices));
+			return false;
+		}
+
+		dev_dbg(p2pmem, "%s is compatible", dev_name(*dma_devices));
+		dma_devices++;
+	}
+
+	return true;
+}
+
+/**
+ * p2pmem_find_compat() - find a p2pmem device compatible with the
+ *	specified devices
+ * @dma_devices: a null terminated array of device pointers which
+ *	all must be compatible with the returned p2pmem device
+ *
+ * For now, we only support cases where all the devices that
+ * will transfer to the p2pmem device are on the same switch.
+ * This cuts out cases that may work but is safest for the user.
+ * We also do not presently support cases where two devices
+ * are behind multiple levels of switches even though this would
+ * likely work fine.
+ *
+ * Future work could be done to whitelist root ports that are known
+ * to be good and support many levels of switches. Additionally,
+ * it would make sense to choose the topologically closest p2pmem
+ * for a given setup. (Presently we only return the first that matches.)
+ *
+ * Returns a pointer to the p2pmem device with the reference taken
+ * (use p2pmem_put to return the reference) or NULL if no compatible
+ * p2pmem device is found.
+ */
+struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices)
+{
+	struct device *dev;
+
+	dev = class_find_device(p2pmem_class, NULL, dma_devices,
+				upstream_bridges_match);
+
+	if (!dev)
+		return NULL;
+
+	return to_p2pmem(dev);
+}
+EXPORT_SYMBOL(p2pmem_find_compat);
+
+/**
+ * p2pmem_put() - decrement a p2pmem device reference
+ * @p: p2pmem device to return
+ *
+ * Dereference and free (if last) the device's reference counter.
+ * It's safe to pass a NULL pointer to this function.
+ */
+void p2pmem_put(struct p2pmem_dev *p)
+{
+	if (p)
+		put_device(&p->dev);
+}
+EXPORT_SYMBOL(p2pmem_put);
+
+static int __init p2pmem_init(void)
+{
+	p2pmem_class = class_create(THIS_MODULE, "p2pmem");
+	if (IS_ERR(p2pmem_class))
+		return PTR_ERR(p2pmem_class);
+
+	return 0;
+}
+module_init(p2pmem_init);
+
+static void __exit p2pmem_exit(void)
+{
+	class_destroy(p2pmem_class);
+
+	pr_info(KBUILD_MODNAME ": unloaded.\n");
+}
+module_exit(p2pmem_exit);
diff --git a/include/linux/p2pmem.h b/include/linux/p2pmem.h
new file mode 100644
index 0000000..71dc1e1
--- /dev/null
+++ b/include/linux/p2pmem.h
@@ -0,0 +1,103 @@
+/*
+ * Peer 2 Peer Memory Device
+ * Copyright (c) 2016, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#ifndef __P2PMEM_H__
+#define __P2PMEM_H__
+
+#include <linux/device.h>
+#include <linux/pci.h>
+
+struct p2pmem_dev {
+	struct device dev;
+	int id;
+
+	struct percpu_ref ref;
+	struct completion cmp;
+	struct gen_pool *pool;
+};
+
+#ifdef CONFIG_P2PMEM
+
+struct p2pmem_dev *p2pmem_create(struct device *parent);
+void p2pmem_unregister(struct p2pmem_dev *p);
+
+int p2pmem_add_resource(struct p2pmem_dev *p, struct resource *res);
+int p2pmem_add_pci_region(struct p2pmem_dev *p, struct pci_dev *pdev, int bar);
+
+void *p2pmem_alloc(struct p2pmem_dev *p, size_t size);
+void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size);
+
+struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices);
+void p2pmem_put(struct p2pmem_dev *p);
+
+#else
+
+static inline struct p2pmem_dev *p2pmem_create(struct device *parent)
+{
+	return NULL;
+}
+
+static inline void p2pmem_unregister(struct p2pmem_dev *p)
+{
+}
+
+static inline int p2pmem_add_resource(struct p2pmem_dev *p,
+				      struct resource *res)
+{
+	return -ENODEV;
+}
+
+static inline int p2pmem_add_pci_region(struct p2pmem_dev *p,
+					struct pci_dev *pdev, int bar)
+{
+	return -ENODEV;
+}
+
+static inline void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
+{
+	return NULL;
+}
+
+static inline void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size)
+{
+}
+
+static inline struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devs)
+{
+	return NULL;
+}
+
+static inline void p2pmem_put(struct p2pmem_dev *p)
+{
+}
+
+#endif
+
+static inline struct page *p2pmem_alloc_page(struct p2pmem_dev *p)
+{
+	void *addr = p2pmem_alloc(p, PAGE_SIZE);
+
+	if (addr)
+		return virt_to_page(addr);
+
+	return NULL;
+}
+
+static inline void p2pmem_free_page(struct p2pmem_dev *p, struct page *pg)
+{
+	p2pmem_free(p, page_to_virt(pg), PAGE_SIZE);
+}
+
+#endif
-- 
2.1.4


+
+	dev_info(&p->dev, "registered");
+
+	return p;
+
+err_id:
+	ida_simple_remove(&p2pmem_ida, p->id);
+err_free:
+	put_device(&p->dev);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL(p2pmem_create);
+
+/**
+ * p2pmem_unregister() - unregister a p2pmem device
+ * @p: the device to unregister
+ *
+ * The device will remain until all users are done with it
+ */
+void p2pmem_unregister(struct p2pmem_dev *p)
+{
+	if (!p)
+		return;
+
+	dev_info(&p->dev, "unregistered");
+	device_del(&p->dev);
+	ida_simple_remove(&p2pmem_ida, p->id);
+	put_device(&p->dev);
+}
+EXPORT_SYMBOL(p2pmem_unregister);
+
+/**
+ * p2pmem_add_resource() - add memory for use as p2pmem to the device
+ * @p: the device to add the memory to
+ * @res: resource describing the memory
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any dma request.
+ */
+int p2pmem_add_resource(struct p2pmem_dev *p, struct resource *res)
+{
+	int rc;
+	void *addr;
+	int nid = dev_to_node(&p->dev);
+
+	addr = devm_memremap_pages(&p->dev, res, &p->ref, NULL);
+	if (IS_ERR(addr))
+		return PTR_ERR(addr);
+
+	rc = gen_pool_add_virt(p->pool, (unsigned long)addr,
+			       res->start, resource_size(res), nid);
+	if (rc)
+		return rc;
+
+	rc = devm_add_action_or_reset(&p->dev, p2pmem_percpu_kill, &p->ref);
+	if (rc)
+		return rc;
+
+	dev_info(&p->dev, "added %pR", res);
+
+	return 0;
+}
+EXPORT_SYMBOL(p2pmem_add_resource);
+
+struct pci_region {
+	struct pci_dev *pdev;
+	int bar;
+};
+
+static void p2pmem_release_pci_region(void *data)
+{
+	struct pci_region *r = data;
+
+	pci_release_region(r->pdev, r->bar);
+	kfree(r);
+}
+
+/**
+ * p2pmem_add_pci_region() - request and add an entire PCI region to the
+ *	specified p2pmem device
+ * @p: the device to add the memory to
+ * @pdev: pci device to register the bar from
+ * @bar: the bar number to add
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any dma request.
+ */
+int p2pmem_add_pci_region(struct p2pmem_dev *p, struct pci_dev *pdev, int bar)
+{
+	int rc;
+	struct pci_region *r;
+
+	r = kzalloc(sizeof(*r), GFP_KERNEL);
+	if (!r)
+		return -ENOMEM;
+
+	r->pdev = pdev;
+	r->bar = bar;
+
+	rc = pci_request_region(pdev, bar, dev_name(&p->dev));
+	if (rc < 0)
+		goto err_pci;
+
+	rc = p2pmem_add_resource(p, &pdev->resource[bar]);
+	if (rc < 0)
+		goto err_add;
+
+	rc = devm_add_action_or_reset(&p->dev, p2pmem_release_pci_region, r);
+	if (rc)
+		return rc;
+
+	return 0;
+
+err_add:
+	pci_release_region(pdev, bar);
+err_pci:
+	kfree(r);
+	return rc;
+}
+EXPORT_SYMBOL(p2pmem_add_pci_region);
+
+/**
+ * p2pmem_alloc() - allocate some p2p memory
+ * @p: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error
+ */
+void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
+{
+	return (void *)gen_pool_alloc(p->pool, size);
+}
+EXPORT_SYMBOL(p2pmem_alloc);
+
+/**
+ * p2pmem_free() - free allocated p2p memory
+ * @p: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size)
+{
+	gen_pool_free(p->pool, (unsigned long)addr, size);
+}
+EXPORT_SYMBOL(p2pmem_free);
+
+static struct device *find_parent_pci_dev(struct device *dev)
+{
+	while (dev) {
+		if (dev_is_pci(dev))
+			return dev;
+
+		dev = dev->parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * the first returns the downstream switch port the device is attached
+ * to, and the second returns that switch's upstream port. Because of
+ * this, devices connected directly to a root port will be rejected.
+ */
+static struct pci_dev *get_upstream_switch_port(struct device *dev)
+{
+	struct device *dpci;
+	struct pci_dev *pci;
+
+	dpci = find_parent_pci_dev(dev);
+	if (!dpci)
+		return NULL;
+
+	pci = pci_upstream_bridge(to_pci_dev(dpci));
+	if (!pci)
+		return NULL;
+
+	return pci_upstream_bridge(pci);
+}
+
+static int upstream_bridges_match(struct device *p2pmem,
+				  const void *data)
+{
+	struct device * const *dma_devices = data;
+	struct pci_dev *p2p_up;
+	struct pci_dev *dma_up;
+
+	p2p_up = get_upstream_switch_port(p2pmem);
+	if (!p2p_up) {
+		dev_warn(p2pmem, "p2pmem is not behind a pci switch");
+		return false;
+	}
+
+	while (*dma_devices) {
+		dma_up = get_upstream_switch_port(*dma_devices);
+
+		if (!dma_up) {
+			dev_dbg(p2pmem, "%s is not a pci device behind a switch",
+				dev_name(*dma_devices));
+			return false;
+		}
+
+		if (p2p_up != dma_up) {
+			dev_dbg(p2pmem,
+				"%s does not reside on the same upstream bridge",
+				dev_name(*dma_devices));
+			return false;
+		}
+
+		dev_dbg(p2pmem, "%s is compatible", dev_name(*dma_devices));
+		dma_devices++;
+	}
+
+	return true;
+}
+
+/**
+ * p2pmem_find_compat() - find a p2pmem device compatible with the
+ *	specified devices
+ * @dma_devices: a NULL-terminated array of device pointers, all of
+ *	which must be compatible with the returned p2pmem device
+ *
+ * For now, we only support cases where all the devices that
+ * will transfer to the p2pmem device are behind the same switch.
+ * This excludes some configurations that might work, but it is
+ * the safest option for the user. We also do not presently support
+ * cases where two devices are behind multiple levels of switches,
+ * even though such setups would likely work fine.
+ *
+ * Future work could whitelist root ports that are known to be good
+ * and support many levels of switches. Additionally, it would make
+ * sense to choose the topologically closest p2pmem for a given setup.
+ * (Presently we only return the first that matches.)
+ *
+ * Returns a pointer to the p2pmem device with the reference taken
+ * (use p2pmem_put to return the reference) or NULL if no compatible
+ * p2pmem device is found.
+ */
+struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices)
+{
+	struct device *dev;
+
+	dev = class_find_device(p2pmem_class, NULL, dma_devices,
+				upstream_bridges_match);
+
+	if (!dev)
+		return NULL;
+
+	return to_p2pmem(dev);
+}
+EXPORT_SYMBOL(p2pmem_find_compat);
+
+/**
+ * p2pmem_put() - decrement a p2pmem device reference
+ * @p: p2pmem device to return
+ *
+ * Drop a reference to the device and free it if this was the last
+ * reference. It's safe to pass a NULL pointer to this function.
+ */
+void p2pmem_put(struct p2pmem_dev *p)
+{
+	if (p)
+		put_device(&p->dev);
+}
+EXPORT_SYMBOL(p2pmem_put);
+
+static int __init p2pmem_init(void)
+{
+	p2pmem_class = class_create(THIS_MODULE, "p2pmem");
+	if (IS_ERR(p2pmem_class))
+		return PTR_ERR(p2pmem_class);
+
+	return 0;
+}
+module_init(p2pmem_init);
+
+static void __exit p2pmem_exit(void)
+{
+	class_destroy(p2pmem_class);
+
+	pr_info(KBUILD_MODNAME ": unloaded.\n");
+}
+module_exit(p2pmem_exit);
diff --git a/include/linux/p2pmem.h b/include/linux/p2pmem.h
new file mode 100644
index 0000000..71dc1e1
--- /dev/null
+++ b/include/linux/p2pmem.h
@@ -0,0 +1,103 @@
+/*
+ * Peer 2 Peer Memory Device
+ * Copyright (c) 2016, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#ifndef __P2PMEM_H__
+#define __P2PMEM_H__
+
+#include <linux/device.h>
+#include <linux/pci.h>
+
+struct p2pmem_dev {
+	struct device dev;
+	int id;
+
+	struct percpu_ref ref;
+	struct completion cmp;
+	struct gen_pool *pool;
+};
+
+#ifdef CONFIG_P2PMEM
+
+struct p2pmem_dev *p2pmem_create(struct device *parent);
+void p2pmem_unregister(struct p2pmem_dev *p);
+
+int p2pmem_add_resource(struct p2pmem_dev *p, struct resource *res);
+int p2pmem_add_pci_region(struct p2pmem_dev *p, struct pci_dev *pdev, int bar);
+
+void *p2pmem_alloc(struct p2pmem_dev *p, size_t size);
+void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size);
+
+struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices);
+void p2pmem_put(struct p2pmem_dev *p);
+
+#else
+
+static inline struct p2pmem_dev *p2pmem_create(struct device *parent)
+{
+	return ERR_PTR(-ENODEV);
+}
+
+static inline void p2pmem_unregister(struct p2pmem_dev *p)
+{
+}
+
+static inline int p2pmem_add_resource(struct p2pmem_dev *p,
+				      struct resource *res)
+{
+	return -ENODEV;
+}
+
+static inline int p2pmem_add_pci_region(struct p2pmem_dev *p,
+					struct pci_dev *pdev, int bar)
+{
+	return -ENODEV;
+}
+
+static inline void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
+{
+	return NULL;
+}
+
+static inline void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size)
+{
+}
+
+static inline struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devs)
+{
+	return NULL;
+}
+
+static inline void p2pmem_put(struct p2pmem_dev *p)
+{
+}
+
+#endif
+
+static inline struct page *p2pmem_alloc_page(struct p2pmem_dev *p)
+{
+	void *addr = p2pmem_alloc(p, PAGE_SIZE);
+
+	if (addr)
+		return virt_to_page(addr);
+
+	return NULL;
+}
+
+static inline void p2pmem_free_page(struct p2pmem_dev *p, struct page *pg)
+{
+	p2pmem_free(p, page_to_virt(pg), PAGE_SIZE);
+}
+
+#endif
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread
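
[Usage note on p2pmem_find_compat() above: a consumer holding, say, an
RDMA NIC and an NVMe device would look up and release a compatible
p2pmem device as sketched below; nic_pdev and nvme_pdev are
illustrative names, not symbols from this series.]

    struct device *dma_devs[] = { &nic_pdev->dev, &nvme_pdev->dev, NULL };
    struct p2pmem_dev *p;

    p = p2pmem_find_compat(dma_devs);
    if (p) {
            /* every device in dma_devs shares an upstream switch port with p */
            /* ... allocate with p2pmem_alloc(), free with p2pmem_free() ... */
            p2pmem_put(p);
    }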

* [RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

From: Steve Wise <swise@opengridcomputing.com>

Some cxgb4 cards expose memory as part of BAR4. This patch registers
this memory as a p2pmem device.
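
To make the window arithmetic in setup_memwin_p2pmem() below concrete,
here is a worked example (the reported size is hypothetical):

    mem_size = 0xc0000   /* 768 KB reported by the adapter */
    rounded  = 0x100000  /* roundup_pow_of_two(mem_size) = 1 MB */
    sz_kb    = 0x400     /* rounded >> 10 */
    pcieofst = 0x100000  /* sz_kb << 10: 1 MB-aligned and past the
                            8 KB MSI-X table at the bottom of BAR4 */

init_p2pmem() then carves the same 1 MB out of BAR4 starting at that
offset.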

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 97 ++++++++++++++++++++++++-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |  5 ++
 3 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 163543b..e92443b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -48,6 +48,7 @@
 #include <linux/vmalloc.h>
 #include <linux/etherdevice.h>
 #include <linux/net_tstamp.h>
+#include <linux/p2pmem.h>
 #include <asm/io.h>
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -859,6 +860,8 @@ struct adapter {
 
 	/* TC u32 offload */
 	struct cxgb4_tc_u32_table *tc_u32;
+
+	struct p2pmem_dev *p2pmem;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index afb0967..a33bcd1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -172,6 +172,11 @@ module_param(select_queue, int, 0644);
 MODULE_PARM_DESC(select_queue,
 		 "Select between kernel provided method of selecting or driver method of selecting TX queue. Default is kernel method.");
 
+static bool use_p2pmem;
+module_param(use_p2pmem, bool, 0644);
+MODULE_PARM_DESC(use_p2pmem,
+		 "Enable registering a p2pmem device with bar space (if available)");
+
 static struct dentry *cxgb4_debugfs_root;
 
 LIST_HEAD(adapter_list);
@@ -2835,6 +2840,54 @@ static void setup_memwin_rdma(struct adapter *adap)
 	}
 }
 
+static void setup_memwin_p2pmem(struct adapter *adap)
+{
+	unsigned int mem_base = t4_read_reg(adap, CIM_EXTMEM2_BASE_ADDR_A);
+	unsigned int mem_size = t4_read_reg(adap, CIM_EXTMEM2_ADDR_SIZE_A);
+
+	if (!use_p2pmem)
+		return;
+
+	if (mem_base != 0 && mem_size != 0) {
+		unsigned int sz_kb, pcieofst;
+
+		sz_kb = roundup_pow_of_two(mem_size) >> 10;
+
+		/*
+		 * The start offset must be aligned to the window size.
+		 * Also, BAR4 has MSIX vectors using the first 8KB.
+		 * Further, the min allowed p2pmem region size is 1MB,
+		 * so set the start offset to the memory size and we're aligned
+		 * as well as past the 8KB vector table.
+		 */
+		pcieofst = sz_kb << 10;
+
+		dev_info(adap->pdev_dev,
+			 "p2pmem base 0x%x, size %uB, ilog2(sz_kb) 0x%x, "
+			 "pcieofst 0x%X\n", mem_base, mem_size, ilog2(sz_kb),
+			 pcieofst);
+
+		/* Write the window offset and size */
+		t4_write_reg(adap,
+			PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_BASE_WIN_A,
+					    MEMWIN_RSVD4),
+			pcieofst | BIR_V(2) | WINDOW_V(ilog2(sz_kb)));
+
+		/* Write the adapter memory base/start */
+		t4_write_reg(adap,
+			PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_OFFSET_A,
+					    MEMWIN_RSVD4),
+			MEMOFST_V((mem_base >> MEMOFST_S)) | PFNUM_V(adap->pf));
+
+		/* Read it back to flush it */
+		t4_read_reg(adap,
+			PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_OFFSET_A,
+					    MEMWIN_RSVD4));
+	} else
+		dev_info(adap->pdev_dev, "p2pmem memory not reserved, "
+			 "base 0x%x size %uB\n", mem_base, mem_size);
+}
+
 static int adap_init1(struct adapter *adap, struct fw_caps_config_cmd *c)
 {
 	u32 v;
@@ -4622,6 +4675,42 @@ static int cxgb4_iov_configure(struct pci_dev *pdev, int num_vfs)
 }
 #endif
 
+static int init_p2pmem(struct adapter *adapter)
+{
+	unsigned int mem_size = t4_read_reg(adapter, CIM_EXTMEM2_ADDR_SIZE_A);
+	struct p2pmem_dev *p;
+	int rc;
+	struct resource res;
+
+	if (!mem_size || !use_p2pmem)
+		return 0;
+
+	mem_size = roundup_pow_of_two(mem_size);
+
+	/*
+	 * Create a subset of BAR4 for the p2pmem region based on the
+	 * exported memory size.
+	 */
+	memcpy(&res, &adapter->pdev->resource[4], sizeof(res));
+	res.start += mem_size;
+	res.end = res.start + mem_size - 1;
+	dev_info(adapter->pdev_dev, "p2pmem resource start 0x%llx end 0x%llx size %lluB\n",
+		 res.start, res.end, resource_size(&res));
+
+	p = p2pmem_create(&adapter->pdev->dev);
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	rc = p2pmem_add_resource(p, &res);
+	if (rc) {
+		p2pmem_unregister(p);
+		return rc;
+	}
+	adapter->p2pmem = p;
+
+	return 0;
+}
+
 static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	int func, i, err, s_qpp, qpp, num_seg;
@@ -4784,8 +4873,8 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	bitmap_zero(adapter->sge.blocked_fl, adapter->sge.egr_sz);
 #endif
 	setup_memwin_rdma(adapter);
-	if (err)
-		goto out_unmap_bar;
+
+	setup_memwin_p2pmem(adapter);
 
 	/* configure SGE_STAT_CFG_A to read WC stats */
 	if (!is_t4(adapter->params.chip))
@@ -4989,6 +5078,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	print_adapter_info(adapter);
 	setup_fw_sge_queues(adapter);
+	init_p2pmem(adapter);
 	return 0;
 
 sriov:
@@ -5047,7 +5137,6 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		free_msix_info(adapter);
 	if (adapter->num_uld || adapter->num_ofld_uld)
 		t4_uld_mem_free(adapter);
- out_unmap_bar:
 	if (!is_t4(adapter->params.chip))
 		iounmap(adapter->bar2);
  out_free_adapter:
@@ -5075,6 +5164,8 @@ static void remove_one(struct pci_dev *pdev)
 		return;
 	}
 
+	p2pmem_unregister(adapter->p2pmem);
+
 	if (adapter->pf == 4) {
 		int i;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
index 3348d33..199ddfb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
@@ -735,6 +735,11 @@
 
 #define PCIE_MEM_ACCESS_OFFSET_A 0x306c
 
+#define MEMOFST_S    7
+#define MEMOFST_M    0x1ffffffU
+#define MEMOFST_V(x) ((x) << MEMOFST_S)
+#define MEMOFST_G(x) (((x) >> MEMOFST_S) & MEMOFST_M)
+
 #define ENABLE_S    30
 #define ENABLE_V(x) ((x) << ENABLE_S)
 #define ENABLE_F    ENABLE_V(1U)
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread

* [RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

From: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>

Some cxgb4 cards expose memory as part of BAR4. This patch registers
this memory as a p2pmem device.

Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
Signed-off-by: Stephen Bates <sbates-pv7U853sEMVWk0Htik3J/w@public.gmane.org>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 97 ++++++++++++++++++++++++-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |  5 ++
 3 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 163543b..e92443b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -48,6 +48,7 @@
 #include <linux/vmalloc.h>
 #include <linux/etherdevice.h>
 #include <linux/net_tstamp.h>
+#include <linux/p2pmem.h>
 #include <asm/io.h>
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -859,6 +860,8 @@ struct adapter {
 
 	/* TC u32 offload */
 	struct cxgb4_tc_u32_table *tc_u32;
+
+	struct p2pmem_dev *p2pmem;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index afb0967..a33bcd1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -172,6 +172,11 @@ module_param(select_queue, int, 0644);
 MODULE_PARM_DESC(select_queue,
 		 "Select between kernel provided method of selecting or driver method of selecting TX queue. Default is kernel method.");
 
+static bool use_p2pmem;
+module_param(use_p2pmem, bool, 0644);
+MODULE_PARM_DESC(use_p2pmem,
+		 "Enable registering a p2pmem device with bar space (if available)");
+
 static struct dentry *cxgb4_debugfs_root;
 
 LIST_HEAD(adapter_list);
@@ -2835,6 +2840,54 @@ static void setup_memwin_rdma(struct adapter *adap)
 	}
 }
 
+static void setup_memwin_p2pmem(struct adapter *adap)
+{
+	unsigned int mem_base = t4_read_reg(adap, CIM_EXTMEM2_BASE_ADDR_A);
+	unsigned int mem_size = t4_read_reg(adap, CIM_EXTMEM2_ADDR_SIZE_A);
+
+	if (!use_p2pmem)
+		return;
+
+	if (mem_base != 0 && mem_size != 0) {
+		unsigned int sz_kb, pcieofst;
+
+		sz_kb = roundup_pow_of_two(mem_size) >> 10;
+
+		/*
+		 * The start offset must be aligned to the window size.
+		 * Also, BAR4 has MSIX vectors using the first 8KB.
+		 * Further, the min allowed p2pmem region size is 1MB,
+		 * so set the start offset to the memory size and we're aligned
+		 * as well as past the 8KB vector table.
+		 */
+		pcieofst = sz_kb << 10;
+
+		dev_info(adap->pdev_dev,
+			 "p2pmem base 0x%x, size %uB, ilog2(sk_kb) 0x%x, "
+			 "pcieofst 0x%X\n", mem_base, mem_size, ilog2(sz_kb),
+			 pcieofst);
+
+		/* Write the window offset and size */
+		t4_write_reg(adap,
+			PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_BASE_WIN_A,
+					    MEMWIN_RSVD4),
+			pcieofst | BIR_V(2) | WINDOW_V(ilog2(sz_kb)));
+
+		/* Write the adapter memory base/start */
+		t4_write_reg(adap,
+			PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_OFFSET_A,
+					    MEMWIN_RSVD4),
+			MEMOFST_V((mem_base >> MEMOFST_S)) | PFNUM_V(adap->pf));
+
+		/* Read it back to flush it */
+		t4_read_reg(adap,
+			PCIE_MEM_ACCESS_REG(PCIE_MEM_ACCESS_OFFSET_A,
+					    MEMWIN_RSVD4));
+	} else
+		dev_info(adap->pdev_dev, "p2pmem memory not reserved, "
+			 "base 0x%x size %uB\n", mem_base, mem_size);
+}
+
 static int adap_init1(struct adapter *adap, struct fw_caps_config_cmd *c)
 {
 	u32 v;
@@ -4622,6 +4675,42 @@ static int cxgb4_iov_configure(struct pci_dev *pdev, int num_vfs)
 }
 #endif
 
+static int init_p2pmem(struct adapter *adapter)
+{
+	unsigned int mem_size = t4_read_reg(adapter, CIM_EXTMEM2_ADDR_SIZE_A);
+	struct p2pmem_dev *p;
+	int rc;
+	struct resource res;
+
+	if (!mem_size || !use_p2pmem)
+		return 0;
+
+	mem_size = roundup_pow_of_two(mem_size);
+
+	/*
+	 * Create a subset of BAR4 for the p2pmem region based on the
+	 * exported memory size.
+	 */
+	memcpy(&res, &adapter->pdev->resource[4], sizeof(res));
+	res.start += mem_size;
+	res.end = res.start + mem_size - 1;
+	dev_info(adapter->pdev_dev, "p2pmem resource start 0x%llx end 0x%llx size %lluB\n",
+		 res.start, res.end, resource_size(&res));
+
+	p = p2pmem_create(&adapter->pdev->dev);
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	rc = p2pmem_add_resource(p, &res);
+	if (rc) {
+		p2pmem_unregister(p);
+		return rc;
+	}
+	adapter->p2pmem = p;
+
+	return 0;
+}
+
 static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	int func, i, err, s_qpp, qpp, num_seg;
@@ -4784,8 +4873,8 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	bitmap_zero(adapter->sge.blocked_fl, adapter->sge.egr_sz);
 #endif
 	setup_memwin_rdma(adapter);
-	if (err)
-		goto out_unmap_bar;
+
+	setup_memwin_p2pmem(adapter);
 
 	/* configure SGE_STAT_CFG_A to read WC stats */
 	if (!is_t4(adapter->params.chip))
@@ -4989,6 +5078,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	print_adapter_info(adapter);
 	setup_fw_sge_queues(adapter);
+	init_p2pmem(adapter);
 	return 0;
 
 sriov:
@@ -5047,7 +5137,6 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		free_msix_info(adapter);
 	if (adapter->num_uld || adapter->num_ofld_uld)
 		t4_uld_mem_free(adapter);
- out_unmap_bar:
 	if (!is_t4(adapter->params.chip))
 		iounmap(adapter->bar2);
  out_free_adapter:
@@ -5075,6 +5164,8 @@ static void remove_one(struct pci_dev *pdev)
 		return;
 	}
 
+	p2pmem_unregister(adapter->p2pmem);
+
 	if (adapter->pf == 4) {
 		int i;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
index 3348d33..199ddfb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
@@ -735,6 +735,11 @@
 
 #define PCIE_MEM_ACCESS_OFFSET_A 0x306c
 
+#define MEMOFST_S    7
+#define MEMOFST_M    0x1ffffffU
+#define MEMOFST_V(x) ((x) << MEMOFST_S)
+#define MEMOFST_G(x) (((x) >> MEMOFST_S) & MEMOFST_M)
+
 #define ENABLE_S    30
 #define ENABLE_V(x) ((x) << ENABLE_S)
 #define ENABLE_F    ENABLE_V(1U)
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread
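
For readers following the window arithmetic in setup_memwin_p2pmem()
and init_p2pmem() above, here is a standalone sketch of the BAR4
layout math. It is illustrative only, not part of the patch: the
mem_size and BAR4 address values below are invented, where the driver
reads the real ones from the CIM_EXTMEM2_* registers and from
pdev->resource[4].

/*
 * Standalone sketch of the BAR4 layout math in setup_memwin_p2pmem()
 * and init_p2pmem(). Example values only.
 */
#include <stdint.h>
#include <stdio.h>

static uint32_t roundup_pow_of_two32(uint32_t v)
{
	v--;
	v |= v >> 1; v |= v >> 2; v |= v >> 4;
	v |= v >> 8; v |= v >> 16;
	return v + 1;
}

static unsigned int ilog2_32(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

int main(void)
{
	uint32_t mem_size = 3u << 20;		/* example: 3MB exported  */
	uint64_t bar4_start = 0xf0000000ull;	/* example BAR4 address   */

	uint32_t rounded = roundup_pow_of_two32(mem_size);	/* 4MB    */
	uint32_t sz_kb = rounded >> 10;
	uint32_t pcieofst = sz_kb << 10;	/* == rounded size        */

	/*
	 * The offset equals the (power-of-two) window size, so it is
	 * window-size aligned, and because the minimum region is 1MB
	 * it also clears the 8KB MSI-X table at the start of BAR4.
	 */
	printf("window: 2^%u KB at offset 0x%x into BAR4\n",
	       ilog2_32(sz_kb), pcieofst);
	printf("p2pmem resource: [0x%llx, 0x%llx]\n",
	       (unsigned long long)(bar4_start + pcieofst),
	       (unsigned long long)(bar4_start + pcieofst + rounded - 1));
	return 0;
}

The adapter memory base, for its part, is programmed as
MEMOFST_V(mem_base >> MEMOFST_S), i.e. with the low seven bits
cleared, so the firmware-reserved base is evidently expected to be at
least 128-byte aligned.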

* [RFC 3/8] nvmet: Use p2pmem in nvme target
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will use p2p memory
only if a p2p memory device can be found which is behind the same
switch as the RDMA port and all the block devices in use. If the
user enables it and no devices are found, the system silently falls
back to using regular memory.

If appropriate, the port will allocate the RDMA buffers for its
queues from the p2pmem device, falling back to system memory should
anything fail.

Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/nvme/target/configfs.c | 31 +++++++++++++++
 drivers/nvme/target/nvmet.h    |  1 +
 drivers/nvme/target/rdma.c     | 90 ++++++++++++++++++++++++++++++++++++++----
 3 files changed, 114 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index be8c800..e61a7f4 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -777,12 +777,43 @@ static void nvmet_port_release(struct config_item *item)
 	kfree(port);
 }
 
+#ifdef CONFIG_P2PMEM
+static ssize_t nvmet_allow_p2pmem_show(struct config_item *item, char *page)
+{
+	return sprintf(page, "%d\n", to_nvmet_port(item)->allow_p2pmem);
+}
+
+static ssize_t nvmet_allow_p2pmem_store(struct config_item *item,
+					const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	bool allow;
+	int ret;
+
+	ret = strtobool(page, &allow);
+	if (ret)
+		return ret;
+
+	down_write(&nvmet_config_sem);
+	port->allow_p2pmem = allow;
+	up_write(&nvmet_config_sem);
+
+	return count;
+}
+CONFIGFS_ATTR(nvmet_, allow_p2pmem);
+#endif
+
 static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_adrfam,
 	&nvmet_attr_addr_treq,
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
 	&nvmet_attr_addr_trtype,
+
+	#ifdef CONFIG_P2PMEM
+	&nvmet_attr_allow_p2pmem,
+	#endif
+
 	NULL,
 };
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index f7ff15f..ab67175 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -95,6 +95,7 @@ struct nvmet_port {
 	struct list_head		referrals;
 	void				*priv;
 	bool				enabled;
+	bool				allow_p2pmem;
 };
 
 static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index ecc4fe8..7fd4840 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
 #include <linux/string.h>
 #include <linux/wait.h>
 #include <linux/inet.h>
+#include <linux/p2pmem.h>
 #include <asm/unaligned.h>
 
 #include <rdma/ib_verbs.h>
@@ -64,6 +65,7 @@ struct nvmet_rdma_rsp {
 	struct rdma_rw_ctx	rw;
 
 	struct nvmet_req	req;
+	struct p2pmem_dev       *p2pmem;
 
 	u8			n_rdma;
 	u32			flags;
@@ -107,6 +109,8 @@ struct nvmet_rdma_queue {
 	int			send_queue_size;
 
 	struct list_head	queue_list;
+
+	struct p2pmem_dev	*p2pmem;
 };
 
 struct nvmet_rdma_device {
@@ -185,7 +189,8 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
 	spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
 }
 
-static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
+static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents,
+				struct p2pmem_dev *p2pmem)
 {
 	struct scatterlist *sg;
 	int count;
@@ -193,13 +198,17 @@ static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
 	if (!sgl || !nents)
 		return;
 
-	for_each_sg(sgl, sg, nents, count)
-		__free_page(sg_page(sg));
+	for_each_sg(sgl, sg, nents, count) {
+		if (p2pmem)
+			p2pmem_free_page(p2pmem, sg_page(sg));
+		else
+			__free_page(sg_page(sg));
+	}
 	kfree(sgl);
 }
 
 static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
-		u32 length)
+		u32 length, struct p2pmem_dev *p2pmem)
 {
 	struct scatterlist *sg;
 	struct page *page;
@@ -216,7 +225,11 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
 	while (length) {
 		u32 page_len = min_t(u32, length, PAGE_SIZE);
 
-		page = alloc_page(GFP_KERNEL);
+		if (p2pmem)
+			page = p2pmem_alloc_page(p2pmem);
+		else
+			page = alloc_page(GFP_KERNEL);
+
 		if (!page)
 			goto out_free_pages;
 
@@ -231,7 +244,10 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
 out_free_pages:
 	while (i > 0) {
 		i--;
-		__free_page(sg_page(&sg[i]));
+		if (p2pmem)
+			p2pmem_free_page(p2pmem, sg_page(&sg[i]));
+		else
+			__free_page(sg_page(&sg[i]));
 	}
 	kfree(sg);
 out:
@@ -484,7 +500,8 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 	}
 
 	if (rsp->req.sg != &rsp->cmd->inline_sg)
-		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt);
+		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt,
+				    rsp->p2pmem);
 
 	if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
 		nvmet_rdma_process_wr_wait_list(queue);
@@ -625,8 +642,16 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	if (!len)
 		return 0;
 
+	rsp->p2pmem = rsp->queue->p2pmem;
 	status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
-			len);
+			len, rsp->p2pmem);
+
+	if (status && rsp->p2pmem) {
+		rsp->p2pmem = NULL;
+		status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
+					      len, rsp->p2pmem);
+	}
+
 	if (status)
 		return status;
 
@@ -984,6 +1009,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
 				!queue->host_qid);
 	}
 	nvmet_rdma_free_rsps(queue);
+	p2pmem_put(queue->p2pmem);
 	ida_simple_remove(&nvmet_rdma_queue_ida, queue->idx);
 	kfree(queue);
 }
@@ -1179,6 +1205,52 @@ static int nvmet_rdma_cm_accept(struct rdma_cm_id *cm_id,
 	return ret;
 }
 
+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for our
+ * sgl lists. This requires the p2pmem device to be compatible with
+ * the backing device for every namespace this device will support.
+ * If not, we fall back on using system memory.
+ */
+static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue *queue)
+{
+	struct device **dma_devs;
+	struct nvmet_ns *ns;
+	int ndevs = 1;
+	int i = 0;
+	struct nvmet_subsys_link *s;
+
+	if (!queue->port->allow_p2pmem)
+		return;
+
+	list_for_each_entry(s, &queue->port->subsystems, entry) {
+		list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
+			ndevs++;
+		}
+	}
+
+	dma_devs = kmalloc((ndevs + 1) * sizeof(*dma_devs), GFP_KERNEL);
+	if (!dma_devs)
+		return;
+
+	dma_devs[i++] = &queue->dev->device->dev;
+
+	list_for_each_entry(s, &queue->port->subsystems, entry) {
+		list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
+			dma_devs[i++] = disk_to_dev(ns->bdev->bd_disk);
+		}
+	}
+
+	dma_devs[i++] = NULL;
+
+	queue->p2pmem = p2pmem_find_compat(dma_devs);
+
+	if (queue->p2pmem)
+		pr_debug("using %s for rdma nvme target queue",
+			 dev_name(&queue->p2pmem->dev));
+
+	kfree(dma_devs);
+}
+
 static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 		struct rdma_cm_event *event)
 {
@@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 	}
 	queue->port = cm_id->context;
 
+	nvmet_rdma_queue_setup_p2pmem(queue);
+
 	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
 	if (ret)
 		goto release_queue;
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread
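
As a companion to the commit message above, the fragment below
condenses the flow the patch spreads across
nvmet_rdma_queue_setup_p2pmem() and nvmet_rdma_alloc_sgl(). It is a
sketch, not the patch's code: it assumes the p2pmem API introduced
earlier in this series, and two placeholder devices stand in for the
real RDMA device and the namespaces' block devices.

#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/p2pmem.h>

static struct page *get_buffer_page(struct device *rdma_dev,
				    struct device *disk_dev,
				    struct p2pmem_dev **p2pmem_out)
{
	struct device *dma_devs[] = { rdma_dev, disk_dev, NULL };
	struct p2pmem_dev *p2pmem;
	struct page *page;

	/* Succeeds only if every listed device is behind the same switch */
	p2pmem = p2pmem_find_compat(dma_devs);

	page = p2pmem ? p2pmem_alloc_page(p2pmem) : NULL;
	if (!page) {
		/* No compatible device, or p2p memory exhausted */
		if (p2pmem)
			p2pmem_put(p2pmem);
		p2pmem = NULL;
		page = alloc_page(GFP_KERNEL);
	}

	*p2pmem_out = p2pmem;	/* tells the caller how to free the page */
	return page;
}

The caller would free with p2pmem_free_page() when *p2pmem_out is
non-NULL and __free_page() otherwise. Note one simplification: the
patch itself holds the p2pmem reference for the queue's lifetime and
drops it in nvmet_rdma_free_queue(), whereas the eager p2pmem_put()
above is only there to keep the sketch self-contained.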

* [RFC 3/8] nvmet: Use p2pmem in nvme target
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)


We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will use p2p memory
only if a p2p memory device can be found which is behind the same
switch as the RDMA port and all the block devices in use. If the
user enables it and no devices are found, the system silently falls
back to using regular memory.

If appropriate, the port will allocate the RDMA buffers for its
queues from the p2pmem device, falling back to system memory should
anything fail.

Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available.

Signed-off-by: Logan Gunthorpe <logang at deltatee.com>
Signed-off-by: Stephen Bates <sbates at raithlin.com>
Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
 drivers/nvme/target/configfs.c | 31 +++++++++++++++
 drivers/nvme/target/nvmet.h    |  1 +
 drivers/nvme/target/rdma.c     | 90 ++++++++++++++++++++++++++++++++++++++----
 3 files changed, 114 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index be8c800..e61a7f4 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -777,12 +777,43 @@ static void nvmet_port_release(struct config_item *item)
 	kfree(port);
 }
 
+#ifdef CONFIG_P2PMEM
+static ssize_t nvmet_allow_p2pmem_show(struct config_item *item, char *page)
+{
+	return sprintf(page, "%d\n", to_nvmet_port(item)->allow_p2pmem);
+}
+
+static ssize_t nvmet_allow_p2pmem_store(struct config_item *item,
+					const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	bool allow;
+	int ret;
+
+	ret = strtobool(page, &allow);
+	if (ret)
+		return ret;
+
+	down_write(&nvmet_config_sem);
+	port->allow_p2pmem = allow;
+	up_write(&nvmet_config_sem);
+
+	return count;
+}
+CONFIGFS_ATTR(nvmet_, allow_p2pmem);
+#endif
+
 static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_adrfam,
 	&nvmet_attr_addr_treq,
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
 	&nvmet_attr_addr_trtype,
+
+	#ifdef CONFIG_P2PMEM
+	&nvmet_attr_allow_p2pmem,
+	#endif
+
 	NULL,
 };
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index f7ff15f..ab67175 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -95,6 +95,7 @@ struct nvmet_port {
 	struct list_head		referrals;
 	void				*priv;
 	bool				enabled;
+	bool				allow_p2pmem;
 };
 
 static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index ecc4fe8..7fd4840 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -23,6 +23,7 @@
 #include <linux/string.h>
 #include <linux/wait.h>
 #include <linux/inet.h>
+#include <linux/p2pmem.h>
 #include <asm/unaligned.h>
 
 #include <rdma/ib_verbs.h>
@@ -64,6 +65,7 @@ struct nvmet_rdma_rsp {
 	struct rdma_rw_ctx	rw;
 
 	struct nvmet_req	req;
+	struct p2pmem_dev       *p2pmem;
 
 	u8			n_rdma;
 	u32			flags;
@@ -107,6 +109,8 @@ struct nvmet_rdma_queue {
 	int			send_queue_size;
 
 	struct list_head	queue_list;
+
+	struct p2pmem_dev	*p2pmem;
 };
 
 struct nvmet_rdma_device {
@@ -185,7 +189,8 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
 	spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
 }
 
-static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
+static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents,
+				struct p2pmem_dev *p2pmem)
 {
 	struct scatterlist *sg;
 	int count;
@@ -193,13 +198,17 @@ static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
 	if (!sgl || !nents)
 		return;
 
-	for_each_sg(sgl, sg, nents, count)
-		__free_page(sg_page(sg));
+	for_each_sg(sgl, sg, nents, count) {
+		if (p2pmem)
+			p2pmem_free_page(p2pmem, sg_page(sg));
+		else
+			__free_page(sg_page(sg));
+	}
 	kfree(sgl);
 }
 
 static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
-		u32 length)
+		u32 length, struct p2pmem_dev *p2pmem)
 {
 	struct scatterlist *sg;
 	struct page *page;
@@ -216,7 +225,11 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
 	while (length) {
 		u32 page_len = min_t(u32, length, PAGE_SIZE);
 
-		page = alloc_page(GFP_KERNEL);
+		if (p2pmem)
+			page = p2pmem_alloc_page(p2pmem);
+		else
+			page = alloc_page(GFP_KERNEL);
+
 		if (!page)
 			goto out_free_pages;
 
@@ -231,7 +244,10 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
 out_free_pages:
 	while (i > 0) {
 		i--;
-		__free_page(sg_page(&sg[i]));
+		if (p2pmem)
+			p2pmem_free_page(p2pmem, sg_page(&sg[i]));
+		else
+			__free_page(sg_page(&sg[i]));
 	}
 	kfree(sg);
 out:
@@ -484,7 +500,8 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 	}
 
 	if (rsp->req.sg != &rsp->cmd->inline_sg)
-		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt);
+		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt,
+				    rsp->p2pmem);
 
 	if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
 		nvmet_rdma_process_wr_wait_list(queue);
@@ -625,8 +642,16 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	if (!len)
 		return 0;
 
+	rsp->p2pmem = rsp->queue->p2pmem;
 	status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
-			len);
+			len, rsp->p2pmem);
+
+	if (status && rsp->p2pmem) {
+		rsp->p2pmem = NULL;
+		status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
+					      len, rsp->p2pmem);
+	}
+
 	if (status)
 		return status;
 
@@ -984,6 +1009,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
 				!queue->host_qid);
 	}
 	nvmet_rdma_free_rsps(queue);
+	p2pmem_put(queue->p2pmem);
 	ida_simple_remove(&nvmet_rdma_queue_ida, queue->idx);
 	kfree(queue);
 }
@@ -1179,6 +1205,52 @@ static int nvmet_rdma_cm_accept(struct rdma_cm_id *cm_id,
 	return ret;
 }
 
+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for our
+ * sgl lists. This requires the p2pmem device to be compatible with
+ * the backing device for every namespace this device will support.
+ * If not, we fall back on using system memory.
+ */
+static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue *queue)
+{
+	struct device **dma_devs;
+	struct nvmet_ns *ns;
+	int ndevs = 1;
+	int i = 0;
+	struct nvmet_subsys_link *s;
+
+	if (!queue->port->allow_p2pmem)
+		return;
+
+	list_for_each_entry(s, &queue->port->subsystems, entry) {
+		list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
+			ndevs++;
+		}
+	}
+
+	dma_devs = kmalloc((ndevs + 1) * sizeof(*dma_devs), GFP_KERNEL);
+	if (!dma_devs)
+		return;
+
+	dma_devs[i++] = &queue->dev->device->dev;
+
+	list_for_each_entry(s, &queue->port->subsystems, entry) {
+		list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
+			dma_devs[i++] = disk_to_dev(ns->bdev->bd_disk);
+		}
+	}
+
+	dma_devs[i++] = NULL;
+
+	queue->p2pmem = p2pmem_find_compat(dma_devs);
+
+	if (queue->p2pmem)
+	pr_debug("using %s for rdma nvme target queue\n",
+			 dev_name(&queue->p2pmem->dev));
+
+	kfree(dma_devs);
+}
+
 static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 		struct rdma_cm_event *event)
 {
@@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 	}
 	queue->port = cm_id->context;
 
+	nvmet_rdma_queue_setup_p2pmem(queue);
+
 	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
 	if (ret)
 		goto release_queue;
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread

* [RFC 4/8] p2pmem: Add debugfs "stats" file
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

From: Steve Wise <swise@opengridcomputing.com>

For each p2pmem instance, add a "stats" file to show
the gen_pool statistics.
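
Reading the file under debugfs shows the pool totals; for example
(device name and sizes below are illustrative only):

  $ cat /sys/kernel/debug/p2pmem/<dev>/stats
  total size: 67108864
  available:  67108864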

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
---
 drivers/memory/p2pmem.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/p2pmem.h  |  2 ++
 2 files changed, 51 insertions(+)

diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
index c4ea311..71741c2 100644
--- a/drivers/memory/p2pmem.c
+++ b/drivers/memory/p2pmem.c
@@ -18,6 +18,7 @@
 #include <linux/slab.h>
 #include <linux/genalloc.h>
 #include <linux/memremap.h>
+#include <linux/debugfs.h>
 
 MODULE_DESCRIPTION("Peer 2 Peer Memory Device");
 MODULE_VERSION("0.1");
@@ -27,6 +28,40 @@ MODULE_AUTHOR("Microsemi Corporation");
 static struct class *p2pmem_class;
 static DEFINE_IDA(p2pmem_ida);
 
+static struct dentry *p2pmem_debugfs_root;
+
+static int stats_show(struct seq_file *seq, void *v)
+{
+	struct p2pmem_dev *p = seq->private;
+
+	if (p->pool) {
+		seq_printf(seq, "total size: %lu\n", gen_pool_size(p->pool));
+		seq_printf(seq, "available:  %lu\n", gen_pool_avail(p->pool));
+	}
+	return 0;
+}
+
+static int stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, stats_show, inode->i_private);
+}
+
+static const struct file_operations stats_debugfs_fops = {
+	.owner   = THIS_MODULE,
+	.open    = stats_open,
+	.release = single_release,
+	.read	 = seq_read,
+	.llseek  = seq_lseek,
+};
+
+static void setup_debugfs(struct p2pmem_dev *p)
+{
+	struct dentry *de;
+
+	de = debugfs_create_file("stats", 0400, p->debugfs_root,
+				 (void *)p, &stats_debugfs_fops);
+}
+
 static struct p2pmem_dev *to_p2pmem(struct device *dev)
 {
 	return container_of(dev, struct p2pmem_dev, dev);
@@ -62,6 +97,8 @@ static void p2pmem_release(struct device *dev)
 {
 	struct p2pmem_dev *p = to_p2pmem(dev);
 
+	debugfs_remove_recursive(p->debugfs_root);
+
 	if (p->pool)
 		gen_pool_destroy(p->pool);
 
@@ -114,6 +151,13 @@ struct p2pmem_dev *p2pmem_create(struct device *parent)
 	if (rc)
 		goto err_id;
 
+	if (p2pmem_debugfs_root) {
+		p->debugfs_root = debugfs_create_dir(dev_name(&p->dev),
+						     p2pmem_debugfs_root);
+		if (p->debugfs_root)
+			setup_debugfs(p);
+	}
+
 	rc = device_add(&p->dev);
 	if (rc)
 		goto err_id;
@@ -390,12 +434,17 @@ static int __init p2pmem_init(void)
 	if (IS_ERR(p2pmem_class))
 		return PTR_ERR(p2pmem_class);
 
+	p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
+	if (!p2pmem_debugfs_root)
+		pr_info("could not create debugfs entry, continuing\n");
+
 	return 0;
 }
 module_init(p2pmem_init);
 
 static void __exit p2pmem_exit(void)
 {
+	debugfs_remove_recursive(p2pmem_debugfs_root);
 	class_destroy(p2pmem_class);
 
 	pr_info(KBUILD_MODNAME ": unloaded.\n");
diff --git a/include/linux/p2pmem.h b/include/linux/p2pmem.h
index 71dc1e1..4cd6f35 100644
--- a/include/linux/p2pmem.h
+++ b/include/linux/p2pmem.h
@@ -26,6 +26,8 @@ struct p2pmem_dev {
 	struct percpu_ref ref;
 	struct completion cmp;
 	struct gen_pool *pool;
+
+	struct dentry *debugfs_root;
 };
 
 #ifdef CONFIG_P2PMEM
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread

* [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

Now that we are using p2pmem SG buffers, we occasionally have to copy
to and from this memory. For this, we add an iomem flag to
sg_copy_buffer() so it copies via the iomem variants
(memcpy_toio()/memcpy_fromio()). We also add sg_iocopy_ variants to
make this easier to use.
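
As a minimal sketch of a caller (hypothetical helper, not part of this
series), an SG list whose pages map p2pmem (iomem) can be copied into
an ordinary kernel buffer via the io variant, which uses
memcpy_fromio() under the hood:

  #include <linux/scatterlist.h>

  /* sgl/nents describe p2pmem (iomem) pages; buf is regular memory */
  static size_t copy_p2p_to_buf(struct scatterlist *sgl, unsigned int nents,
                                void *buf, size_t buflen)
  {
          return sg_iocopy_to_buffer(sgl, nents, buf, buflen);
  }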

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/scsi/scsi_debug.c   |  7 ++---
 include/linux/scatterlist.h |  7 ++++-
 lib/scatterlist.c           | 64 ++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 17249c3..70c0d9f 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1309,7 +1309,7 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		int lu_id_num, port_group_id, target_dev_id, len;
 		char lu_id_str[6];
 		int host_no = devip->sdbg_host->shost->host_no;
-		
+
 		port_group_id = (((host_no + 1) & 0x7f) << 8) +
 		    (devip->channel & 0x7f);
 		if (sdebug_vpd_use_hostno == 0)
@@ -2381,14 +2381,15 @@ static int do_device_access(struct scsi_cmnd *scmd, u64 lba, u32 num,
 
 	ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 		   fake_storep + (block * sdebug_sector_size),
-		   (num - rest) * sdebug_sector_size, 0, do_write);
+		   (num - rest) * sdebug_sector_size, 0, do_write, false);
 	if (ret != (num - rest) * sdebug_sector_size)
 		return ret;
 
 	if (rest) {
 		ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 			    fake_storep, rest * sdebug_sector_size,
-			    (num - rest) * sdebug_sector_size, do_write);
+			    (num - rest) * sdebug_sector_size, do_write,
+				      false);
 	}
 
 	return ret;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..030b92b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -267,7 +267,7 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
 	gfp_t gfp_mask);
 
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer);
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem);
 
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen);
@@ -279,6 +279,11 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip);
 
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen);
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c6cf822..22abd94 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -647,7 +647,7 @@ EXPORT_SYMBOL(sg_miter_stop);
  *
  **/
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer)
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem)
 {
 	unsigned int offset = 0;
 	struct sg_mapping_iter miter;
@@ -668,10 +668,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 
 		len = min(miter.length, buflen - offset);
 
-		if (to_buffer)
-			memcpy(buf + offset, miter.addr, len);
-		else
-			memcpy(miter.addr, buf + offset, len);
+		if (iomem) {
+			if (to_buffer)
+				memcpy_fromio(buf + offset,  miter.addr, len);
+			else
+				memcpy_toio(miter.addr, buf + offset, len);
+		} else {
+			if (to_buffer)
+				memcpy(buf + offset, miter.addr, len);
+			else
+				memcpy(miter.addr, buf + offset, len);
+		}
 
 		offset += len;
 	}
@@ -695,7 +702,8 @@ EXPORT_SYMBOL(sg_copy_buffer);
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_copy_from_buffer);
 
@@ -712,7 +720,7 @@ EXPORT_SYMBOL(sg_copy_from_buffer);
 size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			 void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, false);
 }
 EXPORT_SYMBOL(sg_copy_to_buffer);
 
@@ -730,7 +738,8 @@ EXPORT_SYMBOL(sg_copy_to_buffer);
 size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			    const void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_pcopy_from_buffer);
 
@@ -748,6 +757,43 @@ EXPORT_SYMBOL(sg_pcopy_from_buffer);
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true, false);
 }
 EXPORT_SYMBOL(sg_pcopy_to_buffer);
+
+/**
+ * sg_iocopy_from_buffer - Copy from a linear buffer to an SG list containing
+ *	IO memory.
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy from
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      true);
+}
+EXPORT_SYMBOL(sg_iocopy_from_buffer);
+
+/**
+ * sg_iocopy_to_buffer - Copy from an SG list containing IO memory
+ *	to a linear buffer
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy to
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, true);
+}
+EXPORT_SYMBOL(sg_iocopy_to_buffer);
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread

* [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Now that we are using p2pmem SG buffers we occasionally have to copy
to and from this memory. For this, we add an iomem flag to
sg_copy_buffer for copying with iomemcpy. We also add the sg_iocopy_
variants to use this more easily.

Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
Signed-off-by: Stephen Bates <sbates-pv7U853sEMVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/scsi/scsi_debug.c   |  7 ++---
 include/linux/scatterlist.h |  7 ++++-
 lib/scatterlist.c           | 64 ++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 17249c3..70c0d9f 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1309,7 +1309,7 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		int lu_id_num, port_group_id, target_dev_id, len;
 		char lu_id_str[6];
 		int host_no = devip->sdbg_host->shost->host_no;
-		
+
 		port_group_id = (((host_no + 1) & 0x7f) << 8) +
 		    (devip->channel & 0x7f);
 		if (sdebug_vpd_use_hostno == 0)
@@ -2381,14 +2381,15 @@ static int do_device_access(struct scsi_cmnd *scmd, u64 lba, u32 num,
 
 	ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 		   fake_storep + (block * sdebug_sector_size),
-		   (num - rest) * sdebug_sector_size, 0, do_write);
+		   (num - rest) * sdebug_sector_size, 0, do_write, false);
 	if (ret != (num - rest) * sdebug_sector_size)
 		return ret;
 
 	if (rest) {
 		ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 			    fake_storep, rest * sdebug_sector_size,
-			    (num - rest) * sdebug_sector_size, do_write);
+			    (num - rest) * sdebug_sector_size, do_write,
+				      false);
 	}
 
 	return ret;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..030b92b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -267,7 +267,7 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
 	gfp_t gfp_mask);
 
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer);
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem);
 
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen);
@@ -279,6 +279,11 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip);
 
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen);
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c6cf822..22abd94 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -647,7 +647,7 @@ EXPORT_SYMBOL(sg_miter_stop);
  *
  **/
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer)
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem)
 {
 	unsigned int offset = 0;
 	struct sg_mapping_iter miter;
@@ -668,10 +668,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 
 		len = min(miter.length, buflen - offset);
 
-		if (to_buffer)
-			memcpy(buf + offset, miter.addr, len);
-		else
-			memcpy(miter.addr, buf + offset, len);
+		if (iomem) {
+			if (to_buffer)
+				memcpy_fromio(buf + offset,  miter.addr, len);
+			else
+				memcpy_toio(miter.addr, buf + offset, len);
+		} else {
+			if (to_buffer)
+				memcpy(buf + offset, miter.addr, len);
+			else
+				memcpy(miter.addr, buf + offset, len);
+		}
 
 		offset += len;
 	}
@@ -695,7 +702,8 @@ EXPORT_SYMBOL(sg_copy_buffer);
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_copy_from_buffer);
 
@@ -712,7 +720,7 @@ EXPORT_SYMBOL(sg_copy_from_buffer);
 size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			 void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, false);
 }
 EXPORT_SYMBOL(sg_copy_to_buffer);
 
@@ -730,7 +738,8 @@ EXPORT_SYMBOL(sg_copy_to_buffer);
 size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			    const void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_pcopy_from_buffer);
 
@@ -748,6 +757,43 @@ EXPORT_SYMBOL(sg_pcopy_from_buffer);
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true, false);
 }
 EXPORT_SYMBOL(sg_pcopy_to_buffer);
+
+/**
+ * sg_iocopy_from_buffer - Copy from a linear buffer to an SG list containing
+ *	IO memory.
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy from
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      true);
+}
+EXPORT_SYMBOL(sg_iocopy_from_buffer);
+
+/**
+ * sg_iocopy_to_buffer - Copy from an SG list containing IO memory
+ *	to a linear buffer
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy to
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, true);
+}
+EXPORT_SYMBOL(sg_iocopy_to_buffer);
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread

* [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Logan Gunthorpe

Now that we are using p2pmem SG buffers we occasionally have to copy
to and from this memory. For this, we add an iomem flag to
sg_copy_buffer for copying with iomemcpy. We also add the sg_iocopy_
variants to use this more easily.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/scsi/scsi_debug.c   |  7 ++---
 include/linux/scatterlist.h |  7 ++++-
 lib/scatterlist.c           | 64 ++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 17249c3..70c0d9f 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1309,7 +1309,7 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		int lu_id_num, port_group_id, target_dev_id, len;
 		char lu_id_str[6];
 		int host_no = devip->sdbg_host->shost->host_no;
-		
+
 		port_group_id = (((host_no + 1) & 0x7f) << 8) +
 		    (devip->channel & 0x7f);
 		if (sdebug_vpd_use_hostno == 0)
@@ -2381,14 +2381,15 @@ static int do_device_access(struct scsi_cmnd *scmd, u64 lba, u32 num,
 
 	ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 		   fake_storep + (block * sdebug_sector_size),
-		   (num - rest) * sdebug_sector_size, 0, do_write);
+		   (num - rest) * sdebug_sector_size, 0, do_write, false);
 	if (ret != (num - rest) * sdebug_sector_size)
 		return ret;
 
 	if (rest) {
 		ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 			    fake_storep, rest * sdebug_sector_size,
-			    (num - rest) * sdebug_sector_size, do_write);
+			    (num - rest) * sdebug_sector_size, do_write,
+				      false);
 	}
 
 	return ret;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..030b92b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -267,7 +267,7 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
 	gfp_t gfp_mask);
 
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer);
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem);
 
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen);
@@ -279,6 +279,11 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip);
 
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen);
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c6cf822..22abd94 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -647,7 +647,7 @@ EXPORT_SYMBOL(sg_miter_stop);
  *
  **/
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer)
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem)
 {
 	unsigned int offset = 0;
 	struct sg_mapping_iter miter;
@@ -668,10 +668,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 
 		len = min(miter.length, buflen - offset);
 
-		if (to_buffer)
-			memcpy(buf + offset, miter.addr, len);
-		else
-			memcpy(miter.addr, buf + offset, len);
+		if (iomem) {
+			if (to_buffer)
+				memcpy_fromio(buf + offset,  miter.addr, len);
+			else
+				memcpy_toio(miter.addr, buf + offset, len);
+		} else {
+			if (to_buffer)
+				memcpy(buf + offset, miter.addr, len);
+			else
+				memcpy(miter.addr, buf + offset, len);
+		}
 
 		offset += len;
 	}
@@ -695,7 +702,8 @@ EXPORT_SYMBOL(sg_copy_buffer);
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_copy_from_buffer);
 
@@ -712,7 +720,7 @@ EXPORT_SYMBOL(sg_copy_from_buffer);
 size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			 void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, false);
 }
 EXPORT_SYMBOL(sg_copy_to_buffer);
 
@@ -730,7 +738,8 @@ EXPORT_SYMBOL(sg_copy_to_buffer);
 size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			    const void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_pcopy_from_buffer);
 
@@ -748,6 +757,43 @@ EXPORT_SYMBOL(sg_pcopy_from_buffer);
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true, false);
 }
 EXPORT_SYMBOL(sg_pcopy_to_buffer);
+
+/**
+ * sg_iocopy_from_buffer - Copy from a linear buffer to an SG list containing
+ *	IO memory.
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy from
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      true);
+}
+EXPORT_SYMBOL(sg_iocopy_from_buffer);
+
+/**
+ * sg_iocopy_to_buffer - Copy from an SG list containing IO memory
+ *	to a linear buffer
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy to
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, true);
+}
+EXPORT_SYMBOL(sg_iocopy_to_buffer);
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread

* [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Logan Gunthorpe

Now that we are using p2pmem SG buffers we occasionally have to copy
to and from this memory. For this, we add an iomem flag to
sg_copy_buffer for copying with iomemcpy. We also add the sg_iocopy_
variants to use this more easily.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/scsi/scsi_debug.c   |  7 ++---
 include/linux/scatterlist.h |  7 ++++-
 lib/scatterlist.c           | 64 ++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 17249c3..70c0d9f 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1309,7 +1309,7 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		int lu_id_num, port_group_id, target_dev_id, len;
 		char lu_id_str[6];
 		int host_no = devip->sdbg_host->shost->host_no;
-		
+
 		port_group_id = (((host_no + 1) & 0x7f) << 8) +
 		    (devip->channel & 0x7f);
 		if (sdebug_vpd_use_hostno == 0)
@@ -2381,14 +2381,15 @@ static int do_device_access(struct scsi_cmnd *scmd, u64 lba, u32 num,
 
 	ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 		   fake_storep + (block * sdebug_sector_size),
-		   (num - rest) * sdebug_sector_size, 0, do_write);
+		   (num - rest) * sdebug_sector_size, 0, do_write, false);
 	if (ret != (num - rest) * sdebug_sector_size)
 		return ret;
 
 	if (rest) {
 		ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 			    fake_storep, rest * sdebug_sector_size,
-			    (num - rest) * sdebug_sector_size, do_write);
+			    (num - rest) * sdebug_sector_size, do_write,
+				      false);
 	}
 
 	return ret;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..030b92b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -267,7 +267,7 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
 	gfp_t gfp_mask);
 
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer);
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem);
 
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen);
@@ -279,6 +279,11 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip);
 
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen);
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c6cf822..22abd94 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -647,7 +647,7 @@ EXPORT_SYMBOL(sg_miter_stop);
  *
  **/
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer)
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem)
 {
 	unsigned int offset = 0;
 	struct sg_mapping_iter miter;
@@ -668,10 +668,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 
 		len = min(miter.length, buflen - offset);
 
-		if (to_buffer)
-			memcpy(buf + offset, miter.addr, len);
-		else
-			memcpy(miter.addr, buf + offset, len);
+		if (iomem) {
+			if (to_buffer)
+				memcpy_fromio(buf + offset,  miter.addr, len);
+			else
+				memcpy_toio(miter.addr, buf + offset, len);
+		} else {
+			if (to_buffer)
+				memcpy(buf + offset, miter.addr, len);
+			else
+				memcpy(miter.addr, buf + offset, len);
+		}
 
 		offset += len;
 	}
@@ -695,7 +702,8 @@ EXPORT_SYMBOL(sg_copy_buffer);
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_copy_from_buffer);
 
@@ -712,7 +720,7 @@ EXPORT_SYMBOL(sg_copy_from_buffer);
 size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			 void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, false);
 }
 EXPORT_SYMBOL(sg_copy_to_buffer);
 
@@ -730,7 +738,8 @@ EXPORT_SYMBOL(sg_copy_to_buffer);
 size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			    const void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_pcopy_from_buffer);
 
@@ -748,6 +757,43 @@ EXPORT_SYMBOL(sg_pcopy_from_buffer);
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true, false);
 }
 EXPORT_SYMBOL(sg_pcopy_to_buffer);
+
+/**
+ * sg_iocopy_from_buffer - Copy from a linear buffer to an SG list containing
+ *	IO memory.
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy from
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      true);
+}
+EXPORT_SYMBOL(sg_iocopy_from_buffer);
+
+/**
+ * sg_iocopy_to_buffer - Copy from an SG list containing IO memory
+ *	to a linear buffer
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy to
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, true);
+}
+EXPORT_SYMBOL(sg_iocopy_to_buffer);
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread

* [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)


Now that we are using p2pmem SG buffers we occasionally have to copy
to and from this memory. For this, we add an iomem flag to
sg_copy_buffer for copying with iomemcpy. We also add the sg_iocopy_
variants to use this more easily.

Signed-off-by: Logan Gunthorpe <logang at deltatee.com>
Signed-off-by: Stephen Bates <sbates at raithlin.com>
Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
 drivers/scsi/scsi_debug.c   |  7 ++---
 include/linux/scatterlist.h |  7 ++++-
 lib/scatterlist.c           | 64 ++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 17249c3..70c0d9f 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1309,7 +1309,7 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		int lu_id_num, port_group_id, target_dev_id, len;
 		char lu_id_str[6];
 		int host_no = devip->sdbg_host->shost->host_no;
-		
+
 		port_group_id = (((host_no + 1) & 0x7f) << 8) +
 		    (devip->channel & 0x7f);
 		if (sdebug_vpd_use_hostno == 0)
@@ -2381,14 +2381,15 @@ static int do_device_access(struct scsi_cmnd *scmd, u64 lba, u32 num,
 
 	ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 		   fake_storep + (block * sdebug_sector_size),
-		   (num - rest) * sdebug_sector_size, 0, do_write);
+		   (num - rest) * sdebug_sector_size, 0, do_write, false);
 	if (ret != (num - rest) * sdebug_sector_size)
 		return ret;
 
 	if (rest) {
 		ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 			    fake_storep, rest * sdebug_sector_size,
-			    (num - rest) * sdebug_sector_size, do_write);
+			    (num - rest) * sdebug_sector_size, do_write,
+				      false);
 	}
 
 	return ret;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..030b92b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -267,7 +267,7 @@ int sg_alloc_table_from_pages(struct sg_table *sgt,
 	gfp_t gfp_mask);
 
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer);
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem);
 
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen);
@@ -279,6 +279,11 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip);
 
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen);
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c6cf822..22abd94 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -647,7 +647,7 @@ EXPORT_SYMBOL(sg_miter_stop);
  *
  **/
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
-		      size_t buflen, off_t skip, bool to_buffer)
+		      size_t buflen, off_t skip, bool to_buffer, bool iomem)
 {
 	unsigned int offset = 0;
 	struct sg_mapping_iter miter;
@@ -668,10 +668,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 
 		len = min(miter.length, buflen - offset);
 
-		if (to_buffer)
-			memcpy(buf + offset, miter.addr, len);
-		else
-			memcpy(miter.addr, buf + offset, len);
+		if (iomem) {
+			if (to_buffer)
+				memcpy_fromio(buf + offset,  miter.addr, len);
+			else
+				memcpy_toio(miter.addr, buf + offset, len);
+		} else {
+			if (to_buffer)
+				memcpy(buf + offset, miter.addr, len);
+			else
+				memcpy(miter.addr, buf + offset, len);
+		}
 
 		offset += len;
 	}
@@ -695,7 +702,8 @@ EXPORT_SYMBOL(sg_copy_buffer);
 size_t sg_copy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			   const void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_copy_from_buffer);
 
@@ -712,7 +720,7 @@ EXPORT_SYMBOL(sg_copy_from_buffer);
 size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			 void *buf, size_t buflen)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, false);
 }
 EXPORT_SYMBOL(sg_copy_to_buffer);
 
@@ -730,7 +738,8 @@ EXPORT_SYMBOL(sg_copy_to_buffer);
 size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			    const void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false);
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, skip, false,
+			      false);
 }
 EXPORT_SYMBOL(sg_pcopy_from_buffer);
 
@@ -748,6 +757,43 @@ EXPORT_SYMBOL(sg_pcopy_from_buffer);
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip)
 {
-	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true);
+	return sg_copy_buffer(sgl, nents, buf, buflen, skip, true, false);
 }
 EXPORT_SYMBOL(sg_pcopy_to_buffer);
+
+/**
+ * sg_iocopy_from_buffer - Copy from a linear buffer to an SG list containing
+ *	IO memory.
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy from
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
+			     const void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, (void *)buf, buflen, 0, false,
+			      true);
+}
+EXPORT_SYMBOL(sg_iocopy_from_buffer);
+
+/**
+ * sg_iocopy_to_buffer - Copy from an SG list containing IO memory
+ *	to a linear buffer
+ * @sgl:		 The SG list
+ * @nents:		 Number of SG entries
+ * @buf:		 Where to copy to
+ * @buflen:		 The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ **/
+size_t sg_iocopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
+			   void *buf, size_t buflen)
+{
+	return sg_copy_buffer(sgl, nents, buf, buflen, 0, true, true);
+}
+EXPORT_SYMBOL(sg_iocopy_to_buffer);
-- 
2.1.4
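
As a minimal caller sketch (hypothetical; example_drain_io_sgl() is not
part of this series), the new sg_iocopy_* helpers added above are meant
to be used where the SG entries map io memory, so the copy routes
through memcpy_fromio()/memcpy_toio() instead of plain memcpy():

#include <linux/errno.h>
#include <linux/scatterlist.h>

/*
 * Sketch only: drain an SG list whose entries map io memory (e.g. a
 * PCI BAR) into an ordinary kernel buffer. sg_iocopy_to_buffer()
 * returns the number of bytes actually copied.
 */
static int example_drain_io_sgl(struct scatterlist *sgl, unsigned int nents,
				void *buf, size_t buflen)
{
	size_t copied = sg_iocopy_to_buffer(sgl, nents, buf, buflen);

	return copied == buflen ? 0 : -EIO;
}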

* [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

p2pmem will always be iomem, so whenever we access it we must use the
proper iomem accessors to read and write it.
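
As a minimal sketch (not code from this patch; example_read_connect_data()
is a made-up name), the pattern this enables is copying SGL data into a
regular buffer through the iomem-aware helper, rather than kmap()ing an
SG page that may point at io memory:

/*
 * Sketch only: nvmet_copy_from_sgl() (patched below) selects
 * memcpy_fromio() when req->p2pmem is set, so the destination struct
 * is safe to dereference afterwards.
 */
static u16 example_read_connect_data(struct nvmet_req *req,
				     struct nvmf_connect_data *d)
{
	return nvmet_copy_from_sgl(req, 0, d, sizeof(*d));
}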

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/nvme/target/core.c        | 18 ++++++++++++++++--
 drivers/nvme/target/fabrics-cmd.c | 28 +++++++++++++++-------------
 drivers/nvme/target/nvmet.h       |  1 +
 drivers/nvme/target/rdma.c        | 13 ++++++-------
 4 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 798653b..a1524d5 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -45,15 +45,29 @@ static struct nvmet_subsys *nvmet_find_get_subsys(struct nvmet_port *port,
 u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, const void *buf,
 		size_t len)
 {
-	if (sg_pcopy_from_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
+	bool iomem = req->p2pmem;
+	size_t ret;
+
+	ret = sg_copy_buffer(req->sg, req->sg_cnt, (void *)buf, len, off,
+			     false, iomem);
+
+	if (ret != len)
 		return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+
 	return 0;
 }
 
 u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf, size_t len)
 {
-	if (sg_pcopy_to_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
+	bool iomem = req->p2pmem;
+	size_t ret;
+
+	ret = sg_copy_buffer(req->sg, req->sg_cnt, buf, len, off, true,
+			     iomem);
+
+	if (ret != len)
 		return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+
 	return 0;
 }
 
diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index 8bd022af..9d966f0 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -118,11 +118,13 @@ static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
 static void nvmet_execute_admin_connect(struct nvmet_req *req)
 {
 	struct nvmf_connect_command *c = &req->cmd->connect;
-	struct nvmf_connect_data *d;
+	struct nvmf_connect_data d;
 	struct nvmet_ctrl *ctrl = NULL;
 	u16 status = 0;
 
-	d = kmap(sg_page(req->sg)) + req->sg->offset;
+	status = nvmet_copy_from_sgl(req, 0, &d, sizeof(d));
+	if (status)
+		goto out;
 
 	/* zero out initial completion result, assign values as needed */
 	req->rsp->result.u32 = 0;
@@ -134,16 +136,16 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
 		goto out;
 	}
 
-	if (unlikely(d->cntlid != cpu_to_le16(0xffff))) {
+	if (unlikely(d.cntlid != cpu_to_le16(0xffff))) {
 		pr_warn("connect attempt for invalid controller ID %#x\n",
-			d->cntlid);
+			d.cntlid);
 		status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR;
 		req->rsp->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid);
 		goto out;
 	}
 
-	status = nvmet_alloc_ctrl(d->subsysnqn, d->hostnqn, req,
-			le32_to_cpu(c->kato), &ctrl);
+	status = nvmet_alloc_ctrl(d.subsysnqn, d.hostnqn, req,
+				  le32_to_cpu(c->kato), &ctrl);
 	if (status)
 		goto out;
 
@@ -158,19 +160,20 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
 	req->rsp->result.u16 = cpu_to_le16(ctrl->cntlid);
 
 out:
-	kunmap(sg_page(req->sg));
 	nvmet_req_complete(req, status);
 }
 
 static void nvmet_execute_io_connect(struct nvmet_req *req)
 {
 	struct nvmf_connect_command *c = &req->cmd->connect;
-	struct nvmf_connect_data *d;
+	struct nvmf_connect_data d;
 	struct nvmet_ctrl *ctrl = NULL;
 	u16 qid = le16_to_cpu(c->qid);
 	u16 status = 0;
 
-	d = kmap(sg_page(req->sg)) + req->sg->offset;
+	status = nvmet_copy_from_sgl(req, 0, &d, sizeof(d));
+	if (status)
+		goto out;
 
 	/* zero out initial completion result, assign values as needed */
 	req->rsp->result.u32 = 0;
@@ -182,9 +185,9 @@ static void nvmet_execute_io_connect(struct nvmet_req *req)
 		goto out;
 	}
 
-	status = nvmet_ctrl_find_get(d->subsysnqn, d->hostnqn,
-			le16_to_cpu(d->cntlid),
-			req, &ctrl);
+	status = nvmet_ctrl_find_get(d.subsysnqn, d.hostnqn,
+				     le16_to_cpu(d.cntlid),
+				     req, &ctrl);
 	if (status)
 		goto out;
 
@@ -205,7 +208,6 @@ static void nvmet_execute_io_connect(struct nvmet_req *req)
 	pr_info("adding queue %d to ctrl %d.\n", qid, ctrl->cntlid);
 
 out:
-	kunmap(sg_page(req->sg));
 	nvmet_req_complete(req, status);
 	return;
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index ab67175..ccd79ed 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -226,6 +226,7 @@ struct nvmet_req {
 
 	void (*execute)(struct nvmet_req *req);
 	struct nvmet_fabrics_ops *ops;
+	struct p2pmem_dev       *p2pmem;
 };
 
 static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 7fd4840..abab544 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -65,7 +65,6 @@ struct nvmet_rdma_rsp {
 	struct rdma_rw_ctx	rw;
 
 	struct nvmet_req	req;
-	struct p2pmem_dev       *p2pmem;
 
 	u8			n_rdma;
 	u32			flags;
@@ -501,7 +500,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 
 	if (rsp->req.sg != &rsp->cmd->inline_sg)
 		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt,
-				    rsp->p2pmem);
+				    rsp->req.p2pmem);
 
 	if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
 		nvmet_rdma_process_wr_wait_list(queue);
@@ -642,14 +641,14 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	if (!len)
 		return 0;
 
-	rsp->p2pmem = rsp->queue->p2pmem;
+	rsp->req.p2pmem = rsp->queue->p2pmem;
 	status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
-			len, rsp->p2pmem);
+				      len, rsp->req.p2pmem);
 
-	if (status && rsp->p2pmem) {
-		rsp->p2pmem = NULL;
+	if (status && rsp->req.p2pmem) {
+		rsp->req.p2pmem = NULL;
 		status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
-					      len, rsp->p2pmem);
+					      len, rsp->req.p2pmem);
 	}
 
 	if (status)
-- 
2.1.4

* [RFC 7/8] p2pmem: Support device removal
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

This patch creates a list of callbacks so that users of the memory are
notified when a p2pmem device is going away or is already gone.

In nvmet-rdma, we disconnect any queue that is using p2p memory. The
remote side will then automatically reconnect after a couple of seconds,
and regular system memory (or a different p2pmem device) will be used
instead.
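
As a consumer-side sketch (hypothetical driver code; struct my_ctx and
the my_* functions are made-up names, only p2pmem_find_compat() and
p2pmem_put() come from this series), the callback contract looks like
this:

#include <linux/p2pmem.h>

struct my_ctx {
	struct p2pmem_dev *p2pmem;
	bool p2pmem_alive;
};

/* Runs if the p2pmem device backing our buffers is unregistered. */
static void my_p2pmem_gone(void *context)
{
	struct my_ctx *ctx = context;

	/* Stop handing out p2pmem buffers; in-flight users must drain. */
	ctx->p2pmem_alive = false;
}

static int my_setup(struct my_ctx *ctx, struct device **dma_devs)
{
	/* Registers my_p2pmem_gone(ctx) to be called on device removal. */
	ctx->p2pmem = p2pmem_find_compat(dma_devs, my_p2pmem_gone, ctx);
	if (!ctx->p2pmem)
		return -ENODEV;	/* fall back to regular system memory */

	ctx->p2pmem_alive = true;
	return 0;
}

static void my_teardown(struct my_ctx *ctx)
{
	/* Drops the device reference and unlinks the callback. */
	p2pmem_put(ctx->p2pmem, ctx);
}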

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/memory/p2pmem.c    | 75 ++++++++++++++++++++++++++++++++---
 drivers/nvme/target/rdma.c | 98 ++++++++++++++++++++++++++--------------------
 include/linux/p2pmem.h     | 19 +++++++--
 3 files changed, 140 insertions(+), 52 deletions(-)

diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
index 71741c2..499d42c 100644
--- a/drivers/memory/p2pmem.c
+++ b/drivers/memory/p2pmem.c
@@ -105,6 +105,21 @@ static void p2pmem_release(struct device *dev)
 	kfree(p);
 }
 
+struct remove_callback {
+	struct list_head list;
+	void (*callback)(void *context);
+	void *context;
+};
+
+static void p2pmem_remove(struct p2pmem_dev *p)
+{
+	struct remove_callback *remove_call, *tmp;
+
+	p->alive = false;
+	list_for_each_entry_safe(remove_call, tmp, &p->remove_list, list)
+		remove_call->callback(remove_call->context);
+}
+
 /**
  * p2pmem_create() - create a new p2pmem device
  * @parent: the parent device to create it under
@@ -123,6 +138,10 @@ struct p2pmem_dev *p2pmem_create(struct device *parent)
 		return ERR_PTR(-ENOMEM);
 
 	init_completion(&p->cmp);
+	mutex_init(&p->remove_mutex);
+	INIT_LIST_HEAD(&p->remove_list);
+	p->alive = true;
+
 	device_initialize(&p->dev);
 	p->dev.class = p2pmem_class;
 	p->dev.parent = parent;
@@ -187,6 +206,7 @@ void p2pmem_unregister(struct p2pmem_dev *p)
 
 	dev_info(&p->dev, "unregistered");
 	device_del(&p->dev);
+	p2pmem_remove(p);
 	ida_simple_remove(&p2pmem_ida, p->id);
 	put_device(&p->dev);
 }
@@ -291,6 +311,9 @@ EXPORT_SYMBOL(p2pmem_add_pci_region);
  */
 void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
 {
+	if (!p->alive)
+		return NULL;
+
 	return (void *)gen_pool_alloc(p->pool, size);
 }
 EXPORT_SYMBOL(p2pmem_alloc);
@@ -349,6 +372,9 @@ static int upstream_bridges_match(struct device *p2pmem,
 	struct pci_dev *p2p_up;
 	struct pci_dev *dma_up;
 
+	if (!to_p2pmem(p2pmem)->alive)
+		return false;
+
 	p2p_up = get_upstream_switch_port(p2pmem);
 	if (!p2p_up) {
 		dev_warn(p2pmem, "p2pmem is not behind a pci switch");
@@ -383,6 +409,8 @@ static int upstream_bridges_match(struct device *p2pmem,
  *	specified devices
  * @dma_devices: a null terminated array of device pointers which
  *	all must be compatible with the returned p2pmem device
+ * @remove_callback: this callback will be called if the p2pmem
+ *	device is removed.
  *
  * For now, we only support cases where all the devices that
  * will transfer to the p2pmem device are on the same switch.
@@ -400,9 +428,13 @@ static int upstream_bridges_match(struct device *p2pmem,
  * (use p2pmem_put to return the reference) or NULL if no compatible
  * p2pmem device is found.
  */
-struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices)
+struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices,
+				      void (*remove_callback)(void *context),
+				      void *context)
 {
 	struct device *dev;
+	struct p2pmem_dev *p;
+	struct remove_callback *remove_call;
 
 	dev = class_find_device(p2pmem_class, NULL, dma_devices,
 				upstream_bridges_match);
@@ -410,21 +442,54 @@ struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices)
 	if (!dev)
 		return NULL;
 
-	return to_p2pmem(dev);
+	p = to_p2pmem(dev);
+	mutex_lock(&p->remove_mutex);
+
+	if (!p->alive) {
+		p = NULL;
+		goto out;
+	}
+
+	remove_call = kzalloc(sizeof(*remove_call), GFP_KERNEL);
+	remove_call->callback = remove_callback;
+	remove_call->context = context;
+	INIT_LIST_HEAD(&remove_call->list);
+	list_add(&remove_call->list, &p->remove_list);
+
+out:
+	mutex_unlock(&p->remove_mutex);
+	return p;
 }
 EXPORT_SYMBOL(p2pmem_find_compat);
 
 /**
  * p2pmem_put() - decrement a p2pmem device reference
  * @p: p2pmem device to return
+ * @context: context pointer that was passed to p2pmem_find_compat
  *
  * Dereference and free (if last) the device's reference counter.
  * It's safe to pass a NULL pointer to this function.
  */
-void p2pmem_put(struct p2pmem_dev *p)
+void p2pmem_put(struct p2pmem_dev *p, void *context)
 {
-	if (p)
-		put_device(&p->dev);
+	struct remove_callback *remove_call;
+
+	if (!p)
+		return;
+
+	mutex_lock(&p->remove_mutex);
+
+	list_for_each_entry(remove_call, &p->remove_list, list) {
+		if (remove_call->context != context)
+			continue;
+
+		list_del(&remove_call->list);
+		kfree(remove_call);
+		break;
+	}
+
+	mutex_unlock(&p->remove_mutex);
+	put_device(&p->dev);
 }
 EXPORT_SYMBOL(p2pmem_put);
 
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index abab544..9ebcda6 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1008,7 +1008,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
 				!queue->host_qid);
 	}
 	nvmet_rdma_free_rsps(queue);
-	p2pmem_put(queue->p2pmem);
+	p2pmem_put(queue->p2pmem, queue);
 	ida_simple_remove(&nvmet_rdma_queue_ida, queue->idx);
 	kfree(queue);
 }
@@ -1204,6 +1204,58 @@ static int nvmet_rdma_cm_accept(struct rdma_cm_id *cm_id,
 	return ret;
 }
 
+static void __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
+{
+	bool disconnect = false;
+	unsigned long flags;
+
+	pr_debug("cm_id= %p queue->state= %d\n", queue->cm_id, queue->state);
+
+	spin_lock_irqsave(&queue->state_lock, flags);
+	switch (queue->state) {
+	case NVMET_RDMA_Q_CONNECTING:
+	case NVMET_RDMA_Q_LIVE:
+		queue->state = NVMET_RDMA_Q_DISCONNECTING;
+	case NVMET_RDMA_IN_DEVICE_REMOVAL:
+		disconnect = true;
+		break;
+	case NVMET_RDMA_Q_DISCONNECTING:
+		break;
+	}
+	spin_unlock_irqrestore(&queue->state_lock, flags);
+
+	if (disconnect) {
+		rdma_disconnect(queue->cm_id);
+		schedule_work(&queue->release_work);
+	}
+}
+
+static void nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
+{
+	bool disconnect = false;
+
+	mutex_lock(&nvmet_rdma_queue_mutex);
+	if (!list_empty(&queue->queue_list)) {
+		list_del_init(&queue->queue_list);
+		disconnect = true;
+	}
+	mutex_unlock(&nvmet_rdma_queue_mutex);
+
+	if (disconnect)
+		__nvmet_rdma_queue_disconnect(queue);
+}
+
+static void nvmet_rdma_p2pmem_remove(void *context)
+{
+	struct nvmet_rdma_queue *queue = context;
+
+	if (!queue->p2pmem)
+		return;
+
+	nvmet_rdma_queue_disconnect(queue);
+	flush_scheduled_work();
+}
+
 /*
  * If allow_p2pmem is set, we will try to use P2P memory for our
  * sgl lists. This requires the p2pmem device to be compatible with
@@ -1241,7 +1293,8 @@ static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue *queue)
 
 	dma_devs[i++] = NULL;
 
-	queue->p2pmem = p2pmem_find_compat(dma_devs);
+	queue->p2pmem = p2pmem_find_compat(dma_devs, nvmet_rdma_p2pmem_remove,
+					   queue);
 
 	if (queue->p2pmem)
 		pr_debug("using %s for rdma nvme target queue",
@@ -1317,47 +1370,6 @@ static void nvmet_rdma_queue_established(struct nvmet_rdma_queue *queue)
 	spin_unlock_irqrestore(&queue->state_lock, flags);
 }
 
-static void __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
-{
-	bool disconnect = false;
-	unsigned long flags;
-
-	pr_debug("cm_id= %p queue->state= %d\n", queue->cm_id, queue->state);
-
-	spin_lock_irqsave(&queue->state_lock, flags);
-	switch (queue->state) {
-	case NVMET_RDMA_Q_CONNECTING:
-	case NVMET_RDMA_Q_LIVE:
-		queue->state = NVMET_RDMA_Q_DISCONNECTING;
-	case NVMET_RDMA_IN_DEVICE_REMOVAL:
-		disconnect = true;
-		break;
-	case NVMET_RDMA_Q_DISCONNECTING:
-		break;
-	}
-	spin_unlock_irqrestore(&queue->state_lock, flags);
-
-	if (disconnect) {
-		rdma_disconnect(queue->cm_id);
-		schedule_work(&queue->release_work);
-	}
-}
-
-static void nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
-{
-	bool disconnect = false;
-
-	mutex_lock(&nvmet_rdma_queue_mutex);
-	if (!list_empty(&queue->queue_list)) {
-		list_del_init(&queue->queue_list);
-		disconnect = true;
-	}
-	mutex_unlock(&nvmet_rdma_queue_mutex);
-
-	if (disconnect)
-		__nvmet_rdma_queue_disconnect(queue);
-}
-
 static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
 		struct nvmet_rdma_queue *queue)
 {
diff --git a/include/linux/p2pmem.h b/include/linux/p2pmem.h
index 4cd6f35..9365b02 100644
--- a/include/linux/p2pmem.h
+++ b/include/linux/p2pmem.h
@@ -22,12 +22,16 @@
 struct p2pmem_dev {
 	struct device dev;
 	int id;
+	bool alive;
 
 	struct percpu_ref ref;
 	struct completion cmp;
 	struct gen_pool *pool;
 
 	struct dentry *debugfs_root;
+
+	struct mutex remove_mutex;	/* protects the remove callback list */
+	struct list_head remove_list;
 };
 
 #ifdef CONFIG_P2PMEM
@@ -41,8 +45,12 @@ int p2pmem_add_pci_region(struct p2pmem_dev *p, struct pci_dev *pdev, int bar);
 void *p2pmem_alloc(struct p2pmem_dev *p, size_t size);
 void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size);
 
-struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices);
-void p2pmem_put(struct p2pmem_dev *p);
+struct p2pmem_dev *
+p2pmem_find_compat(struct device **dma_devices,
+		   void (*remove_callback)(void *context),
+		   void *context);
+
+void p2pmem_put(struct p2pmem_dev *p, void *context);
 
 #else
 
@@ -76,12 +84,15 @@ static inline void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size)
 {
 }
 
-static inline struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devs)
+static inline struct p2pmem_dev *
+p2pmem_find_compat(struct device **dma_devices,
+		   void (*remove_callback)(void *context),
+		   void *context)
 {
 	return NULL;
 }
 
-static inline void p2pmem_put(struct p2pmem_dev *p)
+static inline void p2pmem_put(struct p2pmem_dev *p, void *context)
 {
 }
 
-- 
2.1.4

* [RFC 7/8] p2pmem: Support device removal
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

This patch creates a list of callbacks to notify users of this memory
that the p2pmem device is going away or gone.

In nvmet-rdma, we disconnect any queue using p2p memory.
The remote side will then automatically reconnect in a
couple seconds and regular system memory (or a different p2pmem device)
will be used.

Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
Signed-off-by: Stephen Bates <sbates-pv7U853sEMVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/memory/p2pmem.c    | 75 ++++++++++++++++++++++++++++++++---
 drivers/nvme/target/rdma.c | 98 ++++++++++++++++++++++++++--------------------
 include/linux/p2pmem.h     | 19 +++++++--
 3 files changed, 140 insertions(+), 52 deletions(-)

diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
index 71741c2..499d42c 100644
--- a/drivers/memory/p2pmem.c
+++ b/drivers/memory/p2pmem.c
@@ -105,6 +105,21 @@ static void p2pmem_release(struct device *dev)
 	kfree(p);
 }
 
+struct remove_callback {
+	struct list_head list;
+	void (*callback)(void *context);
+	void *context;
+};
+
+static void p2pmem_remove(struct p2pmem_dev *p)
+{
+	struct remove_callback *remove_call, *tmp;
+
+	p->alive = false;
+	list_for_each_entry_safe(remove_call, tmp, &p->remove_list, list)
+		remove_call->callback(remove_call->context);
+}
+
 /**
  * p2pmem_create() - create a new p2pmem device
  * @parent: the parent device to create it under
@@ -123,6 +138,10 @@ struct p2pmem_dev *p2pmem_create(struct device *parent)
 		return ERR_PTR(-ENOMEM);
 
 	init_completion(&p->cmp);
+	mutex_init(&p->remove_mutex);
+	INIT_LIST_HEAD(&p->remove_list);
+	p->alive = true;
+
 	device_initialize(&p->dev);
 	p->dev.class = p2pmem_class;
 	p->dev.parent = parent;
@@ -187,6 +206,7 @@ void p2pmem_unregister(struct p2pmem_dev *p)
 
 	dev_info(&p->dev, "unregistered");
 	device_del(&p->dev);
+	p2pmem_remove(p);
 	ida_simple_remove(&p2pmem_ida, p->id);
 	put_device(&p->dev);
 }
@@ -291,6 +311,9 @@ EXPORT_SYMBOL(p2pmem_add_pci_region);
  */
 void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
 {
+	if (!p->alive)
+		return NULL;
+
 	return (void *)gen_pool_alloc(p->pool, size);
 }
 EXPORT_SYMBOL(p2pmem_alloc);
@@ -349,6 +372,9 @@ static int upstream_bridges_match(struct device *p2pmem,
 	struct pci_dev *p2p_up;
 	struct pci_dev *dma_up;
 
+	if (!to_p2pmem(p2pmem)->alive)
+		return false;
+
 	p2p_up = get_upstream_switch_port(p2pmem);
 	if (!p2p_up) {
 		dev_warn(p2pmem, "p2pmem is not behind a pci switch");
@@ -383,6 +409,8 @@ static int upstream_bridges_match(struct device *p2pmem,
  *	specified devices
  * @dma_devices: a null terminated array of device pointers which
  *	all must be compatible with the returned p2pmem device
+ * @remove_callback: this callback will be called if the p2pmem
+ *	device is removed.
  *
  * For now, we only support cases where all the devices that
  * will transfer to the p2pmem device are on the same switch.
@@ -400,9 +428,13 @@ static int upstream_bridges_match(struct device *p2pmem,
  * (use p2pmem_put to return the reference) or NULL if no compatible
  * p2pmem device is found.
  */
-struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices)
+struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices,
+				      void (*remove_callback)(void *context),
+				      void *context)
 {
 	struct device *dev;
+	struct p2pmem_dev *p;
+	struct remove_callback *remove_call;
 
 	dev = class_find_device(p2pmem_class, NULL, dma_devices,
 				upstream_bridges_match);
@@ -410,21 +442,54 @@ struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices)
 	if (!dev)
 		return NULL;
 
-	return to_p2pmem(dev);
+	p = to_p2pmem(dev);
+	mutex_lock(&p->remove_mutex);
+
+	if (!p->alive) {
+		p = NULL;
+		goto out;
+	}
+
+	remove_call = kzalloc(sizeof(*remove_call), GFP_KERNEL);
+	remove_call->callback = remove_callback;
+	remove_call->context = context;
+	INIT_LIST_HEAD(&remove_call->list);
+	list_add(&remove_call->list, &p->remove_list);
+
+out:
+	mutex_unlock(&p->remove_mutex);
+	return p;
 }
 EXPORT_SYMBOL(p2pmem_find_compat);
 
 /**
  * p2pmem_put() - decrement a p2pmem device reference
  * @p: p2pmem device to return
+ * @data: data pointer that was passed to p2pmem_find_compat
  *
  * Dereference and free (if last) the device's reference counter.
  * It's safe to pass a NULL pointer to this function.
  */
-void p2pmem_put(struct p2pmem_dev *p)
+void p2pmem_put(struct p2pmem_dev *p, void *context)
 {
-	if (p)
-		put_device(&p->dev);
+	struct remove_callback *remove_call;
+
+	if (!p)
+		return;
+
+	mutex_lock(&p->remove_mutex);
+
+	list_for_each_entry(remove_call, &p->remove_list, list) {
+		if (remove_call->context != context)
+			continue;
+
+		list_del(&remove_call->list);
+		kfree(remove_call);
+		break;
+	}
+
+	mutex_unlock(&p->remove_mutex);
+	put_device(&p->dev);
 }
 EXPORT_SYMBOL(p2pmem_put);
 
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index abab544..9ebcda6 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1008,7 +1008,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
 				!queue->host_qid);
 	}
 	nvmet_rdma_free_rsps(queue);
-	p2pmem_put(queue->p2pmem);
+	p2pmem_put(queue->p2pmem, queue);
 	ida_simple_remove(&nvmet_rdma_queue_ida, queue->idx);
 	kfree(queue);
 }
@@ -1204,6 +1204,58 @@ static int nvmet_rdma_cm_accept(struct rdma_cm_id *cm_id,
 	return ret;
 }
 
+static void __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
+{
+	bool disconnect = false;
+	unsigned long flags;
+
+	pr_debug("cm_id= %p queue->state= %d\n", queue->cm_id, queue->state);
+
+	spin_lock_irqsave(&queue->state_lock, flags);
+	switch (queue->state) {
+	case NVMET_RDMA_Q_CONNECTING:
+	case NVMET_RDMA_Q_LIVE:
+		queue->state = NVMET_RDMA_Q_DISCONNECTING;
+	case NVMET_RDMA_IN_DEVICE_REMOVAL:
+		disconnect = true;
+		break;
+	case NVMET_RDMA_Q_DISCONNECTING:
+		break;
+	}
+	spin_unlock_irqrestore(&queue->state_lock, flags);
+
+	if (disconnect) {
+		rdma_disconnect(queue->cm_id);
+		schedule_work(&queue->release_work);
+	}
+}
+
+static void nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
+{
+	bool disconnect = false;
+
+	mutex_lock(&nvmet_rdma_queue_mutex);
+	if (!list_empty(&queue->queue_list)) {
+		list_del_init(&queue->queue_list);
+		disconnect = true;
+	}
+	mutex_unlock(&nvmet_rdma_queue_mutex);
+
+	if (disconnect)
+		__nvmet_rdma_queue_disconnect(queue);
+}
+
+static void nvmet_rdma_p2pmem_remove(void *context)
+{
+	struct nvmet_rdma_queue *queue = context;
+
+	if (!queue->p2pmem)
+		return;
+
+	nvmet_rdma_queue_disconnect(queue);
+	flush_scheduled_work();
+}
+
 /*
  * If allow_p2pmem is set, we will try to use P2P memory for our
  * sgl lists. This requires the p2pmem device to be compatible with
@@ -1241,7 +1293,8 @@ static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue *queue)
 
 	dma_devs[i++] = NULL;
 
-	queue->p2pmem = p2pmem_find_compat(dma_devs);
+	queue->p2pmem = p2pmem_find_compat(dma_devs, nvmet_rdma_p2pmem_remove,
+					   queue);
 
 	if (queue->p2pmem)
 		pr_debug("using %s for rdma nvme target queue",
@@ -1317,47 +1370,6 @@ static void nvmet_rdma_queue_established(struct nvmet_rdma_queue *queue)
 	spin_unlock_irqrestore(&queue->state_lock, flags);
 }
 
-static void __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
-{
-	bool disconnect = false;
-	unsigned long flags;
-
-	pr_debug("cm_id= %p queue->state= %d\n", queue->cm_id, queue->state);
-
-	spin_lock_irqsave(&queue->state_lock, flags);
-	switch (queue->state) {
-	case NVMET_RDMA_Q_CONNECTING:
-	case NVMET_RDMA_Q_LIVE:
-		queue->state = NVMET_RDMA_Q_DISCONNECTING;
-	case NVMET_RDMA_IN_DEVICE_REMOVAL:
-		disconnect = true;
-		break;
-	case NVMET_RDMA_Q_DISCONNECTING:
-		break;
-	}
-	spin_unlock_irqrestore(&queue->state_lock, flags);
-
-	if (disconnect) {
-		rdma_disconnect(queue->cm_id);
-		schedule_work(&queue->release_work);
-	}
-}
-
-static void nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
-{
-	bool disconnect = false;
-
-	mutex_lock(&nvmet_rdma_queue_mutex);
-	if (!list_empty(&queue->queue_list)) {
-		list_del_init(&queue->queue_list);
-		disconnect = true;
-	}
-	mutex_unlock(&nvmet_rdma_queue_mutex);
-
-	if (disconnect)
-		__nvmet_rdma_queue_disconnect(queue);
-}
-
 static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
 		struct nvmet_rdma_queue *queue)
 {
diff --git a/include/linux/p2pmem.h b/include/linux/p2pmem.h
index 4cd6f35..9365b02 100644
--- a/include/linux/p2pmem.h
+++ b/include/linux/p2pmem.h
@@ -22,12 +22,16 @@
 struct p2pmem_dev {
 	struct device dev;
 	int id;
+	bool alive;
 
 	struct percpu_ref ref;
 	struct completion cmp;
 	struct gen_pool *pool;
 
 	struct dentry *debugfs_root;
+
+	struct mutex remove_mutex;	/* protects the remove callback list */
+	struct list_head remove_list;
 };
 
 #ifdef CONFIG_P2PMEM
@@ -41,8 +45,12 @@ int p2pmem_add_pci_region(struct p2pmem_dev *p, struct pci_dev *pdev, int bar);
 void *p2pmem_alloc(struct p2pmem_dev *p, size_t size);
 void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size);
 
-struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices);
-void p2pmem_put(struct p2pmem_dev *p);
+struct p2pmem_dev *
+p2pmem_find_compat(struct device **dma_devices,
+		   void (*unregister_callback)(void *context),
+		   void *context);
+
+void p2pmem_put(struct p2pmem_dev *p, void *context);
 
 #else
 
@@ -76,12 +84,15 @@ static inline void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size)
 {
 }
 
-static inline struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devs)
+static inline struct p2pmem_dev *
+p2pmem_find_compat(struct device **dma_devices,
+		   void (*unregister_callback)(void *context),
+		   void *context)
 {
 	return NULL;
 }
 
-static inline void p2pmem_put(struct p2pmem_dev *p)
+static inline void p2pmem_put(struct p2pmem_dev *p, void *context)
 {
 }
 
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread

* [RFC 8/8] p2pmem: Added char device user interface
@ 2017-03-30 22:12   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

This creates a userspace interface for p2p memory. A user can mmap
the p2pmem char device to get buffers from the corresponding device,
which allows p2p memory to be used with existing interfaces like
RDMA and O_DIRECT.

This patch is more controversial because exposing these interfaces
to userspace deserves further consideration. However, it is _very_
useful for experimenting with p2p memory.

For example, with this patch, you can test with commands like:

ib_write_bw -R --mmap=/dev/p2pmem0 -D 30

or use an fio script like:

[rdma-server]
rw=read
mem=mmapshared:/dev/p2pmem0
ioengine=rdma
port=14242
bs=64k
size=10G
iodepth=2

which would test the bandwidth of RDMA to/from the specified p2p memory.
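
As a more complete illustration, a minimal (hypothetical) userspace
program using this interface; only the /dev/p2pmem0 node and the
shared-mapping requirement come from this patch:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	void *buf;
	int fd;

	fd = open("/dev/p2pmem0", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Private mappings are rejected, so MAP_SHARED is required. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 0, 4096);	/* first touch faults in a p2pmem page */

	munmap(buf, len);
	close(fd);
	return 0;
}

The resulting buffer can then be registered as an RDMA memory region
or used as an O_DIRECT target, which is what the ib_write_bw and fio
invocations above do internally.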

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/memory/p2pmem.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/p2pmem.h  |   4 ++
 2 files changed, 186 insertions(+), 2 deletions(-)

diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
index 499d42c..129c49c 100644
--- a/drivers/memory/p2pmem.c
+++ b/drivers/memory/p2pmem.c
@@ -19,14 +19,20 @@
 #include <linux/genalloc.h>
 #include <linux/memremap.h>
 #include <linux/debugfs.h>
+#include <linux/pfn_t.h>
 
 MODULE_DESCRIPTION("Peer 2 Peer Memory Device");
 MODULE_VERSION("0.1");
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Microsemi Corporation");
 
+static int max_devices = 16;
+module_param(max_devices, int, 0444);
+MODULE_PARM_DESC(max_devices, "Maximum number of char devices");
+
 static struct class *p2pmem_class;
 static DEFINE_IDA(p2pmem_ida);
+static dev_t p2pmem_devt;
 
 static struct dentry *p2pmem_debugfs_root;
 
@@ -67,6 +73,144 @@ static struct p2pmem_dev *to_p2pmem(struct device *dev)
 	return container_of(dev, struct p2pmem_dev, dev);
 }
 
+struct p2pmem_vma {
+	struct p2pmem_dev *p2pmem_dev;
+	atomic_t mmap_count;
+	size_t nr_pages;
+
+	/* Protects the used_pages array */
+	struct mutex mutex;
+	struct page *used_pages[];
+};
+
+static void p2pmem_vma_open(struct vm_area_struct *vma)
+{
+	struct p2pmem_vma *pv = vma->vm_private_data;
+
+	atomic_inc(&pv->mmap_count);
+}
+
+static void p2pmem_vma_free_pages(struct vm_area_struct *vma)
+{
+	int i;
+	struct p2pmem_vma *pv = vma->vm_private_data;
+
+	mutex_lock(&pv->mutex);
+
+	for (i = 0; i < pv->nr_pages; i++) {
+		if (pv->used_pages[i]) {
+			p2pmem_free_page(pv->p2pmem_dev, pv->used_pages[i]);
+			pv->used_pages[i] = NULL;
+		}
+	}
+
+	mutex_unlock(&pv->mutex);
+}
+
+static void p2pmem_vma_close(struct vm_area_struct *vma)
+{
+	struct p2pmem_vma *pv = vma->vm_private_data;
+
+	if (!atomic_dec_and_test(&pv->mmap_count))
+		return;
+
+	p2pmem_vma_free_pages(vma);
+
+	dev_dbg(&pv->p2pmem_dev->dev, "vma close");
+	kfree(pv);
+}
+
+static int p2pmem_vma_fault(struct vm_fault *vmf)
+{
+	struct p2pmem_vma *pv = vmf->vma->vm_private_data;
+	unsigned int pg_idx;
+	struct page *pg;
+	pfn_t pfn;
+	int rc;
+
+	if (!pv->p2pmem_dev->alive)
+		return VM_FAULT_SIGBUS;
+
+	pg_idx = (vmf->address - vmf->vma->vm_start) / PAGE_SIZE;
+
+	mutex_lock(&pv->mutex);
+
+	if (pv->used_pages[pg_idx])
+		pg = pv->used_pages[pg_idx];
+	else
+		pg = p2pmem_alloc_page(pv->p2pmem_dev);
+	if (!pg) {
+		mutex_unlock(&pv->mutex);
+		return VM_FAULT_OOM;
+	}
+	pv->used_pages[pg_idx] = pg;
+
+	pfn = phys_to_pfn_t(page_to_phys(pg), PFN_DEV | PFN_MAP);
+	rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
+
+	mutex_unlock(&pv->mutex);
+
+	if (rc == -ENOMEM)
+		return VM_FAULT_OOM;
+	if (rc < 0 && rc != -EBUSY)
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct p2pmem_vmops = {
+	.open = p2pmem_vma_open,
+	.close = p2pmem_vma_close,
+	.fault = p2pmem_vma_fault,
+};
+
+static int p2pmem_open(struct inode *inode, struct file *filp)
+{
+	struct p2pmem_dev *p;
+
+	p = container_of(inode->i_cdev, struct p2pmem_dev, cdev);
+	filp->private_data = p;
+	p->inode = inode;
+
+	return 0;
+}
+
+static int p2pmem_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct p2pmem_dev *p = filp->private_data;
+	struct p2pmem_vma *pv;
+	size_t nr_pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE;
+
+	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+		dev_warn(&p->dev, "mmap failed: can't create private mapping\n");
+		return -EINVAL;
+	}
+
+	dev_dbg(&p->dev, "Allocating mmap with %zd pages.\n", nr_pages);
+
+	pv = kzalloc(sizeof(*pv) + sizeof(pv->used_pages[0]) * nr_pages,
+		     GFP_KERNEL);
+	if (!pv)
+		return -ENOMEM;
+
+	mutex_init(&pv->mutex);
+	pv->nr_pages = nr_pages;
+	pv->p2pmem_dev = p;
+	atomic_set(&pv->mmap_count, 1);
+
+	vma->vm_private_data = pv;
+	vma->vm_ops = &p2pmem_vmops;
+	vma->vm_flags |= VM_MIXEDMAP;
+
+	return 0;
+}
+
+static const struct file_operations p2pmem_fops = {
+	.owner = THIS_MODULE,
+	.open = p2pmem_open,
+	.mmap = p2pmem_mmap,
+};
+
 static void p2pmem_percpu_release(struct percpu_ref *ref)
 {
 	struct p2pmem_dev *p = container_of(ref, struct p2pmem_dev, ref);
@@ -114,10 +258,23 @@ struct remove_callback {
 static void p2pmem_remove(struct p2pmem_dev *p)
 {
 	struct remove_callback *remove_call, *tmp;
+	struct vm_area_struct *vma;
 
 	p->alive = false;
 	list_for_each_entry_safe(remove_call, tmp, &p->remove_list, list)
 		remove_call->callback(remove_call->context);
+
+	if (!p->inode)
+		return;
+
+	unmap_mapping_range(p->inode->i_mapping, 0, 0, 1);
+
+	i_mmap_lock_write(p->inode->i_mapping);
+	vma_interval_tree_foreach(vma, &p->inode->i_mapping->i_mmap, 0,
+				  ULONG_MAX) {
+		p2pmem_vma_free_pages(vma);
+	}
+	i_mmap_unlock_write(p->inode->i_mapping);
 }
 
 /**
@@ -147,6 +304,10 @@ struct p2pmem_dev *p2pmem_create(struct device *parent)
 	p->dev.parent = parent;
 	p->dev.release = p2pmem_release;
 
+	cdev_init(&p->cdev, &p2pmem_fops);
+	p->cdev.owner = THIS_MODULE;
+	p->cdev.kobj.parent = &p->dev.kobj;
+
 	p->id = ida_simple_get(&p2pmem_ida, 0, 0, GFP_KERNEL);
 	if (p->id < 0) {
 		rc = p->id;
@@ -154,6 +315,7 @@ struct p2pmem_dev *p2pmem_create(struct device *parent)
 	}
 
 	dev_set_name(&p->dev, "p2pmem%d", p->id);
+	p->dev.devt = MKDEV(MAJOR(p2pmem_devt), p->id);
 
 	p->pool = gen_pool_create(PAGE_SHIFT, nid);
 	if (!p->pool) {
@@ -177,14 +339,20 @@ struct p2pmem_dev *p2pmem_create(struct device *parent)
 			setup_debugfs(p);
 	}
 
-	rc = device_add(&p->dev);
+	rc = cdev_add(&p->cdev, p->dev.devt, 1);
 	if (rc)
 		goto err_id;
 
-	dev_info(&p->dev, "registered");
+	rc = device_add(&p->dev);
+	if (rc)
+		goto err_cdev;
 
+	dev_info(&p->dev, "registered");
 	return p;
 
+err_cdev:
+	cdev_del(&p->cdev);
+	p2pmem_remove(p);
 err_id:
 	ida_simple_remove(&p2pmem_ida, p->id);
 err_free:
@@ -206,6 +374,7 @@ void p2pmem_unregister(struct p2pmem_dev *p)
 
 	dev_info(&p->dev, "unregistered");
 	device_del(&p->dev);
+	cdev_del(&p->cdev);
 	p2pmem_remove(p);
 	ida_simple_remove(&p2pmem_ida, p->id);
 	put_device(&p->dev);
@@ -495,21 +664,32 @@ EXPORT_SYMBOL(p2pmem_put);
 
 static int __init p2pmem_init(void)
 {
+	int rc;
+
 	p2pmem_class = class_create(THIS_MODULE, "p2pmem");
 	if (IS_ERR(p2pmem_class))
 		return PTR_ERR(p2pmem_class);
 
+	rc = alloc_chrdev_region(&p2pmem_devt, 0, max_devices, "p2pmem");
+	if (rc)
+		goto err_chrdev;
+
 	p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
 	if (!p2pmem_debugfs_root)
 		pr_info("could not create debugfs entry, continuing\n");
 
 	return 0;
+
+err_chrdev:
+	class_destroy(p2pmem_class);
+	return rc;
 }
 module_init(p2pmem_init);
 
 static void __exit p2pmem_exit(void)
 {
 	debugfs_remove_recursive(p2pmem_debugfs_root);
+	unregister_chrdev_region(p2pmem_devt, max_devices);
 	class_destroy(p2pmem_class);
 
 	pr_info(KBUILD_MODNAME ": unloaded.\n");
diff --git a/include/linux/p2pmem.h b/include/linux/p2pmem.h
index 9365b02..aeee60d 100644
--- a/include/linux/p2pmem.h
+++ b/include/linux/p2pmem.h
@@ -18,6 +18,7 @@
 
 #include <linux/device.h>
 #include <linux/pci.h>
+#include <linux/cdev.h>
 
 struct p2pmem_dev {
 	struct device dev;
@@ -32,6 +33,9 @@ struct p2pmem_dev {
 
 	struct mutex remove_mutex;	/* protects the remove callback list */
 	struct list_head remove_list;
+
+	struct cdev cdev;
+	struct inode *inode;
 };
 
 #ifdef CONFIG_P2PMEM
-- 
2.1.4

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-03-31  7:09     ` Christoph Hellwig
  0 siblings, 0 replies; 545+ messages in thread
From: Christoph Hellwig @ 2017-03-31  7:09 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Jason Gunthorpe, James E.J. Bottomley,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch, linux-scsi,
	Max Gurtovoy, Christoph Hellwig

You're calling memcpy_{to,from}_iomem on non-__iomem pointers.  This
is a fundamental no-go as we keep I/O memory separate from kernel
pointers.
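
To make the point concrete, a minimal sketch (not from the series): BAR
mappings carry the __iomem annotation and must go through the io-aware
copy helpers; plain memcpy() on such a pointer is exactly the
type-punning that sparse exists to catch.

#include <linux/io.h>

/* Illustration only: copy a buffer into PCI BAR space. */
static void fill_bar(void __iomem *bar, const void *src, size_t len)
{
	memcpy_toio(bar, src, len);	/* correct: iomem-aware copy */

	/*
	 * memcpy((void __force *)bar, src, len) may happen to work on
	 * x86, but the __force cast defeats the address-space checking
	 * described above.
	 */
}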

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-03-31 15:41       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-31 15:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jason Gunthorpe, James E.J. Bottomley,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch, linux-scsi,
	Max Gurtovoy, Christoph Hellwig



On 31/03/17 01:09 AM, Christoph Hellwig wrote:
> You're calling memcpy_{to,from}_iomem on non-__iomem pointers.  This
> is a fundamental no-go as we keep I/O memory separate from kernel
> pointers.

Yes, that's true; however, I don't know how we could get around that when
the iomem is referenced by struct pages inside a scatter-gather list. Do
we now need special __iomem sgls? And even then, I'm not sure how that
could work when the nvme target code is using the same sgls to sometimes
point to iomem and sometimes to regular memory.

I'm certainly open to suggestions, though.
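
One direction this could take, as a rough sketch only: tag each entry
and pick the copy routine per element. sg_is_iomem() below is purely
hypothetical; where such a flag would actually live is the open
question.

#include <linux/scatterlist.h>
#include <linux/string.h>
#include <linux/io.h>

/* Hypothetical predicate: does this sg entry reference I/O memory? */
static bool sg_is_iomem(struct scatterlist *sg);

static void sg_fill_elem(struct scatterlist *sg, const void *buf,
			 size_t len)
{
	if (sg_is_iomem(sg))
		/* The __force cast is the ugliness under debate. */
		memcpy_toio((void __force __iomem *)sg_virt(sg), buf, len);
	else
		memcpy(sg_virt(sg), buf, len);
}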

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-03-31 18:49     ` Sinan Kaya
  0 siblings, 0 replies; 545+ messages in thread
From: Sinan Kaya @ 2017-03-31 18:49 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

Hi Logan,

> +/**
> + * p2pmem_unregister() - unregister a p2pmem device
> + * @p: the device to unregister
> + *
> + * The device will remain until all users are done with it
> + */
> +void p2pmem_unregister(struct p2pmem_dev *p)
> +{
> +	if (!p)
> +		return;
> +
> +	dev_info(&p->dev, "unregistered");
> +	device_del(&p->dev);
> +	ida_simple_remove(&p2pmem_ida, p->id);

Don't you need to clean up p->pool here?

> +	put_device(&p->dev);
> +}
> +EXPORT_SYMBOL(p2pmem_unregister);
> +

I don't like the ugliness around the switch port to be honest. 

Going to whitelist/blacklist looks simpler in my opinion.
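
On the p->pool question, a sketch of one possible teardown point,
assuming all allocations have been returned (gen_pool_destroy() BUG()s
on outstanding allocations); this is not the posted code:

/* Sketch: tear down the genalloc pool from the release callback. */
static void p2pmem_release(struct device *dev)
{
	struct p2pmem_dev *p = to_p2pmem(dev);

	if (p->pool)
		gen_pool_destroy(p->pool);
	kfree(p);
}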

Sinan


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-03-31 21:23       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-31 21:23 UTC (permalink / raw)
  To: Sinan Kaya, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme



On 31/03/17 12:49 PM, Sinan Kaya wrote:
> Don't you need to clean up p->pool here?

See Patch 7 in the series.

>> +	put_device(&p->dev);
>> +}
>> +EXPORT_SYMBOL(p2pmem_unregister);
>> +
> 
> I don't like the ugliness around the switch port to be honest. 
> 
> Going to whitelist/blacklist looks simpler in my opinion.

What exactly would you white/black list? It can't be the NIC or the
disk. If it's going to be a white/black list on the switch or root port
then you'd need essentially the same code to ensure they are all behind
the same switch or root port. So you could add a white/black list on top
of the current scheme but you couldn't get rid of it.

Our original plan was to just punt the decision to userspace but we had
pushback on that at LSF.

Thanks,

Logan


^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-03-31 21:38         ` Sinan Kaya
  0 siblings, 0 replies; 545+ messages in thread
From: Sinan Kaya @ 2017-03-31 21:38 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

On 3/31/2017 5:23 PM, Logan Gunthorpe wrote:
> What exactly would you white/black list? It can't be the NIC or the
> disk. If it's going to be a white/black list on the switch or root port
> then you'd need essentially the same code to ensure they are all behind
> the same switch or root port.

What is so special about being connected to the same switch?

Why don't we allow the feature by default and blacklist the root ports
that don't work via a quirk?

I'm looking at this from a portability perspective, to be honest.

I'd rather see the feature enabled by default without any assumptions.
Using it with a switch is just a use case that you happened to test.
It can allow new architectures to use your code tomorrow.

Sinan
-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-03-31 22:42           ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-03-31 22:42 UTC (permalink / raw)
  To: Sinan Kaya, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme



On 31/03/17 03:38 PM, Sinan Kaya wrote:
> On 3/31/2017 5:23 PM, Logan Gunthorpe wrote:
>> What exactly would you white/black list? It can't be the NIC or the
>> disk. If it's going to be a white/black list on the switch or root port
>> then you'd need essentially the same code to ensure they are all behind
>> the same switch or root port.
> 
> What is so special about being connected to the same switch?
> 
> Why don't we allow the feature by default and blacklist the root ports
> that don't work via a quirk?

Well root ports have the same issue here. There may be more than one
root port or other buses (ie QPI) between the devices in question. So
you can't just say "this system has root port X therefore we can always
use p2pmem". In the end, if you want to do any kind of restrictions
you're going to have to walk the tree, as the code currently does, and
figure out what's between the devices being used and black or white list
accordingly. Then seeing there's just such a vast number of devices out
there you'd almost certainly have to use some kind of white list and not
a black list. Then the question becomes which devices will be white
listed? The first to be listed would be switches seeing they will always
work. This is pretty much what we have (though it doesn't currently
cover multiple levels of switches). The next step, if someone wanted to
test with specific hardware, might be to allow the case where all the
devices are behind the same root port, which works on Intel Ivy Bridge
or newer.
However, I don't think a comprehensive white list should be a
requirement for this work to go forward and I don't think anything
you've suggested will remove any of the "ugliness".
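
For reference, a simplified sketch of the kind of check being described
(single switch level only, as in this RFC; not the posted code):

#include <linux/pci.h>

/* Endpoint -> switch downstream port -> switch upstream port. */
static struct pci_dev *upstream_switch_port(struct pci_dev *pdev)
{
	struct pci_dev *up = pci_upstream_bridge(pdev);

	if (!up || pci_pcie_type(up) != PCI_EXP_TYPE_DOWNSTREAM)
		return NULL;
	return pci_upstream_bridge(up);
}

static bool behind_same_switch(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *port = upstream_switch_port(a);

	return port && port == upstream_switch_port(b);
}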

What we discussed at LSF was that only allowing cases with a switch was
the simplest way to be sure any given setup would actually work.

> I'm looking at this from a portability perspective, to be honest.

I'm looking at this from the fact that there's a vast number of
topologies and devices involved, and figuring out which will work is
very complicated and could require a lot of hardware testing. The LSF
folks were primarily concerned with not having users enable the feature
and see breakage or terrible performance.

> I'd rather see the feature enabled by default without any assumptions.
> Using it with a switch is just a use case that you happened to test.
> It can allow new architectures to use your code tomorrow.

That's why I was advocating for letting userspace decide: if you're
setting up a system with this, you say to use a specific p2pmem device,
and then you are responsible for testing and benchmarking it and
deciding whether to use it going forward. However, this has received a
lot of pushback.

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-03-31 23:51             ` Sinan Kaya
  0 siblings, 0 replies; 545+ messages in thread
From: Sinan Kaya @ 2017-03-31 23:51 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

On 3/31/2017 6:42 PM, Logan Gunthorpe wrote:
> 
> 
> On 31/03/17 03:38 PM, Sinan Kaya wrote:
>> On 3/31/2017 5:23 PM, Logan Gunthorpe wrote:
>>> What exactly would you white/black list? It can't be the NIC or the
>>> disk. If it's going to be a white/black list on the switch or root port
>>> then you'd need essentially the same code to ensure they are all behind
>>> the same switch or root port.
>>
>> What is so special about being connected to the same switch?
>>
>> Why don't we allow the feature by default, and blacklist the root ports
>> that don't work via a quirk?
> 
> Well, root ports have the same issue here. There may be more than one
> root port or other buses (i.e. QPI) between the devices in question. So
> you can't just say "this system has root port X, therefore we can always
> use p2pmem".

We only care about devices on the data path between two devices.

> In the end, if you want to do any kind of restriction,
> you're going to have to walk the tree, as the code currently does, and
> figure out what's between the devices being used and black- or white-list
> accordingly. Then, seeing as there's such a vast number of devices out
> there, you'd almost certainly have to use some kind of white list and not
> a black list. Then the question becomes: which devices will be white
> listed?

How about a combination of blacklist + time bomb + peer-to-peer feature?

You can put a restriction in place with DMI/SMBIOS such that all devices
from 2016 onward work; anything older belongs on the blacklist.

> The first to be listed would be switches, seeing as they will
> always work. This is pretty much what we have (though it doesn't currently
> cover multiple levels of switches). The next step, if someone wanted to
> test with specific hardware, might be to allow the case where all the
> devices are behind the same root port on Intel Ivy Bridge or newer.

Sorry, I'm not familiar with Intel architecture. Based on what you just
wrote, I think I see your point.

I'm trying to generalize what you are doing to a somewhat bigger context
so that I can use it on another architecture like arm64, where I may or
may not have a switch.

The text below sort of repeats what you wrote above.

How about this:

The goal is to find a common parent between any two devices that need to
use your code.

- All bridges/switches on the data path need to support peer-to-peer;
otherwise stop.

- Make sure that none of the devices on the data path are blacklisted via
your code.

- If at least one device is blacklisted, we stop and the feature is not
allowed.

- If we find a common parent and no errors, you are good to go.

- We don't care whether devices above the common parent have some
feature X, Y, Z or not.

Maybe a little less code than what you have, but it is flexible and not
too hard to implement (a rough sketch follows below).
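
Something like this, perhaps (untested sketch of the common-parent
search; bridge_is_blacklisted() is a stub standing in for whatever
quirk table we end up with):

/* Stub policy hook; to be replaced by a real quirk/blacklist table. */
static bool bridge_is_blacklisted(struct pci_dev *bridge)
{
	return false;
}

/* Sketch: O(n^2) walk to find the lowest common upstream bridge. */
static struct pci_dev *pci_common_upstream(struct pci_dev *a,
					   struct pci_dev *b)
{
	struct pci_dev *up_a, *up_b;

	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a))
		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b))
			if (up_a == up_b)
				return up_a;
	return NULL;
}

static bool pci_p2p_path_ok(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *parent = pci_common_upstream(a, b);
	struct pci_dev *up;

	if (!parent)
		return false;	/* no common parent: no p2p */

	/* Only the bridges between each device and the parent matter. */
	for (up = pci_upstream_bridge(a); up != parent;
	     up = pci_upstream_bridge(up))
		if (bridge_is_blacklisted(up))
			return false;
	for (up = pci_upstream_bridge(b); up != parent;
	     up = pci_upstream_bridge(up))
		if (bridge_is_blacklisted(up))
			return false;

	return !bridge_is_blacklisted(parent);
}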

Well, the code is in RFC. I don't see why we can't remove some restrictions
and still have your code move forward. 

> However, I don't think a comprehensive white list should be a
> requirement for this work to go forward and I don't think anything
> you've suggested will remove any of the "ugliness".

I don't think the ask above is a very big deal. If you feel like
addressing this on another patchset like you suggested in your cover letter,
I'm fine with that too.

> 
> What we discussed at LSF was that only allowing cases with a switch was
> the simplest way to be sure any given setup would actually work.
> 
>> I'm looking at this from portability perspective to be honest.
> 
> I'm looking at this from the fact that there's a vast number of
> topologies and devices involved, and figuring out which will work is
> very complicated and could require a lot of hardware testing. The LSF
> folks were primarily concerned with not having users enable the feature
> and see breakage or terrible performance.
> 
>> I'd rather see the feature enabled by default without any assumptions.
>> Using it with a switch is just a use case that you happened to test.
>> It can allow new architectures to use your code tomorrow.
> 
> That's why I was advocating for letting userspace decide: if you're
> setting up a system with this, you say to use a specific p2pmem device,
> and then you are responsible for testing and benchmarking it and
> deciding whether to use it going forward. However, this has received a
> lot of push back.

Yeah, we shouldn't trust userspace for such things.

> 
> Logan
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-04-01  1:57               ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-01  1:57 UTC (permalink / raw)
  To: Sinan Kaya, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme



On 31/03/17 05:51 PM, Sinan Kaya wrote:
> You can put a restriction in place with DMI/SMBIOS such that all devices
> from 2016 onward work; anything older belongs on the blacklist.

How do you get a manufacturing date for a given device within the
kernel? Is this actually something generically available?

Logan


^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-04-01  2:17                 ` okaya@codeaurora.org
  0 siblings, 0 replies; 545+ messages in thread
From: okaya@codeaurora.org @ 2017-04-01  2:17 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Jason Gunthorpe, James E.J. Bottomley,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch, linux-scsi,
	Max Gurtovoy, Christoph Hellwig

On 2017-03-31 21:57, Logan Gunthorpe wrote:
> On 31/03/17 05:51 PM, Sinan Kaya wrote:
>> You can put a restriction in place with DMI/SMBIOS such that all
>> devices from 2016 onward work; anything older belongs on the blacklist.
> 
> How do you get a manufacturing date for a given device within the
> kernel? Is this actually something generically available?
> 
> Logan

SMBIOS calls are used all over the place in the kernel for introducing new
functionality while maintaining backwards compatibility.

See drivers/pci and drivers/acpi directory.
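
For example (rough sketch only; the 2016 cutoff is an illustrative
value, and the policy around it is exactly what's under discussion):

#include <linux/dmi.h>

/* Sketch: use the DMI/SMBIOS BIOS date as a proxy for system age. */
static bool bios_date_is_recent(void)
{
	int year = 0;

	/* dmi_get_date() parses DMI_BIOS_DATE ("mm/dd/yyyy"). */
	if (!dmi_get_date(DMI_BIOS_DATE, &year, NULL, NULL))
		return false;	/* no DMI data: assume old/unknown */

	return year >= 2016;	/* example cutoff, not a recommendation */
}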

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-04-01 22:16                   ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-01 22:16 UTC (permalink / raw)
  To: okaya
  Cc: Jens Axboe, Jason Gunthorpe, James E.J. Bottomley,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch, linux-scsi,
	Max Gurtovoy, Christoph Hellwig

Hey,

On 31/03/17 08:17 PM, okaya@codeaurora.org wrote:
> See drivers/pci and drivers/acpi directory.

The best I could find was the date of the firmware/BIOS. I really don't
think it makes sense to tie the two together. And really, the more I
think about it, trying to do a date cutoff for this seems crazy without
very comprehensive hardware testing. I have no idea which AMD chips
have decent root ports for this, and if we include all of ARM and
POWERPC, etc., there's a huge amount of unknown hardware. Saying that the
system's firmware has to be written after 2016 seems like an arbitrary
restriction that isn't likely to correlate with any working systems.

I still say the only sane thing to do is allow all switches and then add
a whitelist of root ports that are known to work well. If we care about
preventing broken systems in a comprehensive way then that's the only
thing that is going to work.
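
Concretely, that whitelist could just be an ordinary pci_device_id
table (sketch only; the entry below is a made-up placeholder, not
hardware anyone has validated):

/* Sketch of a root port whitelist; entries are illustrative only. */
static const struct pci_device_id p2p_root_whitelist[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x0000) },	/* placeholder ID */
	{ }
};

static bool root_port_whitelisted(struct pci_dev *root)
{
	return pci_match_id(p2p_root_whitelist, root) != NULL;
}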

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-04-02  2:26                     ` Sinan Kaya
  0 siblings, 0 replies; 545+ messages in thread
From: Sinan Kaya @ 2017-04-02  2:26 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Jason Gunthorpe, James E.J. Bottomley,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch,
	Alex Williamson, linux-scsi, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

Hi Logan,

I added Alex and Bjorn above.

On 4/1/2017 6:16 PM, Logan Gunthorpe wrote:
> Hey,
> 
> On 31/03/17 08:17 PM, okaya@codeaurora.org wrote:
>> See drivers/pci and drivers/acpi directory.
> 
> The best I could find was the date of the firmware/BIOS. I really don't
> think it makes sense to tie the two together. And really, the more I
> think about it, trying to do a date cutoff for this seems crazy without
> very comprehensive hardware testing. I have no idea which AMD chips
> have decent root ports for this, and if we include all of ARM and
> POWERPC, etc., there's a huge amount of unknown hardware. Saying that the
> system's firmware has to be written after 2016 seems like an arbitrary
> restriction that isn't likely to correlate with any working systems.

I recommended a combination of blacklist + p2p capability + BIOS date.
Not just BIOS date. BIOS date by itself is useless.

As you may or may not be aware, PCI defines capability registers for
discovering features. Unfortunately, there is no direct p2p capability
register. 

However, the Access Control Services (ACS) capability register has flags
indicating p2p functionality, so the p2p feature needs to be discovered
from ACS.

https://pdos.csail.mit.edu/~sbw/links/ECN_access_control_061011.pdf

This is just one of the many P2P capability flags.

"ACS P2P Request Redirect: must be implemented by Root Ports that support peer-to-peer
traffic with other Root Ports5; must be implemented by Switch Downstream Ports."

If the root port or a switch does not have the ACS capability, p2p is not
allowed. If these p2p flags are not set, don't allow the p2p feature.

The normal expectation from any system (root port/switch) is not to set
these bits unless the p2p feature is present and working.

However, there could be systems in the field with ACS capability but broken HW
or broken FW. 

This is when the BIOS date helps so that you don't break existing systems.

The right thing, in my opinion, is:

1. Blacklist by PCI vendor/device ID like any other PCI quirk in quirks.c.
2. Require this feature for recent HW/BIOS by checking the BIOS date.
3. Check the p2p capability from ACS (a rough sketch of this check
follows below).
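
Untested sketch of step 3, for one root port or switch downstream port
(this only reads the advertised capability bits; whether that implies
working, performant p2p is a separate question):

/* Sketch: does this bridge advertise ACS P2P Request Redirect? */
static bool bridge_acs_p2p_capable(struct pci_dev *bridge)
{
	int pos;
	u16 cap;

	pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
	if (!pos)
		return false;	/* no ACS capability at all */

	pci_read_config_word(bridge, pos + PCI_ACS_CAP, &cap);

	return cap & PCI_ACS_RR;	/* P2P Request Redirect supported */
}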

> 
> I still say the only sane thing to do is allow all switches and then add
> a whitelist of root ports that are known to work well. If we care about
> preventing broken systems in a comprehensive way then that's the only
> thing that is going to work.

We can't guarantee all switches will work either. See above for
instructions on when this feature should be enabled.

Let's step back for a moment.

If we think about logical blocks here, p2pmem is a PCI user. It should
not walk the bus and search for possible good things by itself. We don't
usually put code into the kernel's driver directory that is specific to
one arch or one set of devices. There are hundreds of device drivers in
the kernel. None of them are guaranteed to work on every architecture,
but they don't prohibit use either.

System integrators like me test these drivers against their own systems,
find bugs, remove arch-specific assumptions, and post patches.

p2pmem is potentially just one of the many users of p2p capability in the
system.

This p2p detection needs to be done by some p2p driver inside the 
drivers/pci directory or inside drivers/pci/probe.c.

This p2p driver needs to verify ACS permissions similar to what
pci_device_group() does.

If the system is p2p capable, this p2p driver sets p2p_capable bit in 
struct pci_dev.

p2pmem driver then uses this bit to decide when it should enable its feature.
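
From the consumer side that would look roughly like this (sketch only;
p2p_capable is a hypothetical new bit in struct pci_dev that this
proposal would add, it does not exist today):

static int p2pmem_pci_probe(struct pci_dev *pdev,
			    const struct pci_device_id *id)
{
	/* p2p_capable: hypothetical bit set by the PCI core's detection */
	if (!pdev->p2p_capable)
		return -ENODEV;

	/* ... register the p2pmem device as before ... */
	return 0;
}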

Bjorn and Alex need to decide on the final solution, as they maintain
PCI and virtualization (ACS) respectively.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
@ 2017-04-02 17:21                       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-02 17:21 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Jens Axboe, Jason Gunthorpe, James E.J. Bottomley,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch,
	Alex Williamson, linux-scsi, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig



On 01/04/17 08:26 PM, Sinan Kaya wrote:
> I recommended a combination of blacklist + p2p capability + BIOS date.
> Not just BIOS date. BIOS date by itself is useless.

Well this proposal doesn't work for me at all. None of my hardware has
the p2p ACS capability and my BIOS date is in 2013 and yet my switch
works perfectly fine. You're going to have to make the case that ACS p2p
capabilities are somehow correlated with a device's ability to move TLPs
between ports with reasonable performance. (For example my sandy bridge
CPU does support p2p transactions fine, it just doesn't have great
performance.) The documentation doesn't suggest this, nor can I even find
(via Google) any lspci dump suggesting there is hardware that sets
this p2p capability. The ACS P2P flag is meant to indicate something
completely different from what you are proposing to use it for: it's
meant to indicate the ability to manage permissions of p2p-destined TLPs,
not the ability to efficiently transfer them.

> This is when the BIOS date helps so that you don't break existing systems.

I'm not that worried about this code breaking existing systems. There
are significant trade-offs with using p2pmem (i.e. you are quite likely
sacrificing performance for memory QoS or upstream PCI bandwidth), and
therefore the user _has_ to specifically say to use it. This is why
we've put a flag in the nvme target code that defaults to off. Thus we
are not going to have a situation where people upgrade their kernels and
see broken or slow systems. People _have_ to make the decision to turn
it on and decide based on their use case whether it's appropriate.
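
For reference, a minimal sketch of what such an off-by-default knob
looks like as a configfs attribute (the attribute name and the
to_example_port()/allow_p2pmem helpers below are hypothetical stand-ins,
not the actual target code):

#include <linux/configfs.h>

/* Sketch only; the flag defaults to false so nothing changes on upgrade. */
static ssize_t example_port_p2pmem_show(struct config_item *item,
					char *page)
{
	return sprintf(page, "%d\n", to_example_port(item)->allow_p2pmem);
}

static ssize_t example_port_p2pmem_store(struct config_item *item,
					 const char *page, size_t count)
{
	bool val;

	if (kstrtobool(page, &val))
		return -EINVAL;

	to_example_port(item)->allow_p2pmem = val;
	return count;
}

CONFIGFS_ATTR(example_port_, p2pmem);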

> We can't guarantee all switches will work either. See above for instructions
> on when this feature should be enabled.

It's a lot easier to say that all switches will work than it is for root
ports. This is essentially what switches are designed for, so I'd be
surprised to find one that doesn't work. Root ports are the trouble here,
since it's a lot more likely for them to be designed without
considering that traffic needs to move between ports efficiently. If we
do find extremely broken switches that don't support this then we'd
probably want to create a blacklist for that. Also, there are
significantly fewer PCI switch products on the market than there are
root port instances, so a blacklist would be much easier to manage there.
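
If such a broken switch did turn up, the blacklist could be as simple as
something like this (the IDs below are placeholders, not real devices):

/* Hypothetical blacklist of switches known to mishandle p2p TLPs. */
static const struct pci_device_id p2p_broken_switches[] = {
	{ PCI_DEVICE(0x1234, 0xabcd) },	/* placeholder vendor/device */
	{ }
};

static bool pci_p2p_switch_blacklisted(struct pci_dev *dev)
{
	return pci_match_id(p2p_broken_switches, dev) != NULL;
}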

> If we think about logical blocks here, p2pmem is a pci user. 

Well, technically, the only thing that ties p2pmem to pci is the concept
of which devices to allow its use with. There's absolutely no reason
why any other bus couldn't use the same code and just say any devices on
that bus allow p2pmem.

> It should
> not walk the bus and search for possible good things by itself. We don't
> usually put code into the kernel's driver directory for specific archs/
> specific devices. There are hundreds of device drivers in the kernel.
> None of them are guaranteed to work on every architecture, but they don't
> prohibit use either.

I'd agree that the final code for determining p2p capability should
belong in the pci code. Or more likely an even more generic interface
with struct device that is bus agnostic. Though, I'd hope that a lot of
this could happen later when there are more kernel users actually
wanting to use this code. It's hard to design a generic interface when
you only have one user at present.
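
Purely as a strawman for that bus-agnostic direction (nothing below
exists today; it's only meant to show the shape of such an interface):

/*
 * Strawman: a bus could advertise a hook answering whether p2p between
 * two of its devices is expected to work. p2pmem would consult this
 * instead of walking PCI topology itself.
 */
typedef bool (*p2p_allowed_fn)(struct device *a, struct device *b);

static bool device_p2p_allowed(struct device *a, struct device *b,
			       p2p_allowed_fn bus_hook)
{
	if (!bus_hook || a->bus != b->bus)
		return false;

	return bus_hook(a, b);
}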

> p2pmem is potentially just one of the many users of p2p capability in the
> system.

Yup, we've had similar feedback from Max. However, without knowing the
needs of a generic p2p device at this point, it's hard to consider this
at all. I am open to it though.

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
  2017-04-02 17:21                       ` Logan Gunthorpe
@ 2017-04-02 21:03                         ` Sinan Kaya
  -1 siblings, 0 replies; 545+ messages in thread
From: Sinan Kaya @ 2017-04-02 21:03 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Jason Gunthorpe, James E.J. Bottomley,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch,
	Alex Williamson, linux-scsi, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig

On 4/2/2017 1:21 PM, Logan Gunthorpe wrote:
>> This is when the BIOS date helps so that you don't break existing systems.
> I'm not that worried about this code breaking existing systems. There
> are significant trade-offs with using p2pmem (ie. you are quite likely
> sacrificing performance for memory QOS or upstream PCI bandwidth), and
> therefore the user _has_ to specifically say to use it. This is why
> we've put a flag in the nvme target code that defaults to off. Thus we
> are not going to have a situation where people upgrade their kernels and
> see broken or slow systems. People _have_ to make the decision to turn
> it on and decide based on their use case whether it's appropriate.
> 

OK. I didn't know the feature was not enabled by default. This is even 
easier now. 

Push the decision all the way to the user. Let them decide whether they
want this feature to work on a root port connected port or under the
switch.

>> We can't guarantee all switches will work either. See above for instructions
>> on when this feature should be enabled.
> It's a lot easier to say that all switches will work than it is for root
> ports. This is essentially what switches are designed for, so I'd be
> surprised to find one that doesn't work. Root ports are the trouble here
> seeing it's a lot more likely for them to be designed without
> considering that traffic needs to move between ports efficiently. If we
> do find extremely broken switches that don't support this then we'd
> probably want to create a black list for that. Also, there's
> significantly fewer PCI switch products on the market than there are
> root port instances, so a black list would be much easier to manage there.
> 

I thought the issue was that the feature didn't work at all with some root
ports, or that there was some kind of memory corruption issue that you were
trying to avoid on existing systems.

If you are just worried about performance, the switch recommendation
belongs in your particular product's tuning guide or a howto document,
not in the actual code itself.

I think you should get rid of all the pci searching business in your code.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device
  2017-04-02 21:03                         ` Sinan Kaya
@ 2017-04-03  4:26                           ` Logan Gunthorpe
  -1 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-03  4:26 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Jens Axboe, Jason Gunthorpe, James E.J. Bottomley,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch,
	Alex Williamson, linux-scsi, Bjorn Helgaas, Max Gurtovoy,
	Christoph Hellwig



On 02/04/17 03:03 PM, Sinan Kaya wrote:
> Push the decision all the way to the user. Let them decide whether they
> want this feature to work on a root port connected port or under the
> switch.

Yes, I prefer this too. If other folks agree with that I'd be very happy
to go back to user chooses. I think Sagi was the most vocal proponent
for kernel chooses at LSF so hopefully he will read this thread and
offer some opinion.

> I thought the issue was feature didn't work at all with some root ports
> or there was some kind of memory corruption issue that you were trying to
> avoid with the existing systems.

I *think* there are some much older root ports where P2P TLPs don't even
get through. But it doesn't really change the situation: in the nvmet
case, the user would enable p2pmem and then be unable to connect and
thus choose to disable it going forward. Not a big difference from the
user seeing bad performance and not choosing to enable it.

> I think you should get rid of all pci searching business in your code.

Yes, my original proposal was that when you configure the nvme target you
choose the specific p2pmem device to use. That code had no tie-ins to PCI
code and could, in theory, work generically with any device and bus.

Logan



^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-04-03 21:20         ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-03 21:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, James E.J. Bottomley, linux-scsi,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Jason Gunthorpe,
	Max Gurtovoy, Christoph Hellwig

Hi Christoph,

What are your thoughts on an approach like the following untested
draft patch?

The patch (if fleshed out) makes it so iomem can be used in an sgl
and WARN_ONs will occur in places where drivers attempt to access
iomem directly through the sgl.

I'd also probably create a p2pmem_alloc_sgl helper function so driver
writers wouldn't have to mess with sg_set_iomem_page.
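
Roughly something like this (a sketch only: p2pmem_alloc() is this RFC's
allocator and is assumed to return page-aligned ZONE_DEVICE memory;
sg_set_iomem_page() is from the draft below):

/* Sketch of the helper; single-entry table for simplicity. */
static int p2pmem_alloc_sgl(struct p2pmem_dev *p, struct sg_table *sgt,
			    size_t size)
{
	void *addr;
	int rc;

	rc = sg_alloc_table(sgt, 1, GFP_KERNEL);
	if (rc)
		return rc;

	addr = p2pmem_alloc(p, size);
	if (!addr) {
		sg_free_table(sgt);
		return -ENOMEM;
	}

	sg_set_iomem_page(sgt->sgl, virt_to_page(addr), size,
			  offset_in_page(addr));
	return 0;
}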

With all that in place, it should be relatively safe for drivers to
implement p2pmem even though we'd still technically be violating the
__iomem boundary in some places.

Logan


commit b435a154a4ec4f82766f6ab838092c3c5a9388ac
Author: Logan Gunthorpe <logang@deltatee.com>
Date:   Wed Feb 8 12:44:52 2017 -0700

    scatterlist: Add support for iomem pages

    This patch steals another bit from the page_link field to indicate the
    sg points to iomem. In sg_copy_buffer we use this flag to select
    between memcpy and memcpy_to/fromio. Other sg_miter users will get a
    WARN_ON unless they indicate they support io memory by setting the
    SG_MITER_IOMEM flag.

    Also added are sg_kmap functions which would replace a common pattern
    of kmap(sg_page(sg)). These new functions then also warn if the caller
    tries to map io memory. Another option may be to automatically copy
    the iomem to a new page and return that transparently to the driver.

    Another coccinelle patch would then be done to convert kmap(sg_page(sg))
    instances to the appropriate sg_kmap calls.

    Signed-off-by: Logan Gunthorpe <logang@deltatee.com>

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0007b79..bd690a2c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,9 @@

 #include <uapi/linux/dma-buf.h>

+/* Keep the highmem.h macro from aliasing our ops->kunmap_atomic */
+#undef kunmap_atomic
+
 static inline int is_dma_buf_file(struct file *);

 struct dma_buf_list {
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..7608da0 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <asm/io.h>

 struct scatterlist {
@@ -53,6 +54,9 @@ struct sg_table {
  *
  * If bit 1 is set, then this sg entry is the last element in a list.
  *
+ * We also use bit 2 to indicate whether the page_link points to an
+ * iomem page or not.
+ *
  * See sg_next().
  *
  */
@@ -64,10 +68,17 @@ struct sg_table {
  * a valid sg entry, or whether it points to the start of a new scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define sg_is_chain(sg)		((sg)->page_link & 0x01)
-#define sg_is_last(sg)		((sg)->page_link & 0x02)
+#define PAGE_LINK_MASK	0x7
+#define PAGE_LINK_CHAIN	0x1
+#define PAGE_LINK_LAST	0x2
+#define PAGE_LINK_IOMEM	0x4
+
+#define sg_is_chain(sg)		((sg)->page_link & PAGE_LINK_CHAIN)
+#define sg_is_last(sg)		((sg)->page_link & PAGE_LINK_LAST)
 #define sg_chain_ptr(sg)	\
-	((struct scatterlist *) ((sg)->page_link & ~0x03))
+	((struct scatterlist *) ((sg)->page_link & ~(PAGE_LINK_CHAIN | \
+						     PAGE_LINK_LAST)))
+#define sg_is_iomem(sg)		((sg)->page_link & PAGE_LINK_IOMEM)

 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -81,13 +92,13 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-	unsigned long page_link = sg->page_link & 0x3;
+	unsigned long page_link = sg->page_link & PAGE_LINK_MASK;

 	/*
 	 * In order for the low bit stealing approach to work, pages
-	 * must be aligned at a 32-bit boundary as a minimum.
+	 * must be aligned at a 64-bit boundary as a minimum.
 	 */
-	BUG_ON((unsigned long) page & 0x03);
+	BUG_ON((unsigned long) page & PAGE_LINK_MASK);
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 	BUG_ON(sg_is_chain(sg));
@@ -117,13 +128,56 @@ static inline void sg_set_page(struct scatterlist *sg, struct page *page,
 	sg->length = len;
 }

+/**
+ * sg_set_iomem_page - Set sg entry to point at given iomem page
+ * @sg:		 SG entry
+ * @page:	 The page
+ * @len:	 Length of data
+ * @offset:	 Offset into page
+ *
+ * Description:
+ *   Same as sg_set_page but used when the page is a ZONE_DEVICE page that
+ *   points to IO memory.
+ *
+ **/
+static inline void sg_set_iomem_page(struct scatterlist *sg, struct page *page,
+				     unsigned int len, unsigned int offset)
+{
+	sg_set_page(sg, page, len, offset);
+	sg->page_link |= PAGE_LINK_IOMEM;
+}
+
 static inline struct page *sg_page(struct scatterlist *sg)
 {
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 	BUG_ON(sg_is_chain(sg));
 #endif
-	return (struct page *)((sg)->page_link & ~0x3);
+	return (struct page *)((sg)->page_link & ~PAGE_LINK_MASK);
+}
+
+static inline void *sg_kmap(struct scatterlist *sg)
+{
+	WARN_ON(sg_is_iomem(sg));
+
+	return kmap(sg_page(sg));
+}
+
+static inline void sg_kunmap(struct scatterlist *sg, void *addr)
+{
+	kunmap(sg_page(sg));	/* kunmap() takes the page, not the mapping */
+}
+
+static inline void *sg_kmap_atomic(struct scatterlist *sg)
+{
+	WARN_ON(sg_is_iomem(sg));
+
+	return kmap_atomic(sg_page(sg));
+}
+
+static inline void sg_kunmap_atomic(struct scatterlist *sg, void *addr)
+{
+	kunmap_atomic(addr);
 }

 /**
@@ -171,7 +225,8 @@ static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents,
 	 * Set lowest bit to indicate a link pointer, and make sure to clear
 	 * the termination bit if it happens to be set.
 	 */
-	prv[prv_nents - 1].page_link = ((unsigned long) sgl | 0x01) & ~0x02;
+	prv[prv_nents - 1].page_link =
+		((unsigned long) sgl & ~PAGE_LINK_MASK) | PAGE_LINK_CHAIN;
 }

 /**
@@ -191,8 +246,8 @@ static inline void sg_mark_end(struct scatterlist *sg)
 	/*
 	 * Set termination bit, clear potential chain bit
 	 */
-	sg->page_link |= 0x02;
-	sg->page_link &= ~0x01;
+	sg->page_link &= ~PAGE_LINK_CHAIN;	/* keep any PAGE_LINK_IOMEM bit */
+	sg->page_link |= PAGE_LINK_LAST;
 }

 /**
@@ -208,7 +263,7 @@ static inline void sg_unmark_end(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 #endif
-	sg->page_link &= ~0x02;
+	sg->page_link &= ~PAGE_LINK_LAST;
 }

 /**
@@ -383,6 +438,7 @@ static inline dma_addr_t sg_page_iter_dma_address(struct sg_page_iter *piter)
 #define SG_MITER_ATOMIC		(1 << 0)	 /* use kmap_atomic */
 #define SG_MITER_TO_SG		(1 << 1)	/* flush back to phys on unmap */
 #define SG_MITER_FROM_SG	(1 << 2)	/* nop */
+#define SG_MITER_IOMEM		(1 << 3)	/* support iomem in miter ops */

 struct sg_mapping_iter {
 	/* the following three fields can be accessed directly */
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c6cf822..6d8f39b 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -580,6 +580,9 @@ bool sg_miter_next(struct sg_mapping_iter *miter)
 	if (!sg_miter_get_next_page(miter))
 		return false;

+	if (!(miter->__flags & SG_MITER_IOMEM))
+		WARN_ON(sg_is_iomem(miter->piter.sg));
+
 	miter->page = sg_page_iter_page(&miter->piter);
 	miter->consumed = miter->length = miter->__remaining;

@@ -651,7 +654,7 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 {
 	unsigned int offset = 0;
 	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC;
+	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_IOMEM;

 	if (to_buffer)
 		sg_flags |= SG_MITER_FROM_SG;
@@ -668,10 +671,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,

 		len = min(miter.length, buflen - offset);

-		if (to_buffer)
-			memcpy(buf + offset, miter.addr, len);
-		else
-			memcpy(miter.addr, buf + offset, len);
+		if (sg_is_iomem(miter.piter.sg)) {
+			if (to_buffer)
+				memcpy_fromio(buf + offset, miter.addr, len);
+			else
+				memcpy_toio(miter.addr, buf + offset, len);
+		} else {
+			if (to_buffer)
+				memcpy(buf + offset, miter.addr, len);
+			else
+				memcpy(miter.addr, buf + offset, len);
+		}

 		offset += len;
 	}
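
For what it's worth, an opted-in miter user would then look roughly like
this (a sketch against the draft above, mirroring what sg_copy_buffer
does):

/* Copies out of a (possibly iomem-backed) sgl after opting in. */
static size_t copy_from_sgl_example(struct scatterlist *sgl,
				    unsigned int nents,
				    void *buf, size_t buflen)
{
	struct sg_mapping_iter miter;
	size_t copied = 0;

	sg_miter_start(&miter, sgl, nents,
		       SG_MITER_ATOMIC | SG_MITER_FROM_SG | SG_MITER_IOMEM);

	while (copied < buflen && sg_miter_next(&miter)) {
		size_t len = min(miter.length, buflen - copied);

		if (sg_is_iomem(miter.piter.sg))
			memcpy_fromio(buf + copied, miter.addr, len);
		else
			memcpy(buf + copied, miter.addr, len);

		copied += len;
	}

	sg_miter_stop(&miter);
	return copied;
}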

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-04-03 21:20         ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-03 21:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jason Gunthorpe, James E.J. Bottomley,
	Martin K. Petersen, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA, Steve Wise,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Keith Busch,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, Max Gurtovoy,
	Christoph Hellwig

Hi Christoph,

What are your thoughts on an approach like the following untested
draft patch.

The patch (if fleshed out) makes it so iomem can be used in an sgl
and WARN_ONs will occur in places where drivers attempt to access
iomem directly through the sgl.

I'd also probably create a p2pmem_alloc_sgl helper function so driver
writers wouldn't have to mess with sg_set_iomem_page.

With all that in place, it should be relatively safe for drivers to
implement p2pmem even though we'd still technically be violating the
__iomem boundary in some places.

Logan


commit b435a154a4ec4f82766f6ab838092c3c5a9388ac
Author: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
Date:   Wed Feb 8 12:44:52 2017 -0700

    scatterlist: Add support for iomem pages

    This patch steals another bit from the page_link field to indicate the
    sg points to iomem. In sg_copy_buffer we use this flag to select
    between memcpy and iomemcpy. Other sg_miter users will get an WARN_ON
    unless they indicate they support iomemory by setting the
    SG_MITER_IOMEM flag.

    Also added are sg_kmap functions which would replace a common pattern
    of kmap(sg_page(sg)). These new functions then also warn if the caller
    tries to map io memory. Another option may be to automatically copy
    the iomem to a new page and return that transparently to the driver.

    Another coccinelle patch would then be done to convert kmap(sg_page(sg))
    instances to the appropriate sg_kmap calls.

    Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0007b79..bd690a2c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,9 @@

 #include <uapi/linux/dma-buf.h>

+/* Avoid the highmem.h macro from aliasing our ops->kunmap_atomic */
+#undef kunmap_atomic
+
 static inline int is_dma_buf_file(struct file *);

 struct dma_buf_list {
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..7608da0 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <asm/io.h>

 struct scatterlist {
@@ -53,6 +54,9 @@ struct sg_table {
  *
  * If bit 1 is set, then this sg entry is the last element in a list.
  *
+ * We also use bit 2 to indicate whether the page_link points to an
+ * iomem page or not.
+ *
  * See sg_next().
  *
  */
@@ -64,10 +68,17 @@ struct sg_table {
  * a valid sg entry, or whether it points to the start of a new
scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define sg_is_chain(sg)		((sg)->page_link & 0x01)
-#define sg_is_last(sg)		((sg)->page_link & 0x02)
+#define PAGE_LINK_MASK	0x7
+#define PAGE_LINK_CHAIN	0x1
+#define PAGE_LINK_LAST	0x2
+#define PAGE_LINK_IOMEM	0x4
+
+#define sg_is_chain(sg)		((sg)->page_link & PAGE_LINK_CHAIN)
+#define sg_is_last(sg)		((sg)->page_link & PAGE_LINK_LAST)
 #define sg_chain_ptr(sg)	\
-	((struct scatterlist *) ((sg)->page_link & ~0x03))
+	((struct scatterlist *) ((sg)->page_link & ~(PAGE_LINK_CHAIN | \
+						     PAGE_LINK_LAST)))
+#define sg_is_iomem(sg)		((sg)->page_link & PAGE_LINK_IOMEM)

 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -81,13 +92,13 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page
*page)
 {
-	unsigned long page_link = sg->page_link & 0x3;
+	unsigned long page_link = sg->page_link & PAGE_LINK_MASK;

 	/*
 	 * In order for the low bit stealing approach to work, pages
-	 * must be aligned at a 32-bit boundary as a minimum.
+	 * must be aligned at a 64-bit boundary as a minimum.
 	 */
-	BUG_ON((unsigned long) page & 0x03);
+	BUG_ON((unsigned long) page & PAGE_LINK_MASK);
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 	BUG_ON(sg_is_chain(sg));
@@ -117,13 +128,56 @@ static inline void sg_set_page(struct scatterlist
*sg, struct page *page,
 	sg->length = len;
 }

+/**
+ * sg_set_page - Set sg entry to point at given iomem page
+ * @sg:		 SG entry
+ * @page:	 The page
+ * @len:	 Length of data
+ * @offset:	 Offset into page
+ *
+ * Description:
+ *   Same as sg_set_page but used when the page is a ZONE_DEVICE page that
+ *   points to IO memory.
+ *
+ **/
+static inline void sg_set_iomem_page(struct scatterlist *sg, struct
page *page,
+				     unsigned int len, unsigned int offset)
+{
+	sg_set_page(sg, page, len, offset);
+	sg->page_link |= PAGE_LINK_IOMEM;
+}
+
 static inline struct page *sg_page(struct scatterlist *sg)
 {
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 	BUG_ON(sg_is_chain(sg));
 #endif
-	return (struct page *)((sg)->page_link & ~0x3);
+	return (struct page *)((sg)->page_link & ~PAGE_LINK_MASK);
+}
+
+static inline void *sg_kmap(struct scatterlist *sg)
+{
+	WARN_ON(sg_is_iomem(sg));
+
+	return kmap(sg_page(sg));
+}
+
+static inline void sg_kunmap(struct scatterlist *sg, void *addr)
+{
+	kunmap(addr);
+}
+
+static inline void *sg_kmap_atomic(struct scatterlist *sg)
+{
+	WARN_ON(sg_is_iomem(sg));
+
+	return kmap(sg_page(sg));
+}
+
+static inline void sg_kunmap_atomic(struct scatterlist *sg, void *addr)
+{
+	kunmap_atomic(addr);
 }

 /**
@@ -171,7 +225,8 @@ static inline void sg_chain(struct scatterlist *prv,
unsigned int prv_nents,
 	 * Set lowest bit to indicate a link pointer, and make sure to clear
 	 * the termination bit if it happens to be set.
 	 */
-	prv[prv_nents - 1].page_link = ((unsigned long) sgl | 0x01) & ~0x02;
+	prv[prv_nents - 1].page_link =
+		((unsigned long) sgl & ~PAGE_LINK_MASK) | PAGE_LINK_CHAIN;
 }

 /**
@@ -191,8 +246,8 @@ static inline void sg_mark_end(struct scatterlist *sg)
 	/*
 	 * Set termination bit, clear potential chain bit
 	 */
-	sg->page_link |= 0x02;
-	sg->page_link &= ~0x01;
+	sg->page_link &= ~PAGE_LINK_MASK;
+	sg->page_link |= PAGE_LINK_LAST;
 }

 /**
@@ -208,7 +263,7 @@ static inline void sg_unmark_end(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 #endif
-	sg->page_link &= ~0x02;
+	sg->page_link &= ~PAGE_LINK_LAST;
 }

 /**
@@ -383,6 +438,7 @@ static inline dma_addr_t sg_page_iter_dma_address(struct sg_page_iter *piter)
 #define SG_MITER_ATOMIC		(1 << 0)	 /* use kmap_atomic */
 #define SG_MITER_TO_SG		(1 << 1)	/* flush back to phys on unmap */
 #define SG_MITER_FROM_SG	(1 << 2)	/* nop */
+#define SG_MITER_IOMEM		(1 << 3)	/* support iomem in miter ops */

 struct sg_mapping_iter {
 	/* the following three fields can be accessed directly */
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c6cf822..6d8f39b 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -580,6 +580,9 @@ bool sg_miter_next(struct sg_mapping_iter *miter)
 	if (!sg_miter_get_next_page(miter))
 		return false;

+	if (!(miter->__flags & SG_MITER_IOMEM))
+		WARN_ON(sg_is_iomem(miter->piter.sg));
+
 	miter->page = sg_page_iter_page(&miter->piter);
 	miter->consumed = miter->length = miter->__remaining;

@@ -651,7 +654,7 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
 {
 	unsigned int offset = 0;
 	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC;
+	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_IOMEM;

 	if (to_buffer)
 		sg_flags |= SG_MITER_FROM_SG;
@@ -668,10 +671,17 @@ size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,

 		len = min(miter.length, buflen - offset);

-		if (to_buffer)
-			memcpy(buf + offset, miter.addr, len);
-		else
-			memcpy(miter.addr, buf + offset, len);
+		if (sg_is_iomem(miter.piter.sg)) {
+			if (to_buffer)
+				memcpy_fromio(buf + offset, miter.addr, len);
+			else
+				memcpy_toio(miter.addr, buf + offset, len);
+		} else {
+			if (to_buffer)
+				memcpy(buf + offset, miter.addr, len);
+			else
+				memcpy(miter.addr, buf + offset, len);
+		}

 		offset += len;
 	}
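
(Illustration only, not part of the diff: a miter user that opts in
with SG_MITER_IOMEM could then copy safely out of a mixed sgl. The
helper name and its caller-supplied buffer are made up for the example.)

static size_t copy_from_mixed_sgl(struct scatterlist *sgl,
				  unsigned int nents, void *dst,
				  size_t buflen)
{
	struct sg_mapping_iter miter;
	size_t copied = 0;

	sg_miter_start(&miter, sgl, nents,
		       SG_MITER_ATOMIC | SG_MITER_FROM_SG | SG_MITER_IOMEM);

	while (copied < buflen && sg_miter_next(&miter)) {
		size_t len = min_t(size_t, miter.length, buflen - copied);

		/* the new flag tells us which copy routine is safe */
		if (sg_is_iomem(miter.piter.sg))
			memcpy_fromio(dst + copied, miter.addr, len);
		else
			memcpy(dst + copied, miter.addr, len);

		copied += len;
	}
	sg_miter_stop(&miter);

	return copied;
}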

^ permalink raw reply	[flat|nested] 545+ messages in thread


* Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-04-03 21:44           ` Dan Williams
  0 siblings, 0 replies; 545+ messages in thread
From: Dan Williams @ 2017-04-03 21:44 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, James E.J. Bottomley, linux-scsi,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Christoph Hellwig,
	Max Gurtovoy, Christoph Hellwig, Jason Gunthorpe

On Mon, Apr 3, 2017 at 2:20 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hi Christoph,
>
> What are your thoughts on an approach like the following untested
> draft patch.
>
> The patch (if fleshed out) makes it so iomem can be used in an sgl
> and WARN_ONs will occur in places where drivers attempt to access
> iomem directly through the sgl.
>
> I'd also probably create a p2pmem_alloc_sgl helper function so driver
> writers wouldn't have to mess with sg_set_iomem_page.
>
> With all that in place, it should be relatively safe for drivers to
> implement p2pmem even though we'd still technically be violating the
> __iomem boundary in some places.

Just reacting to this mail, I still haven't had a chance to take a
look at the rest of the series.

The pfn_t type was invented to carry extra type and page lookup
information about the memory behind a given pfn. At first glance that
seems a more natural place to carry an indication that this is an
"I/O" pfn.

^ permalink raw reply	[flat|nested] 545+ messages in thread


* Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-04-03 22:10             ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-03 22:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Keith Busch, James E.J. Bottomley, linux-scsi,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Christoph Hellwig,
	Max Gurtovoy, Christoph Hellwig, Jason Gunthorpe



On 03/04/17 03:44 PM, Dan Williams wrote:
> On Mon, Apr 3, 2017 at 2:20 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>> Hi Christoph,
>>
>> What are your thoughts on an approach like the following untested
>> draft patch.
>>
>> The patch (if fleshed out) makes it so iomem can be used in an sgl
>> and WARN_ONs will occur in places where drivers attempt to access
>> iomem directly through the sgl.
>>
>> I'd also probably create a p2pmem_alloc_sgl helper function so driver
>> writers wouldn't have to mess with sg_set_iomem_page.
>>
>> With all that in place, it should be relatively safe for drivers to
>> implement p2pmem even though we'd still technically be violating the
>> __iomem boundary in some places.
> 
> Just reacting to this mail, I still haven't had a chance to take a
> look at the rest of the series.
> 
> The pfn_t type was invented to carry extra type and page lookup
> information about the memory behind a given pfn. At first glance that
> seems a more natural place to carry an indication that this is an
> "I/O" pfn.

I agree... But what are the plans for pfn_t? Is anyone working on using
it in the scatterlist code? Currently it's not there yet, and given the
assertion that we will continue to be using struct page for DMA, is
that a direction we'd want to go?

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread


* Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-04-03 22:47               ` Dan Williams
  0 siblings, 0 replies; 545+ messages in thread
From: Dan Williams @ 2017-04-03 22:47 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, James E.J. Bottomley, linux-scsi,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Christoph Hellwig,
	Max Gurtovoy, Christoph Hellwig, Jason Gunthorpe

On Mon, Apr 3, 2017 at 3:10 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 03/04/17 03:44 PM, Dan Williams wrote:
>> On Mon, Apr 3, 2017 at 2:20 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>> Hi Christoph,
>>>
>>> What are your thoughts on an approach like the following untested
>>> draft patch.
>>>
>>> The patch (if fleshed out) makes it so iomem can be used in an sgl
>>> and WARN_ONs will occur in places where drivers attempt to access
>>> iomem directly through the sgl.
>>>
>>> I'd also probably create a p2pmem_alloc_sgl helper function so driver
>>> writers wouldn't have to mess with sg_set_iomem_page.
>>>
>>> With all that in place, it should be relatively safe for drivers to
>>> implement p2pmem even though we'd still technically be violating the
>>> __iomem boundary in some places.
>>
>> Just reacting to this mail, I still haven't had a chance to take a
>> look at the rest of the series.
>>
>> The pfn_t type was invented to carry extra type and page lookup
>> information about the memory behind a given pfn. At first glance that
>> seems a more natural place to carry an indication that this is an
>> "I/O" pfn.
>
> I agree... But what are the plans for pfn_t? Is anyone working on using
> it in the scatterlist code? Currently it's not there yet, and given the
> assertion that we will continue to be using struct page for DMA, is
> that a direction we'd want to go?
>

I wouldn't necessarily conflate supporting pfn_t in the scatterlist
with the stalled struct-page-less DMA effort. A pfn_t_to_page()
conversion will still work and be required. However, you're right: the
minute we use pfn_t for this, we're into the realm of special-case
drivers that understand scatterlists with special "I/O-pfn_t" entries.
However, maybe that's what we want? I think peer-to-peer DMA is not a
general-purpose feature unless/until we get it standardized in PCI. So
maybe drivers with special-case scatterlist support are exactly what we
want for now.
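
A special-case driver in that world might branch on the pfn type
roughly like this (sg_pfn_t(), do_regular_dma() and do_p2p_dma() are
hypothetical names, just to sketch the shape):

pfn_t pfn = sg_pfn_t(sg);		/* hypothetical accessor */

if (pfn_t_has_page(pfn)) {
	/* ordinary memory: take the normal struct page path */
	do_regular_dma(pfn_t_to_page(pfn));
} else {
	/* "I/O" pfn: only the special-case peer-to-peer path applies */
	do_p2p_dma(pfn_t_to_pfn(pfn));
}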

Thoughts?

^ permalink raw reply	[flat|nested] 545+ messages in thread


* Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-04-03 23:12                 ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-03 23:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Keith Busch, James E.J. Bottomley, linux-scsi,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Christoph Hellwig,
	Max Gurtovoy, Christoph Hellwig, Jason Gunthorpe



On 03/04/17 04:47 PM, Dan Williams wrote:
> I wouldn't necessarily conflate supporting pfn_t in the scatterlist
> with the stalled struct-page-less DMA effort. A pfn_t_to_page()
> conversion will still work and be required. However you're right, the
> minute we use pfn_t for this we're into the realm of special case
> drivers that understand scatterlists with special "I/O-pfn_t" entries.

Well yes, it would certainly be possible to convert the scatterlist code
from page_link to pfn_t. (The only slightly tricky thing is that
scatterlist uses extra chaining bits and pfn_t uses extra flag bits so
they'd have to be harmonized somehow). But if we aren't moving toward
struct-page-less DMA, I fail to see the point of the conversion.

I'll definitely need IO scatterlists of some form or another and I like
pfn_t but right now it just seems like extra work with unclear benefit.
(Though, if someone told me that I can't use a third bit in the
page_link field then maybe that would be a good reason to move to pfn_t.)
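
As a rough sketch of that third bit (the SG_IOMEM name and mask values
here are assumptions mirroring the untested draft patch upthread, not a
quote of it):

    /* scatterlist already steals the two low bits of page_link for
     * chaining (0x01 marks a chain entry, 0x02 the last entry); the
     * idea is to claim one more bit to tag iomem-backed entries.
     * sg_page() and friends would need their masks widened from
     * ~0x03 to ~0x07 to match. */
    #define SG_IOMEM        0x04UL

    static inline void sg_set_iomem_page(struct scatterlist *sg,
                                         struct page *page,
                                         unsigned int len,
                                         unsigned int offset)
    {
            sg_set_page(sg, page, len, offset);
            sg->page_link |= SG_IOMEM;
    }

    static inline bool sg_is_iomem(struct scatterlist *sg)
    {
            return sg->page_link & SG_IOMEM;
    }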

> However, maybe that's what we want? I think peer-to-peer DMA is not a
> general purpose feature unless/until we get it standardized in PCI. So
> maybe drivers with special case scatterlist support is exactly what we
> want for now.

Well, I think this should be completely independent from PCI code. I see
no reason why we can't have infrastructure for DMA on iomem from any
bus. Largely all the work I've done in this area is completely agnostic
to the bus in use. (Except for any kind of white/black list when it is
used.)

The "special case scatterlist" is essentially what I'm proposing in the
patch I sent upthread, it just stores the flag in the page_link instead
of in a pfn_t.

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
@ 2017-04-04  0:07                   ` Dan Williams
  0 siblings, 0 replies; 545+ messages in thread
From: Dan Williams @ 2017-04-04  0:07 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jens Axboe, Keith Busch, James E.J. Bottomley, linux-scsi,
	Martin K. Petersen, linux-nvdimm, linux-rdma, linux-pci,
	Steve Wise, linux-kernel, linux-nvme, Christoph Hellwig,
	Max Gurtovoy, Christoph Hellwig, Jason Gunthorpe

On Mon, Apr 3, 2017 at 4:12 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 03/04/17 04:47 PM, Dan Williams wrote:
>> I wouldn't necessarily conflate supporting pfn_t in the scatterlist
>> with the stalled struct-page-less DMA effort. A pfn_t_to_page()
>> conversion will still work and be required. However you're right, the
>> minute we use pfn_t for this we're into the realm of special case
>> drivers that understand scatterlists with special "I/O-pfn_t" entries.
>
> Well yes, it would certainly be possible to convert the scatterlist code
> from page_link to pfn_t. (The only slightly tricky thing is that
> scatterlist uses extra chaining bits and pfn_t uses extra flag bits so
> they'd have to be harmonized somehow). But if we aren't moving toward
> struct-page-less DMA, I fail to see the point of the conversion.
>
> I'll definitely need IO scatterlists of some form or another and I like
> pfn_t but right now it just seems like extra work with unclear benefit.
> (Though, if someone told me that I can't use a third bit in the
> page_link field then maybe that would be a good reason to move to pfn_t.)
>
>> However, maybe that's what we want? I think peer-to-peer DMA is not a
>> general purpose feature unless/until we get it standardized in PCI. So
>> maybe drivers with special case scatterlist support is exactly what we
>> want for now.
>
> Well, I think this should be completely independent from PCI code. I see
> no reason why we can't have infrastructure for DMA on iomem from any
> bus. Largely all the work I've done in this area is completely agnostic
> to the bus in use. (Except for any kind of white/black list when it is
> used.)

The completely agnostic part is where I get worried, but I shouldn't
say any more until I actually read the patch. The worry is cases where
this agnostic enabling allows unsuspecting code paths to do the wrong
thing, like bypassing iomem safety.

> The "special case scatterlist" is essentially what I'm proposing in the
> patch I sent upthread, it just stores the flag in the page_link instead
> of in a pfn_t.

Makes sense. The suggestion of pfn_t was to try to get more type
safety throughout the stack. So that, again, unsuspecting code paths
that get an I/O pfn aren't able to do things like page_address() or
kmap() without failing.
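
As a sketch of the kind of guard that type safety would enable (this
reuses the hypothetical pfn_t_is_iomem() helper sketched upthread;
pfn_t_kmap() is likewise invented for illustration):

    #include <linux/highmem.h>
    #include <linux/pfn_t.h>

    static inline void *pfn_t_kmap(pfn_t pfn)
    {
            /* With a typed I/O flag, mapping helpers can refuse to
             * hand out a CPU mapping for BAR memory instead of
             * returning a bogus virtual address. */
            if (WARN_ON_ONCE(pfn_t_is_iomem(pfn)))
                    return NULL;
            return kmap(pfn_t_to_page(pfn));
    }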

I'll stop commenting now and set aside some time to go read the patches.

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
  2017-03-30 22:12   ` Logan Gunthorpe
  (?)
  (?)
@ 2017-04-04 10:40     ` Sagi Grimberg
  -1 siblings, 0 replies; 545+ messages in thread
From: Sagi Grimberg @ 2017-04-04 10:40 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

Hey Logan,

> We create a configfs attribute in each nvme-fabrics target port to
> enable p2p memory use. When enabled, the port will only then use the
> p2p memory if a p2p memory device can be found which is behind the
> same switch as the RDMA port and all the block devices in use. If
> the user enabled it and no devices are found, then the system will
> silently fall back on using regular memory.

What should we do if we have more than a single device that satisfies
this? I'd say that it would be better to have the user ask for a
specific device and fail it if it doesn't meet the above conditions...
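
Something like this, perhaps (purely a sketch: p2pmem_find_by_name()
and the port->p2pmem_dev field are invented for illustration; only the
configfs store signature and to_nvmet_port() are real):

    static ssize_t nvmet_p2pmem_dev_store(struct config_item *item,
                                          const char *page, size_t count)
    {
            struct nvmet_port *port = to_nvmet_port(item);
            struct p2pmem_dev *p;

            /* Real code would strip the trailing newline from 'page'. */
            p = p2pmem_find_by_name(page);          /* invented helper */
            if (!p)
                    return -ENODEV;                 /* fail hard, no fallback */

            port->p2pmem_dev = p;                   /* invented field */
            return count;
    }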

> If appropriate, that port will allocate memory for the RDMA buffers
> for queues from the p2pmem device falling back to system memory should
> anything fail.

That's good :)

> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> save an extra PCI transfer as the NVME card could just take the data
> out of its own memory. However, at this time, cards with CMB buffers
> don't seem to be available.

Even if it were available, it would be hard to make real use of this
given that we wouldn't know how to pre-post recv buffers (for in-capsule
data). But let's leave this out of the scope entirely...

> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index ecc4fe8..7fd4840 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -23,6 +23,7 @@
>  #include <linux/string.h>
>  #include <linux/wait.h>
>  #include <linux/inet.h>
> +#include <linux/p2pmem.h>
>  #include <asm/unaligned.h>
>
>  #include <rdma/ib_verbs.h>
> @@ -64,6 +65,7 @@ struct nvmet_rdma_rsp {
>  	struct rdma_rw_ctx	rw;
>
>  	struct nvmet_req	req;
> +	struct p2pmem_dev       *p2pmem;

Why do you need this? You have a reference to the
queue itself.

> @@ -107,6 +109,8 @@ struct nvmet_rdma_queue {
>  	int			send_queue_size;
>
>  	struct list_head	queue_list;
> +
> +	struct p2pmem_dev	*p2pmem;
>  };
>
>  struct nvmet_rdma_device {
> @@ -185,7 +189,8 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
>  	spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
>  }
>
> -static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
> +static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents,
> +				struct p2pmem_dev *p2pmem)
>  {
>  	struct scatterlist *sg;
>  	int count;
> @@ -193,13 +198,17 @@ static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
>  	if (!sgl || !nents)
>  		return;
>
> -	for_each_sg(sgl, sg, nents, count)
> -		__free_page(sg_page(sg));
> +	for_each_sg(sgl, sg, nents, count) {
> +		if (p2pmem)
> +			p2pmem_free_page(p2pmem, sg_page(sg));
> +		else
> +			__free_page(sg_page(sg));
> +	}
>  	kfree(sgl);
>  }
>
>  static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
> -		u32 length)
> +		u32 length, struct p2pmem_dev *p2pmem)
>  {
>  	struct scatterlist *sg;
>  	struct page *page;
> @@ -216,7 +225,11 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
>  	while (length) {
>  		u32 page_len = min_t(u32, length, PAGE_SIZE);
>
> -		page = alloc_page(GFP_KERNEL);
> +		if (p2pmem)
> +			page = p2pmem_alloc_page(p2pmem);
> +		else
> +			page = alloc_page(GFP_KERNEL);
> +
>  		if (!page)
>  			goto out_free_pages;
>
> @@ -231,7 +244,10 @@ static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
>  out_free_pages:
>  	while (i > 0) {
>  		i--;
> -		__free_page(sg_page(&sg[i]));
> +		if (p2pmem)
> +			p2pmem_free_page(p2pmem, sg_page(&sg[i]));
> +		else
> +			__free_page(sg_page(&sg[i]));
>  	}
>  	kfree(sg);
>  out:
> @@ -484,7 +500,8 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
>  	}
>
>  	if (rsp->req.sg != &rsp->cmd->inline_sg)
> -		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt);
> +		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt,
> +				    rsp->p2pmem);
>
>  	if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
>  		nvmet_rdma_process_wr_wait_list(queue);
> @@ -625,8 +642,16 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
>  	if (!len)
>  		return 0;
>
> +	rsp->p2pmem = rsp->queue->p2pmem;
>  	status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
> -			len);
> +			len, rsp->p2pmem);
> +
> +	if (status && rsp->p2pmem) {
> +		rsp->p2pmem = NULL;
> +		status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
> +					      len, rsp->p2pmem);
> +	}
> +

Not sure it's a good practice to rely on rsp->p2pmem not being NULL...
Would be nice if the allocation routines can hide it from us...
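
Something along these lines could hide the fallback inside the
allocation helper (a sketch; the p2pmem_* calls are the ones from this
patch set, the wrapper itself is invented):

    /* Try p2pmem first, transparently falling back to system memory,
     * and report which pool the page came from so the free side can
     * match it. */
    static struct page *nvmet_rdma_alloc_page(struct p2pmem_dev *p2pmem,
                                              bool *from_p2pmem)
    {
            struct page *page = NULL;

            if (p2pmem)
                    page = p2pmem_alloc_page(p2pmem);
            *from_p2pmem = (page != NULL);
            if (!page)
                    page = alloc_page(GFP_KERNEL);
            return page;
    }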

>  	if (status)
>  		return status;
>
> @@ -984,6 +1009,7 @@ static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
>  				!queue->host_qid);
>  	}
>  	nvmet_rdma_free_rsps(queue);
> +	p2pmem_put(queue->p2pmem);

What does this pair with? p2pmem_find_compat()?

>  	ida_simple_remove(&nvmet_rdma_queue_ida, queue->idx);
>  	kfree(queue);
>  }
> @@ -1179,6 +1205,52 @@ static int nvmet_rdma_cm_accept(struct rdma_cm_id *cm_id,
>  	return ret;
>  }
>
> +/*
> + * If allow_p2pmem is set, we will try to use P2P memory for our
> + * sgl lists. This requires the p2pmem device to be compatible with
> + * the backing device for every namespace this device will support.
> + * If not, we fall back on using system memory.
> + */
> +static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue *queue)
> +{
> +	struct device **dma_devs;
> +	struct nvmet_ns *ns;
> +	int ndevs = 1;
> +	int i = 0;
> +	struct nvmet_subsys_link *s;
> +
> +	if (!queue->port->allow_p2pmem)
> +		return;
> +
> +	list_for_each_entry(s, &queue->port->subsystems, entry) {
> +		list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
> +			ndevs++;
> +		}
> +	}

This code has no business in nvmet-rdma. Why not keep nr_ns in
nvmet_subsys in the first place?
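
i.e. something like this in the core (a sketch; nr_namespaces is an
assumed new field on struct nvmet_subsys):

    /* Called from nvmet_ns_enable()/nvmet_ns_disable() so nvmet-rdma
     * can read an up-to-date count instead of walking every
     * subsystem's namespace list per queue. */
    static void nvmet_subsys_ns_changed(struct nvmet_subsys *subsys,
                                        int delta)
    {
            lockdep_assert_held(&subsys->lock);
            subsys->nr_namespaces += delta;         /* assumed field */
    }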

> +
> +	dma_devs = kmalloc((ndevs + 1) * sizeof(*dma_devs), GFP_KERNEL);
> +	if (!dma_devs)
> +		return;
> +
> +	dma_devs[i++] = &queue->dev->device->dev;
> +
> +	list_for_each_entry(s, &queue->port->subsystems, entry) {
> +		list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
> +			dma_devs[i++] = disk_to_dev(ns->bdev->bd_disk);
> +		}
> +	}
> +
> +	dma_devs[i++] = NULL;
> +
> +	queue->p2pmem = p2pmem_find_compat(dma_devs);

This is a problem. Namespaces can be added at any point in time. Nothing
guarantees that dma_devs covers all the namespaces we'll ever see.

> +
> +	if (queue->p2pmem)
> +		pr_debug("using %s for rdma nvme target queue",
> +			 dev_name(&queue->p2pmem->dev));
> +
> +	kfree(dma_devs);
> +}
> +
>  static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>  		struct rdma_cm_event *event)
>  {
> @@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>  	}
>  	queue->port = cm_id->context;
>
> +	nvmet_rdma_queue_setup_p2pmem(queue);
> +

Why is all this done for each queue? Looks completely redundant to me.

>  	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>  	if (ret)
>  		goto release_queue;

You seem to have skipped the in-capsule buffers for p2pmem (inline_page);
I'm curious why?

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region
@ 2017-04-04 10:42     ` Sagi Grimberg
  0 siblings, 0 replies; 545+ messages in thread
From: Sagi Grimberg @ 2017-04-04 10:42 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme


> +static void setup_memwin_p2pmem(struct adapter *adap)
> +{
> +	unsigned int mem_base = t4_read_reg(adap, CIM_EXTMEM2_BASE_ADDR_A);
> +	unsigned int mem_size = t4_read_reg(adap, CIM_EXTMEM2_ADDR_SIZE_A);
> +
> +	if (!use_p2pmem)
> +		return;

This is weird, why even call this if !use_p2pmem?

> +static int init_p2pmem(struct adapter *adapter)
> +{
> +	unsigned int mem_size = t4_read_reg(adapter, CIM_EXTMEM2_ADDR_SIZE_A);
> +	struct p2pmem_dev *p;
> +	int rc;
> +	struct resource res;
> +
> +	if (!mem_size || !use_p2pmem)
> +		return 0;

Again, weird...

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 4/8] p2pmem: Add debugfs "stats" file
@ 2017-04-04 10:46     ` Sagi Grimberg
  0 siblings, 0 replies; 545+ messages in thread
From: Sagi Grimberg @ 2017-04-04 10:46 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme


> +	p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
> +	if (!p2pmem_debugfs_root)
> +		pr_info("could not create debugfs entry, continuing\n");
> +

Why continue? I think it'd be better to just fail it.
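
i.e. a sketch of failing the module init instead:

	p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
	if (!p2pmem_debugfs_root)
		return -ENOMEM;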

Besides, this can be safely squashed into patch 1.

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
@ 2017-04-04 10:59     ` Sagi Grimberg
  0 siblings, 0 replies; 545+ messages in thread
From: Sagi Grimberg @ 2017-04-04 10:59 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme


>  u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, const void *buf,
>  		size_t len)
>  {
> -	if (sg_pcopy_from_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
> +	bool iomem = req->p2pmem;
> +	size_t ret;
> +
> +	ret = sg_copy_buffer(req->sg, req->sg_cnt, (void *)buf, len, off,
> +			     false, iomem);
> +
> +	if (ret != len)
>  		return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
> +
>  	return 0;
>  }

We can never ever get here from an IO command, and that is a good thing
because it would have been broken if we did, regardless of what copy
method we use...

Note that the nvme completion queues are still on the host memory, so
this means we have lost the ordering between data and completions as
they go to different pcie targets.

If at all, this is the place to *emphasize* we must never get here
with p2pmem, and immediately fail if we do.

I'm not sure what will happen with copy_from_sgl, I guess we
have the same race because the nvme submission queues are also
on the host memory (which is on a different pci target). Maybe
more likely to happen with write-combine enabled?

Anyway I don't think we have a real issue here *currently*, because
we use copy_to_sgl only for admin/fabrics commands emulation and
copy_from_sgl to setup dsm ranges...
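
Something along these lines (a sketch) at the top of the copy helpers
would make that explicit:

	/* p2pmem-backed SGLs must never be touched by the copy helpers */
	if (WARN_ON_ONCE(req->p2pmem))
		return NVME_SC_INTERNAL | NVME_SC_DNR;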

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
@ 2017-04-04 15:46         ` Jason Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Jason Gunthorpe @ 2017-04-04 15:46 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Logan Gunthorpe, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel

On Tue, Apr 04, 2017 at 01:59:26PM +0300, Sagi Grimberg wrote:
> Note that the nvme completion queues are still on the host memory, so
> this means we have lost the ordering between data and completions as
> they go to different pcie targets.

Hmm, in this simple up/down case with a switch, I think it might
actually be OK.

Transactions might not complete at the NVMe device before the CPU
processes the RDMA completion, however due to the PCI-E ordering rules
new TLPs directed to the NVMe will complete after the RDMA TLPs and
thus observe the new data (e.g. order preserving).

It would be very hard to use P2P if fabric ordering is not preserved..

Jason

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region
@ 2017-04-04 15:56       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-04 15:56 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme



On 04/04/17 04:42 AM, Sagi Grimberg wrote:
> This is weird, why even call this if !use_p2pmem?

I personally find it cleaner than:

if (use_p2pmem)
	setup_memwin_p2pmem(...)

I'm not sure why that's so weird.

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
@ 2017-04-04 16:16       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-04 16:16 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme, Sinan Kaya



On 04/04/17 04:40 AM, Sagi Grimberg wrote:
> Hey Logan,
> 
>> We create a configfs attribute in each nvme-fabrics target port to
>> enable p2p memory use. When enabled, the port will only then use the
>> p2p memory if a p2p memory device can be found which is behind the
>> same switch as the RDMA port and all the block devices in use. If
>> the user enabled it and no devices are found, then the system will
>> silently fall back on using regular memory.
> 
> What should we do if we have more than a single device that satisfies
> this? I'd say that it would be better to have the user ask for a
> specific device and fail it if it doesn't meet the above conditions...

I hadn't done this yet, but I think simply picking the closest device in
the tree would solve the issue sufficiently. However, I originally had it
so the
user has to pick the device and I prefer that approach. But if the user
picks the device, then why bother restricting what he picks? Per the
thread with Sinan, I'd prefer to use what the user picks. You were one
of the biggest opponents to that so I'd like to hear your opinion on
removing the restrictions.

>> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
>> save an extra PCI transfer as the NVME card could just take the data
>> out of its own memory. However, at this time, cards with CMB buffers
>> don't seem to be available.
> 
> Even if it was available, it would be hard to make real use of this
> given that we wouldn't know how to pre-post recv buffers (for in-capsule
> data). But let's leave this out of the scope entirely...

I don't understand what you're referring to. We'd simply use the CMB
buffer as a p2pmem device; why does that change anything?

> Why do you need this? you have a reference to the
> queue itself.

This keeps track of whether the response was actually allocated with
p2pmem or not. It's needed when we free the SGL: the queue may have a
p2pmem device assigned to it, but if the alloc failed and fell back on
system memory, we need to know how to free it. I'm currently looking at
having SGLs carry an iomem flag. In which case, this would no longer be
needed as the flag in the SGL could be used.


>> +    rsp->p2pmem = rsp->queue->p2pmem;
>>      status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> -            len);
>> +            len, rsp->p2pmem);
>> +
>> +    if (status && rsp->p2pmem) {
>> +        rsp->p2pmem = NULL;
>> +        status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> +                          len, rsp->p2pmem);
>> +    }
>> +
> 
> Not sure it's a good practice to rely on rsp->p2pmem not being NULL...
> Would be nice if the allocation routines can hide it from us...

I'm not sure what the reasoning is behind your NULL comment.

Yes, I'm currently considering pushing an alloc/free sgl into the p2pmem
code.
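
Roughly (a sketch of a hypothetical interface):

	struct scatterlist *p2pmem_alloc_sgl(struct p2pmem_dev *p,
					     int *nents, u32 length);
	void p2pmem_free_sgl(struct p2pmem_dev *p, struct scatterlist *sg,
			     int nents);

with p == NULL meaning plain system memory, so the fallback logic would
live in one place.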

>>      if (status)
>>          return status;
>>
>> @@ -984,6 +1009,7 @@ static void nvmet_rdma_free_queue(struct
>> nvmet_rdma_queue *queue)
>>                  !queue->host_qid);
>>      }
>>      nvmet_rdma_free_rsps(queue);
>> +    p2pmem_put(queue->p2pmem);
> 
> What does this pair with? p2pmem_find_compat()?

Yes, that's correct.


>> +static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue
>> *queue)
>> +{
>> +    struct device **dma_devs;
>> +    struct nvmet_ns *ns;
>> +    int ndevs = 1;
>> +    int i = 0;
>> +    struct nvmet_subsys_link *s;
>> +
>> +    if (!queue->port->allow_p2pmem)
>> +        return;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            ndevs++;
>> +        }
>> +    }
> 
> This code has no business in nvmet-rdma. Why not keep nr_ns in
> nvmet_subsys in the first place?

That makes sense.

>> +
>> +    dma_devs = kmalloc((ndevs + 1) * sizeof(*dma_devs), GFP_KERNEL);
>> +    if (!dma_devs)
>> +        return;
>> +
>> +    dma_devs[i++] = &queue->dev->device->dev;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            dma_devs[i++] = disk_to_dev(ns->bdev->bd_disk);
>> +        }
>> +    }
>> +
>> +    dma_devs[i++] = NULL;
>> +
>> +    queue->p2pmem = p2pmem_find_compat(dma_devs);
> 
> This is a problem. Namespaces can be added at any point in time. No one
> guarantees that dma_devs covers all the namespaces we'll ever see.

Yeah, well restricting p2pmem based on all the devices in use is hard.
So we'd need a call into the transport every time an ns is added and
we'd have to drop the p2pmem if they add one that isn't supported. This
complexity is just one of the reasons I prefer just letting the user choose.
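
i.e. something like a new op (a sketch; hypothetical, nothing like this
exists in the series):

	/* in struct nvmet_fabrics_ops */
	void (*add_namespace)(struct nvmet_port *port, struct nvmet_ns *ns);

which the core would invoke from nvmet_ns_enable() for every enabled
port, and which nvmet-rdma would use to re-check compatibility and drop
queue->p2pmem when a new namespace doesn't qualify.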

>> +
>> +    if (queue->p2pmem)
>> +        pr_debug("using %s for rdma nvme target queue",
>> +             dev_name(&queue->p2pmem->dev));
>> +
>> +    kfree(dma_devs);
>> +}
>> +
>>  static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>          struct rdma_cm_event *event)
>>  {
>> @@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct
>> rdma_cm_id *cm_id,
>>      }
>>      queue->port = cm_id->context;
>>
>> +    nvmet_rdma_queue_setup_p2pmem(queue);
>> +
> 
> Why is all this done for each queue? looks completely redundant to me.

A little bit. Where would you put it?

>>      ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>      if (ret)
>>          goto release_queue;
> 
> You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
> curious why?

Yes, the thinking was that these transfers were small anyway so there
would not be significant benefit to pushing them through p2pmem. There's
really no reason why we couldn't do that if it made sense to, though.

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
@ 2017-04-04 16:16       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-04 16:16 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sinan Kaya



On 04/04/17 04:40 AM, Sagi Grimberg wrote:
> Hey Logan,
> 
>> We create a configfs attribute in each nvme-fabrics target port to
>> enable p2p memory use. When enabled, the port will only then use the
>> p2p memory if a p2p memory device can be found which is behind the
>> same switch as the RDMA port and all the block devices in use. If
>> the user enabled it an no devices are found, then the system will
>> silently fall back on using regular memory.
> 
> What should we do if we have more than a single device that satisfies
> this? I'd say that it would be better to have the user ask for a
> specific device and fail it if it doesn't meet the above conditions...

I hadn't done this yet but I think a simple closest device in the tree
would solve the issue sufficiently. However, I originally had it so the
user has to pick the device and I prefer that approach. But if the user
picks the device, then why bother restricting what he picks? Per the
thread with Sinan, I'd prefer to use what the user picks. You were one
of the biggest opponents to that so I'd like to hear your opinion on
removing the restrictions.

>> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
>> save an extra PCI transfer as the NVME card could just take the data
>> out of it's own memory. However, at this time, cards with CMB buffers
>> don't seem to be available.
> 
> Even if it was available, it would be hard to make real use of this
> given that we wouldn't know how to pre-post recv buffers (for in-capsule
> data). But let's leave this out of the scope entirely...

I don't understand what you're referring to. We'd simply use the CMB
buffer as a p2pmem device, why does that change anything?

> Why do you need this? you have a reference to the
> queue itself.

This keeps track of whether the response was actually allocated with
p2pmem or not. It's needed for when we free the SGL because the queue
may have a p2pmem device assigned to it but, if the alloc failed and it
fell back on system memory then we need to know how to free it. I'm
currently looking at having SGLs having an iomem flag. In which case,
this would no longer be needed as the flag in the SGL could be used.


>> +    rsp->p2pmem = rsp->queue->p2pmem;
>>      status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> -            len);
>> +            len, rsp->p2pmem);
>> +
>> +    if (status && rsp->p2pmem) {
>> +        rsp->p2pmem = NULL;
>> +        status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> +                          len, rsp->p2pmem);
>> +    }
>> +
> 
> Not sure its a good practice to rely on rsp->p2pmem not being NULL...
> Would be nice if the allocation routines can hide it from us...

I'm not sure what the reasoning is behind your NULL comment.

Yes, I'm currently considering pushing an alloc/free sgl into the p2pmem
code.

>>      if (status)
>>          return status;
>>
>> @@ -984,6 +1009,7 @@ static void nvmet_rdma_free_queue(struct
>> nvmet_rdma_queue *queue)
>>                  !queue->host_qid);
>>      }
>>      nvmet_rdma_free_rsps(queue);
>> +    p2pmem_put(queue->p2pmem);
> 
> What does this pair with? p2pmem_find_compat()?

Yes, that's correct.


>> +static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue
>> *queue)
>> +{
>> +    struct device **dma_devs;
>> +    struct nvmet_ns *ns;
>> +    int ndevs = 1;
>> +    int i = 0;
>> +    struct nvmet_subsys_link *s;
>> +
>> +    if (!queue->port->allow_p2pmem)
>> +        return;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            ndevs++;
>> +        }
>> +    }
> 
> This code has no business in nvmet-rdma. Why not keep nr_ns in
> nvmet_subsys in the first place?

That makes sense.

>> +
>> +    dma_devs = kmalloc((ndevs + 1) * sizeof(*dma_devs), GFP_KERNEL);
>> +    if (!dma_devs)
>> +        return;
>> +
>> +    dma_devs[i++] = &queue->dev->device->dev;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            dma_devs[i++] = disk_to_dev(ns->bdev->bd_disk);
>> +        }
>> +    }
>> +
>> +    dma_devs[i++] = NULL;
>> +
>> +    queue->p2pmem = p2pmem_find_compat(dma_devs);
> 
> This is a problem. namespaces can be added at any point in time. No one
> guarantee that dma_devs are all the namepaces we'll ever see.

Yeah, well restricting p2pmem based on all the devices in use is hard.
So we'd need a call into the transport every time an ns is added and
we'd have to drop the p2pmem if they add one that isn't supported. This
complexity is just one of the reasons I prefer just letting the user chose.

>> +
>> +    if (queue->p2pmem)
>> +        pr_debug("using %s for rdma nvme target queue",
>> +             dev_name(&queue->p2pmem->dev));
>> +
>> +    kfree(dma_devs);
>> +}
>> +
>>  static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>          struct rdma_cm_event *event)
>>  {
>> @@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct
>> rdma_cm_id *cm_id,
>>      }
>>      queue->port = cm_id->context;
>>
>> +    nvmet_rdma_queue_setup_p2pmem(queue);
>> +
> 
> Why is all this done for each queue? looks completely redundant to me.

A little bit. Where would you put it?

>>      ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>      if (ret)
>>          goto release_queue;
> 
> You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
> curious why?

Yes, the thinking was that these transfers were small anyway so there
would not be significant benefit to pushing them through p2pmem. There's
really no reason why we couldn't do that if it made sense to though.

Logan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
@ 2017-04-04 16:16       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-04 16:16 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Sinan Kaya



On 04/04/17 04:40 AM, Sagi Grimberg wrote:
> Hey Logan,
> 
>> We create a configfs attribute in each nvme-fabrics target port to
>> enable p2p memory use. When enabled, the port will only then use the
>> p2p memory if a p2p memory device can be found which is behind the
>> same switch as the RDMA port and all the block devices in use. If
>> the user enabled it an no devices are found, then the system will
>> silently fall back on using regular memory.
> 
> What should we do if we have more than a single device that satisfies
> this? I'd say that it would be better to have the user ask for a
> specific device and fail it if it doesn't meet the above conditions...

I hadn't done this yet but I think a simple closest device in the tree
would solve the issue sufficiently. However, I originally had it so the
user has to pick the device and I prefer that approach. But if the user
picks the device, then why bother restricting what he picks? Per the
thread with Sinan, I'd prefer to use what the user picks. You were one
of the biggest opponents to that so I'd like to hear your opinion on
removing the restrictions.

>> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
>> save an extra PCI transfer as the NVME card could just take the data
>> out of it's own memory. However, at this time, cards with CMB buffers
>> don't seem to be available.
> 
> Even if it was available, it would be hard to make real use of this
> given that we wouldn't know how to pre-post recv buffers (for in-capsule
> data). But let's leave this out of the scope entirely...

I don't understand what you're referring to. We'd simply use the CMB
buffer as a p2pmem device, why does that change anything?

> Why do you need this? you have a reference to the
> queue itself.

This keeps track of whether the response was actually allocated with
p2pmem or not. It's needed for when we free the SGL because the queue
may have a p2pmem device assigned to it but, if the alloc failed and it
fell back on system memory then we need to know how to free it. I'm
currently looking at having SGLs having an iomem flag. In which case,
this would no longer be needed as the flag in the SGL could be used.


>> +    rsp->p2pmem = rsp->queue->p2pmem;
>>      status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> -            len);
>> +            len, rsp->p2pmem);
>> +
>> +    if (status && rsp->p2pmem) {
>> +        rsp->p2pmem = NULL;
>> +        status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> +                          len, rsp->p2pmem);
>> +    }
>> +
> 
> Not sure its a good practice to rely on rsp->p2pmem not being NULL...
> Would be nice if the allocation routines can hide it from us...

I'm not sure what the reasoning is behind your NULL comment.

Yes, I'm currently considering pushing an alloc/free sgl into the p2pmem
code.

>>      if (status)
>>          return status;
>>
>> @@ -984,6 +1009,7 @@ static void nvmet_rdma_free_queue(struct
>> nvmet_rdma_queue *queue)
>>                  !queue->host_qid);
>>      }
>>      nvmet_rdma_free_rsps(queue);
>> +    p2pmem_put(queue->p2pmem);
> 
> What does this pair with? p2pmem_find_compat()?

Yes, that's correct.


>> +static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue
>> *queue)
>> +{
>> +    struct device **dma_devs;
>> +    struct nvmet_ns *ns;
>> +    int ndevs = 1;
>> +    int i = 0;
>> +    struct nvmet_subsys_link *s;
>> +
>> +    if (!queue->port->allow_p2pmem)
>> +        return;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            ndevs++;
>> +        }
>> +    }
> 
> This code has no business in nvmet-rdma. Why not keep nr_ns in
> nvmet_subsys in the first place?

That makes sense.

>> +
>> +    dma_devs = kmalloc((ndevs + 1) * sizeof(*dma_devs), GFP_KERNEL);
>> +    if (!dma_devs)
>> +        return;
>> +
>> +    dma_devs[i++] = &queue->dev->device->dev;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            dma_devs[i++] = disk_to_dev(ns->bdev->bd_disk);
>> +        }
>> +    }
>> +
>> +    dma_devs[i++] = NULL;
>> +
>> +    queue->p2pmem = p2pmem_find_compat(dma_devs);
> 
> This is a problem. namespaces can be added at any point in time. No one
> guarantee that dma_devs are all the namepaces we'll ever see.

Yeah, well restricting p2pmem based on all the devices in use is hard.
So we'd need a call into the transport every time an ns is added and
we'd have to drop the p2pmem if they add one that isn't supported. This
complexity is just one of the reasons I prefer just letting the user chose.

>> +
>> +    if (queue->p2pmem)
>> +        pr_debug("using %s for rdma nvme target queue",
>> +             dev_name(&queue->p2pmem->dev));
>> +
>> +    kfree(dma_devs);
>> +}
>> +
>>  static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>          struct rdma_cm_event *event)
>>  {
>> @@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct
>> rdma_cm_id *cm_id,
>>      }
>>      queue->port = cm_id->context;
>>
>> +    nvmet_rdma_queue_setup_p2pmem(queue);
>> +
> 
> Why is all this done for each queue? looks completely redundant to me.

A little bit. Where would you put it?

>>      ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>      if (ret)
>>          goto release_queue;
> 
> You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
> curious why?

Yes, the thinking was that these transfers were small anyway so there
would not be significant benefit to pushing them through p2pmem. There's
really no reason why we couldn't do that if it made sense to though.

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
@ 2017-04-04 16:16       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-04 16:16 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Sinan Kaya



On 04/04/17 04:40 AM, Sagi Grimberg wrote:
> Hey Logan,
> 
>> We create a configfs attribute in each nvme-fabrics target port to
>> enable p2p memory use. When enabled, the port will only then use the
>> p2p memory if a p2p memory device can be found which is behind the
>> same switch as the RDMA port and all the block devices in use. If
>> the user enabled it an no devices are found, then the system will
>> silently fall back on using regular memory.
> 
> What should we do if we have more than a single device that satisfies
> this? I'd say that it would be better to have the user ask for a
> specific device and fail it if it doesn't meet the above conditions...

I hadn't done this yet but I think a simple closest device in the tree
would solve the issue sufficiently. However, I originally had it so the
user has to pick the device and I prefer that approach. But if the user
picks the device, then why bother restricting what he picks? Per the
thread with Sinan, I'd prefer to use what the user picks. You were one
of the biggest opponents to that so I'd like to hear your opinion on
removing the restrictions.

>> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
>> save an extra PCI transfer as the NVME card could just take the data
>> out of it's own memory. However, at this time, cards with CMB buffers
>> don't seem to be available.
> 
> Even if it was available, it would be hard to make real use of this
> given that we wouldn't know how to pre-post recv buffers (for in-capsule
> data). But let's leave this out of the scope entirely...

I don't understand what you're referring to. We'd simply use the CMB
buffer as a p2pmem device, why does that change anything?

> Why do you need this? you have a reference to the
> queue itself.

This keeps track of whether the response was actually allocated with
p2pmem or not. It's needed for when we free the SGL because the queue
may have a p2pmem device assigned to it but, if the alloc failed and it
fell back on system memory then we need to know how to free it. I'm
currently looking at having SGLs having an iomem flag. In which case,
this would no longer be needed as the flag in the SGL could be used.


>> +    rsp->p2pmem = rsp->queue->p2pmem;
>>      status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> -            len);
>> +            len, rsp->p2pmem);
>> +
>> +    if (status && rsp->p2pmem) {
>> +        rsp->p2pmem = NULL;
>> +        status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> +                          len, rsp->p2pmem);
>> +    }
>> +
> 
> Not sure its a good practice to rely on rsp->p2pmem not being NULL...
> Would be nice if the allocation routines can hide it from us...

I'm not sure what the reasoning is behind your NULL comment.

Yes, I'm currently considering pushing an alloc/free sgl into the p2pmem
code.

>>      if (status)
>>          return status;
>>
>> @@ -984,6 +1009,7 @@ static void nvmet_rdma_free_queue(struct
>> nvmet_rdma_queue *queue)
>>                  !queue->host_qid);
>>      }
>>      nvmet_rdma_free_rsps(queue);
>> +    p2pmem_put(queue->p2pmem);
> 
> What does this pair with? p2pmem_find_compat()?

Yes, that's correct.


>> +static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue
>> *queue)
>> +{
>> +    struct device **dma_devs;
>> +    struct nvmet_ns *ns;
>> +    int ndevs = 1;
>> +    int i = 0;
>> +    struct nvmet_subsys_link *s;
>> +
>> +    if (!queue->port->allow_p2pmem)
>> +        return;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            ndevs++;
>> +        }
>> +    }
> 
> This code has no business in nvmet-rdma. Why not keep nr_ns in
> nvmet_subsys in the first place?

That makes sense.

>> +
>> +    dma_devs = kmalloc((ndevs + 1) * sizeof(*dma_devs), GFP_KERNEL);
>> +    if (!dma_devs)
>> +        return;
>> +
>> +    dma_devs[i++] = &queue->dev->device->dev;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            dma_devs[i++] = disk_to_dev(ns->bdev->bd_disk);
>> +        }
>> +    }
>> +
>> +    dma_devs[i++] = NULL;
>> +
>> +    queue->p2pmem = p2pmem_find_compat(dma_devs);
> 
> This is a problem. namespaces can be added at any point in time. No one
> guarantee that dma_devs are all the namepaces we'll ever see.

Yeah, well restricting p2pmem based on all the devices in use is hard.
So we'd need a call into the transport every time an ns is added and
we'd have to drop the p2pmem if they add one that isn't supported. This
complexity is just one of the reasons I prefer just letting the user chose.

>> +
>> +    if (queue->p2pmem)
>> +        pr_debug("using %s for rdma nvme target queue",
>> +             dev_name(&queue->p2pmem->dev));
>> +
>> +    kfree(dma_devs);
>> +}
>> +
>>  static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>          struct rdma_cm_event *event)
>>  {
>> @@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct
>> rdma_cm_id *cm_id,
>>      }
>>      queue->port = cm_id->context;
>>
>> +    nvmet_rdma_queue_setup_p2pmem(queue);
>> +
> 
> Why is all this done for each queue? looks completely redundant to me.

A little bit. Where would you put it?

>>      ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>      if (ret)
>>          goto release_queue;
> 
> You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
> curious why?

Yes, the thinking was that these transfers were small anyway so there
would not be significant benefit to pushing them through p2pmem. There's
really no reason why we couldn't do that if it made sense to though.

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* [RFC 3/8] nvmet: Use p2pmem in nvme target
@ 2017-04-04 16:16       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-04 16:16 UTC (permalink / raw)




On 04/04/17 04:40 AM, Sagi Grimberg wrote:
> Hey Logan,
> 
>> We create a configfs attribute in each nvme-fabrics target port to
>> enable p2p memory use. When enabled, the port will only then use the
>> p2p memory if a p2p memory device can be found which is behind the
>> same switch as the RDMA port and all the block devices in use. If
>> the user enabled it an no devices are found, then the system will
>> silently fall back on using regular memory.
> 
> What should we do if we have more than a single device that satisfies
> this? I'd say that it would be better to have the user ask for a
> specific device and fail it if it doesn't meet the above conditions...

I hadn't done this yet but I think a simple closest device in the tree
would solve the issue sufficiently. However, I originally had it so the
user has to pick the device and I prefer that approach. But if the user
picks the device, then why bother restricting what he picks? Per the
thread with Sinan, I'd prefer to use what the user picks. You were one
of the biggest opponents to that so I'd like to hear your opinion on
removing the restrictions.

>> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
>> save an extra PCI transfer as the NVME card could just take the data
>> out of it's own memory. However, at this time, cards with CMB buffers
>> don't seem to be available.
> 
> Even if it was available, it would be hard to make real use of this
> given that we wouldn't know how to pre-post recv buffers (for in-capsule
> data). But let's leave this out of the scope entirely...

I don't understand what you're referring to. We'd simply use the CMB
buffer as a p2pmem device, why does that change anything?

> Why do you need this? you have a reference to the
> queue itself.

This keeps track of whether the response was actually allocated with
p2pmem or not. It's needed for when we free the SGL because the queue
may have a p2pmem device assigned to it but, if the alloc failed and it
fell back on system memory then we need to know how to free it. I'm
currently looking at having SGLs having an iomem flag. In which case,
this would no longer be needed as the flag in the SGL could be used.


>> +    rsp->p2pmem = rsp->queue->p2pmem;
>>      status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> -            len);
>> +            len, rsp->p2pmem);
>> +
>> +    if (status && rsp->p2pmem) {
>> +        rsp->p2pmem = NULL;
>> +        status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
>> +                          len, rsp->p2pmem);
>> +    }
>> +
> 
> Not sure its a good practice to rely on rsp->p2pmem not being NULL...
> Would be nice if the allocation routines can hide it from us...

I'm not sure what the reasoning is behind your NULL comment.

Yes, I'm currently considering pushing an alloc/free sgl into the p2pmem
code.

>>      if (status)
>>          return status;
>>
>> @@ -984,6 +1009,7 @@ static void nvmet_rdma_free_queue(struct
>> nvmet_rdma_queue *queue)
>>                  !queue->host_qid);
>>      }
>>      nvmet_rdma_free_rsps(queue);
>> +    p2pmem_put(queue->p2pmem);
> 
> What does this pair with? p2pmem_find_compat()?

Yes, that's correct.


>> +static void nvmet_rdma_queue_setup_p2pmem(struct nvmet_rdma_queue
>> *queue)
>> +{
>> +    struct device **dma_devs;
>> +    struct nvmet_ns *ns;
>> +    int ndevs = 1;
>> +    int i = 0;
>> +    struct nvmet_subsys_link *s;
>> +
>> +    if (!queue->port->allow_p2pmem)
>> +        return;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            ndevs++;
>> +        }
>> +    }
> 
> This code has no business in nvmet-rdma. Why not keep nr_ns in
> nvmet_subsys in the first place?

That makes sense.

>> +
>> +    dma_devs = kmalloc((ndevs + 1) * sizeof(*dma_devs), GFP_KERNEL);
>> +    if (!dma_devs)
>> +        return;
>> +
>> +    dma_devs[i++] = &queue->dev->device->dev;
>> +
>> +    list_for_each_entry(s, &queue->port->subsystems, entry) {
>> +        list_for_each_entry_rcu(ns, &s->subsys->namespaces, dev_link) {
>> +            dma_devs[i++] = disk_to_dev(ns->bdev->bd_disk);
>> +        }
>> +    }
>> +
>> +    dma_devs[i++] = NULL;
>> +
>> +    queue->p2pmem = p2pmem_find_compat(dma_devs);
> 
> This is a problem. namespaces can be added at any point in time. No one
> guarantee that dma_devs are all the namepaces we'll ever see.

Yeah, well restricting p2pmem based on all the devices in use is hard.
So we'd need a call into the transport every time an ns is added and
we'd have to drop the p2pmem if they add one that isn't supported. This
complexity is just one of the reasons I prefer just letting the user chose.

>> +
>> +    if (queue->p2pmem)
>> +        pr_debug("using %s for rdma nvme target queue",
>> +             dev_name(&queue->p2pmem->dev));
>> +
>> +    kfree(dma_devs);
>> +}
>> +
>>  static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>          struct rdma_cm_event *event)
>>  {
>> @@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct
>> rdma_cm_id *cm_id,
>>      }
>>      queue->port = cm_id->context;
>>
>> +    nvmet_rdma_queue_setup_p2pmem(queue);
>> +
> 
> Why is all this done for each queue? looks completely redundant to me.

A little bit. Where would you put it?

>>      ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>      if (ret)
>>          goto release_queue;
> 
> You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
> curious why?

Yes, the thinking was that these transfers were small anyway so there
would not be significant benefit to pushing them through p2pmem. There's
really no reason why we couldn't do that if it made sense to, though.

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
@ 2017-04-04 17:21             ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-04 17:21 UTC (permalink / raw)
  To: Jason Gunthorpe, Sagi Grimberg
  Cc: Christoph Hellwig, James E.J. Bottomley, Martin K. Petersen,
	Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy,
	Dan Williams, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel



On 04/04/17 04:59 AM, Sagi Grimberg wrote:
> We can never ever get here from an IO command, and that is a good thing
> because it would have been broken if we did, regardless of what copy
> method we use...

Yes, I changed this mostly for admin commands. I did notice connect
commands do end up reading from the p2pmem and this patchset correctly
switches it to an iomem copy. However, based on Christoph's comment, I
hope to make it more general such that iomem is hidden within SGLs and
any access will either be correct or create a warning.
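
Roughly, copies would then go through something like this (the flag and
helper are invented for this sketch):

	static void nvmet_copy_from_sg(void *dst, struct scatterlist *sg,
				       size_t len)
	{
		if (sg_is_iomem(sg))	/* hypothetical per-SGL flag */
			memcpy_fromio(dst, (void __iomem *)sg_virt(sg), len);
		else
			memcpy(dst, sg_virt(sg), len);
	}

with a WARN in any path that touches an iomem SGL without going through
such a helper.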


On 04/04/17 09:46 AM, Jason Gunthorpe wrote:
> Transactions might not complete at the NVMe device before the CPU
> processes the RDMA completion, however due to the PCI-E ordering rules
> new TLPs directed to the NVMe will complete after the RMDA TLPs and
> thus observe the new data. (eg order preserving)
> 
> It would be very hard to use P2P if fabric ordering is not preserved..

Yes, my understanding is the same, the PCI-E ordering rules save us here.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 4/8] p2pmem: Add debugfs "stats" file
@ 2017-04-04 17:25       ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-04 17:25 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme



On 04/04/17 04:46 AM, Sagi Grimberg wrote:
> 
>> +    p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
>> +    if (!p2pmem_debugfs_root)
>> +        pr_info("could not create debugfs entry, continuing\n");
>> +
> 
> Why continue? I think it'd be better to just fail it.

Yup, agreed. This should probably use PTR_ERR as well.
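
i.e., something like this sketch (when debugfs is not configured in,
debugfs_create_dir() returns an ERR_PTR rather than NULL):

	p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
	if (IS_ERR_OR_NULL(p2pmem_debugfs_root))
		return p2pmem_debugfs_root ?
			PTR_ERR(p2pmem_debugfs_root) : -ENOMEM;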

> Besides, this can be safely squashed into patch 1.

Sure, the only real reason I kept it separate was that it was authored
by Steve Wise.

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread

* RE: [RFC 2/8] cxgb4: setup pcie memory window 4 and create p2pmem region
@ 2017-04-05 15:41       ` Steve Wise
  0 siblings, 0 replies; 545+ messages in thread
From: Steve Wise @ 2017-04-05 15:41 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Logan Gunthorpe',
	'Christoph Hellwig', 'James E.J. Bottomley',
	'Martin K. Petersen', 'Jens Axboe',
	'Stephen Bates', 'Max Gurtovoy',
	'Dan Williams', 'Keith Busch',
	'Jason Gunthorpe'
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

> 
> 
> > +static void setup_memwin_p2pmem(struct adapter *adap)
> > +{
> > +	unsigned int mem_base = t4_read_reg(adap,
> CIM_EXTMEM2_BASE_ADDR_A);
> > +	unsigned int mem_size = t4_read_reg(adap,
> CIM_EXTMEM2_ADDR_SIZE_A);
> > +
> > +	if (!use_p2pmem)
> > +		return;
> 
> This is weird, why even call this if !use_p2pmem?
> 

The use_p2pmem parameter was added after the original change.  I'll
update as you suggest.
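
i.e., moving the check up to the call site, roughly:

	/* sketch: only enter the setup when the module parameter asks
	 * for it, instead of returning early inside the function */
	if (use_p2pmem)
		setup_memwin_p2pmem(adap);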

> > +static int init_p2pmem(struct adapter *adapter)
> > +{
> > +	unsigned int mem_size = t4_read_reg(adapter,
> CIM_EXTMEM2_ADDR_SIZE_A);
> > +	struct p2pmem_dev *p;
> > +	int rc;
> > +	struct resource res;
> > +
> > +	if (!mem_size || !use_p2pmem)
> > +		return 0;
> 
> Again, weird...

Yup.


^ permalink raw reply	[flat|nested] 545+ messages in thread

* RE: [RFC 4/8] p2pmem: Add debugfs "stats" file
@ 2017-04-05 15:43       ` Steve Wise
  0 siblings, 0 replies; 545+ messages in thread
From: Steve Wise @ 2017-04-05 15:43 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Logan Gunthorpe',
	'Christoph Hellwig', 'James E.J. Bottomley',
	'Martin K. Petersen', 'Jens Axboe',
	'Stephen Bates', 'Max Gurtovoy',
	'Dan Williams', 'Keith Busch',
	'Jason Gunthorpe'
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme

> 
> > +	p2pmem_debugfs_root = debugfs_create_dir("p2pmem", NULL);
> > +	if (!p2pmem_debugfs_root)
> > +		pr_info("could not create debugfs entry, continuing\n");
> > +
> 
> Why continue? I think it'd be better to just fail it.
> 

Because not having debugfs support isn't fatal to using p2pmem, I
believe it is better to continue.  But this is trivial, IMO, so either
way is ok with me.
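
i.e., later debugfs setup would just skip itself, roughly (sketch only;
the field name is illustrative):

	static void p2pmem_debugfs_setup(struct p2pmem_dev *p)
	{
		if (!p2pmem_debugfs_root)
			return;	/* no debugfs: no stats, but still usable */

		p->debugfs_root = debugfs_create_dir(dev_name(&p->dev),
						     p2pmem_debugfs_root);
	}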

> Besides, this can be safely squashed into patch 1.

Yes.


^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
@ 2017-04-06  5:33             ` Sagi Grimberg
  0 siblings, 0 replies; 545+ messages in thread
From: Sagi Grimberg @ 2017-04-06  5:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel


>> Note that the nvme completion queues are still on the host memory, so
>> this means we have lost the ordering between data and completions as
>> they go to different pcie targets.
>
> Hmm, in this simple up/down case with a switch, I think it might
> actually be OK.
>
> Transactions might not complete at the NVMe device before the CPU
> processes the RDMA completion, however due to the PCI-E ordering rules
> new TLPs directed to the NVMe will complete after the RMDA TLPs and
> thus observe the new data. (eg order preserving)
>
> It would be very hard to use P2P if fabric ordering is not preserved..

I think it still can race if the p2p device is connected with more than
a single port to the switch.

Say it's connected via 2 legs: the BAR is accessed from leg A and the
data from the disk comes via leg B. In this case, the data is heading
towards the p2p device via leg B (which might be congested), the
completion goes directly to the RC, and then the host issues a read from
the BAR via leg A. I don't understand what guarantees ordering here.

Stephen told me that this still guarantees ordering, but I honestly
can't understand how, perhaps someone can explain to me in a simple
way that I can understand.

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
@ 2017-04-06  5:47         ` Sagi Grimberg
  0 siblings, 0 replies; 545+ messages in thread
From: Sagi Grimberg @ 2017-04-06  5:47 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme, Sinan Kaya


> I hadn't done this yet but I think a simple closest device in the tree
> would solve the issue sufficiently. However, I originally had it so the
> user has to pick the device and I prefer that approach. But if the user
> picks the device, then why bother restricting what he picks?

Because the user can get it wrong, and it's our job to do what we can in
order to prevent the user from screwing themselves.

> Per the
> thread with Sinan, I'd prefer to use what the user picks. You were one
> of the biggest opponents to that so I'd like to hear your opinion on
> removing the restrictions.

I wasn't against it that much, I'm all for making things "just work"
with minimal configuration steps, but I'm not sure we can get it
right without it.

>>> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
>>> save an extra PCI transfer as the NVME card could just take the data
>>> out of it's own memory. However, at this time, cards with CMB buffers
>>> don't seem to be available.
>>
>> Even if it was available, it would be hard to make real use of this
>> given that we wouldn't know how to pre-post recv buffers (for in-capsule
>> data). But let's leave this out of the scope entirely...
>
> I don't understand what you're referring to. We'd simply use the CMB
> buffer as a p2pmem device, why does that change anything?

I'm referring to the in-capsule data buffer pre-posts that we do.
Because we prepare a buffer that would contain in-capsule data before we
know which device the incoming I/O is directed to, we can (and will)
have I/O where the data lies in the CMB of device A but is really
targeted at device B - which sorta defeats the purpose of what we're
trying to optimize here...

>> Why do you need this? you have a reference to the
>> queue itself.
>
> This keeps track of whether the response was actually allocated with
> p2pmem or not. It's needed for when we free the SGL because the queue
> may have a p2pmem device assigned to it but, if the alloc failed and it
> fell back on system memory then we need to know how to free it. I'm
> currently looking at having SGLs having an iomem flag. In which case,
> this would no longer be needed as the flag in the SGL could be used.

That would be better, maybe...

[...]

>> This is a problem. namespaces can be added at any point in time. No one
>> guarantee that dma_devs are all the namepaces we'll ever see.
>
> Yeah, well restricting p2pmem based on all the devices in use is hard.
> So we'd need a call into the transport every time an ns is added and
> we'd have to drop the p2pmem if they add one that isn't supported. This
> complexity is just one of the reasons I prefer just letting the user chose.

Still, the user can get it wrong. Not sure we can get away without
keeping track of this as new devices join the subsystem.

>>> +
>>> +    if (queue->p2pmem)
>>> +        pr_debug("using %s for rdma nvme target queue",
>>> +             dev_name(&queue->p2pmem->dev));
>>> +
>>> +    kfree(dma_devs);
>>> +}
>>> +
>>>  static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>>          struct rdma_cm_event *event)
>>>  {
>>> @@ -1199,6 +1271,8 @@ static int nvmet_rdma_queue_connect(struct
>>> rdma_cm_id *cm_id,
>>>      }
>>>      queue->port = cm_id->context;
>>>
>>> +    nvmet_rdma_queue_setup_p2pmem(queue);
>>> +
>>
>> Why is all this done for each queue? looks completely redundant to me.
>
> A little bit. Where would you put it?

I think we'll need a representation of a controller in nvmet-rdma for
that. We sort of got away without it so far, but I don't think we can
anymore with this.
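
Something along these lines, as a sketch only (nvmet-rdma has no such
struct today; all names are invented):

	/* per-controller state shared by its queues, so the p2pmem
	 * lookup could happen once instead of at every queue connect */
	struct nvmet_rdma_ctrl {
		struct nvmet_port	*port;
		struct p2pmem_dev	*p2pmem;
		struct list_head	queue_list;
		struct kref		ref;
	};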

>>>      ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>>      if (ret)
>>>          goto release_queue;
>>
>> You seemed to skip the in-capsule buffers for p2pmem (inline_page), I'm
>> curious why?
>
> Yes, the thinking was that these transfers were small anyway so there
> would not be significant benefit to pushing them through p2pmem. There's
> really no reason why we couldn't do that if it made sense to though.

I don't see an urgent reason for it either. I was just curious...

^ permalink raw reply	[flat|nested] 545+ messages in thread

* Re: [RFC 3/8] nvmet: Use p2pmem in nvme target
@ 2017-04-06 15:52           ` Logan Gunthorpe
  0 siblings, 0 replies; 545+ messages in thread
From: Logan Gunthorpe @ 2017-04-06 15:52 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-scsi, linux-nvdimm, linux-rdma, linux-pci, linux-kernel,
	linux-nvme, Sinan Kaya

Hey Sagi,

On 05/04/17 11:47 PM, Sagi Grimberg wrote:
> Because the user can get it wrong, and it's our job to do what we can in
> order to prevent the user from screwing themselves.

Well, "screwing" themselves seems a bit strong. It wouldn't be much
different from a lot of other tunables in the system. For example, it
would be similar to the user choosing the wrong io scheduler for their
disk or workload. If you change this setting without measuring
performance you probably don't care too much about the result anyway.

> I wasn't against it that much, I'm all for making things "just work"
> with minimal configuration steps, but I'm not sure we can get it
> right without it.

Ok, well in that case I may reconsider this in the next series.

>>>> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
>>>> save an extra PCI transfer as the NVME card could just take the data
>>>> out of it's own memory. However, at this time, cards with CMB buffers
>>>> don't seem to be available.
>>>
>>> Even if it was available, it would be hard to make real use of this
>>> given that we wouldn't know how to pre-post recv buffers (for in-capsule
>>> data). But let's leave this out of the scope entirely...
>>
>> I don't understand what you're referring to. We'd simply use the CMB
>> buffer as a p2pmem device, why does that change anything?
> 
> I'm referring to the in-capsule data buffers pre-posts that we do.
> Because we prepare a buffer that would contain in-capsule data, we have
> no knowledge to which device the incoming I/O is directed to, which
> means we can (and will) have I/O where the data lies in CMB of device
> A but it's really targeted to device B - which sorta defeats the purpose
> of what we're trying to optimize here...

Well, the way I've had it is that each port gets one p2pmem device. So
you'd only want to put NVMe devices that will work with that p2pmem
device behind that port. Though, I can see that being a difficult
restriction, seeing as it probably means you'll need one port per NVMe
device if you want to use the CMB buffer of each device. I'll have to
think about that some. Also, it's worth noting that we aren't even
optimizing in-capsule data at this time.


> Still the user can get it wrong. Not sure we can get a way without
> keeping track of this as new devices join the subsystem.

Yeah, I understand. I'll have to think some more about all of this. I'm
starting to see some ways to improve things.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 545+ messages in thread
