Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
From: Dan Williams @ 2017-04-16 15:44 UTC
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Bjorn Helgaas, Jason Gunthorpe,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Sat, Apr 15, 2017 at 10:36 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 15/04/17 04:17 PM, Benjamin Herrenschmidt wrote:
>> You can't. If the iommu is on, everything is remapped. Or do you mean
>> to have dma_map_* not do a remapping ?
>
> Well, yes, you'd have to change the code so that iomem pages do not get
> remapped and the raw BAR address is passed to the DMA engine. I said
> specifically we haven't done this at this time but it really doesn't
> seem like an unsolvable problem. It is something we will need to address
> before a proper patch set is posted though.
>
>> That's the problem again, same as before, for that to work, the
>> dma_map_* ops would have to do something special that depends on *both*
>> the source and target device.
>
> No, I don't think you have to do things different based on the source.
> Have the p2pmem device layer restrict allocating p2pmem based on the
> devices in use (similar to how the RFC code works now) and when the dma
> mapping code sees iomem pages it just needs to leave the address alone
> so it's used directly by the dma in question.
>
> It's much better to make the decision on which memory to use when you
> allocate it. If you wait until you map it, it would be a pain to fall
> back to system memory if it doesn't look like it will work. So if, at
> allocation time, you already know everything will work, you just need
> the dma mapping layer to stay out of the way.
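
For concreteness, the pass-through being described amounts to something
like the following (a sketch only: is_p2p_page() and the wrapper name are
hypothetical, and the IOMMU and offset questions are ignored):

#include <linux/dma-mapping.h>
#include <linux/io.h>

static dma_addr_t map_page_maybe_p2p(struct device *dev, struct page *page,
				     unsigned long offset, size_t size,
				     enum dma_data_direction dir)
{
	if (is_p2p_page(page))
		/*
		 * BAR memory: hand the raw address to the DMA engine and
		 * skip the normal remapping path.  Note this assumes the
		 * CPU physical address equals the PCI bus address, i.e.
		 * no bus address offset.
		 */
		return page_to_phys(page) + offset;

	return dma_map_page(dev, page, offset, size, dir);
}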

I think we very much want the dma mapping layer to be in the way.
It's the only sane semantic we have to communicate this translation.

>
>> The dma_ops today are architecture specific and have no way to
>> differentiate between normal and those special P2P DMA pages.
>
> Correct, unless Dan's idea works (which will need some investigation),
> we'd need a flag in struct page or some other similar method to
> determine that these are special iomem pages.
>
>>> Though if it does, I'd expect
>>> everything would still work; you just wouldn't get the performance or
>>> traffic flow you are looking for. We've been testing with the software
>>> iommu which doesn't have this problem.
>>
>> So first, no, it's more than "you wouldn't get the performance". On
>> some systems it may also just not work. Also what do you mean by "the
>> SW iommu doesn't have this problem" ? It catches the fact that
>> addresses don't point to RAM and maps differently ?
>
> I haven't tested it, but I can't imagine why an iommu would not correctly
> map the memory in the BAR. But that's _way_ beside the point. We
> _really_ want to avoid that situation anyway. If the iommu maps the
> memory it defeats what we are trying to accomplish.
>
> I believe the software iommu only uses bounce buffers if the DMA engine
> in use cannot address the memory. So in most cases, with modern
> hardware, it just passes the BAR's address to the DMA engine and
> everything works. The code posted in the RFC does in fact work without
> needing to do any of this fussing.
>
>>>> The problem is that the latter while seemingly easier, is also slower
>>>> and not supported by all platforms and architectures (for example,
>>>> POWER currently won't allow it, or rather only allows a store-only
>>>> subset of it under special circumstances).
>>>
>>> Yes, I think situations where we have to cross host bridges will remain
>>> unsupported by this work for a long time. There are too many cases where
>>> it just doesn't work or it performs too poorly to be useful.
>>
>> And the situation where you don't cross bridges is the one where you
>> need to also take into account the offsets.
>
> I think for the first incarnation we will just not support systems that
> have offsets. This makes things much easier and still supports all the
> use cases we are interested in.
>
>> So you are designing something that is built from scratch to only work
>> on a specific limited category of systems and is also incompatible with
>> virtualization.
>
> Yes, we are starting with support for specific use cases. Almost all
> technology starts that way. DAX has been in the kernel for years and
> only recently has someone submitted patches for it to support pmem on
> powerpc. This is not unusual. If you had forced the pmem developers to
> support all architectures in existence before allowing them upstream
> they couldn't possibly be as far as they are today.

The difference is that there was nothing fundamental in the core
design of pmem + DAX that prevented other archs from growing pmem
support. THP and memory hotplug existed on other architectures, and
they just needed to plug in their arch-specific enabling. p2p support
needs the same starting point of something more than one architecture
can plug into, and handling the bus address offset case needs to be
incorporated into the design.

pmem + DAX did not change the meaning of what a dma_addr_t is; p2p does.
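
To make the offset point concrete: the CPU physical address of a BAR and
the address a peer device must emit on the bus need not be equal, which
pci_bus_address() already expresses on the CPU side. Illustrative only:

#include <linux/pci.h>

static void show_bar_addresses(struct pci_dev *pdev, int bar)
{
	resource_size_t cpu_addr = pci_resource_start(pdev, bar);
	dma_addr_t bus_addr = pci_bus_address(pdev, bar);

	/*
	 * On offset-free systems these match; on others they do not,
	 * so a p2p design has to carry the bus-side view somewhere.
	 */
	dev_info(&pdev->dev, "BAR%d: cpu %pa bus %pad\n",
		 bar, &cpu_addr, &bus_addr);
}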

> Virtualization specifically would be a _lot_ more difficult than simply
> supporting offsets. The actual topology of the bus will probably be lost
> on the guest OS and it would therefore have a difficult time figuring out
> when it's acceptable to use p2pmem. I also have a difficult time seeing
> a use case for it, and thus I have a hard time with the argument that we
> can't support the use cases that do want it because the use cases that
> don't want it (perhaps yet) won't work.
>
>> This is an interesting experiment to look at, I suppose, but if you
>> ever want this upstream I would like at least for you to develop a
>> strategy to support the wider case, if not an actual implementation.
>
> I think there are plenty of avenues forward to support offsets, etc.
> It's just work. Nothing we'd be proposing would be incompatible with it.
> We just don't want to have to do it all upfront especially when no one
> really knows how well various architectures' hardware supports this or
> if anyone even wants to run it on systems such as those. (Keep in mind
> this is a pretty specific optimization that mostly helps systems
> designed in specific ways -- not a general "everybody gets faster" type
> situation.) Get the cases working that we know will work, can easily
> support, and that people actually want. Then expand it to support others as people
> come around with hardware to test and use cases for it.

I think you need to give other archs a chance to support this with a
design that considers the offset case as a first-class citizen rather
than an afterthought.

Subject: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Logan Gunthorpe

Hello,

As discussed at LSF/MM we'd like to present our work to enable
copy offload support in NVMe fabrics RDMA targets. We'd appreciate
some review and feedback from the community on our direction.
This series is not intended to go upstream at this point.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device, avoiding system memory entirely. The upsides of this are better
QoS for applications running on the CPU and using system memory, and
lower PCI bandwidth required to the CPU (such that systems could be
designed with fewer lanes connected to the CPU). At present, however,
the trade-off is a reduction in overall throughput, largely due to
hardware issues that would certainly improve in the future.

Due to these trade-offs, we've designed the system to enable use of the
PCI memory only in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This means many setups that would likely
work well will not be supported, so that we can be more confident it
will work and not place any responsibility on the user to understand
their topology. (We've chosen to go this route based on feedback we
received at LSF.)
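
Roughly, that restriction amounts to a check of the following shape (a
sketch, not the code in this series): every participating device must
resolve to the same PCIe switch upstream port.

#include <linux/pci.h>

static struct pci_dev *find_upstream_switch_port(struct pci_dev *pdev)
{
	struct pci_dev *up = pci_upstream_bridge(pdev);

	while (up) {
		if (pci_pcie_type(up) == PCI_EXP_TYPE_UPSTREAM)
			return up;
		up = pci_upstream_bridge(up);
	}
	return NULL;	/* no switch above this device */
}

static bool devices_behind_same_switch(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *ua = find_upstream_switch_port(a);
	struct pci_dev *ub = find_upstream_switch_port(b);

	/* Only use p2pmem when no host bridge would be crossed. */
	return ua && ua == ub;
}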

In order to enable this functionality, we introduce a new p2pmem device
which can be instantiated by PCI drivers. The device registers some
PCI memory as ZONE_DEVICE and provides a genalloc-based allocator for
users of these devices to get buffers. We give an example of enabling
p2p memory with the cxgb4 driver; however, these devices currently have
some hardware issues that prevent their use, so we will likely be
dropping this patch in the future. Ideally, we'd want to enable this
functionality with NVMe CMB buffers, but we don't have any hardware
with this feature at this time.
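
In rough outline the provider side looks like the following (names are
illustrative, and it assumes the BAR has already been mapped into
ZONE_DEVICE, e.g. via devm_memremap_pages(), giving kernel virtual
address 'vaddr'):

#include <linux/genalloc.h>
#include <linux/mm.h>

struct p2pmem_dev {
	struct gen_pool *pool;
};

static int p2pmem_add_bar_mem(struct p2pmem_dev *p, void *vaddr,
			      phys_addr_t bar_phys, size_t size)
{
	/* Page-granular allocator over the BAR region. */
	p->pool = gen_pool_create(PAGE_SHIFT, -1);
	if (!p->pool)
		return -ENOMEM;

	/*
	 * Track the physical (BAR) address alongside the virtual one so
	 * allocations can later be turned back into pages/addresses.
	 */
	return gen_pool_add_virt(p->pool, (unsigned long)vaddr, bar_phys,
				 size, -1);
}

static void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
{
	return (void *)gen_pool_alloc(p->pool, size);
}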

In nvmet-rdma, we attempt to get an appropriate p2pmem device at
queue creation time and, if a suitable one is found, we use it for
all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs
attribute is also created, which must be set before any p2pmem use
is attempted.
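
In outline (helper and field names here are illustrative rather than the
exact ones in the patches):

/* Sketch only: field and helper names are illustrative. */
static void nvmet_rdma_setup_p2pmem(struct nvmet_rdma_queue *queue)
{
	if (!queue->port->allow_p2pmem)		/* the configfs switch */
		return;

	/*
	 * Pick a p2pmem device usable by both the RDMA NIC and the
	 * namespaces' NVMe devices; NULL means this queue falls back
	 * to system memory for its (non-inlined) buffers.
	 */
	queue->p2pmem = find_compatible_p2pmem(queue->nic_dev,
					       queue->nvme_devs);
}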

This patchset also includes a more controversial patch which provides an
interface for userspace to obtain p2pmem buffers through an mmap call on
a cdev. This enables userspace to fairly easily use p2pmem with RDMA and
O_DIRECT interfaces. However, the user would be entirely responsible for
knowing what they're doing, inspecting sysfs to understand the PCI
topology, and only using it in sane situations.
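
For example, a user might do something like the following (the device
node and file names are illustrative and error handling is trimmed):

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	int p2p_fd = open("/dev/p2pmem0", O_RDWR);
	int file_fd = open("/mnt/nvme/data", O_RDWR | O_DIRECT);

	if (p2p_fd < 0 || file_fd < 0)
		return 1;

	/* Map PCI BAR memory into the process... */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 p2p_fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * ...and hand it to O_DIRECT I/O (or register it as an RDMA MR)
	 * so the data moves device-to-device rather than through RAM.
	 */
	if (read(file_fd, buf, len) < 0)
		perror("read");

	munmap(buf, len);
	close(file_fd);
	close(p2p_fd);
	return 0;
}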

Thanks,

Logan


Logan Gunthorpe (6):
  Introduce Peer-to-Peer memory (p2pmem) device
  nvmet: Use p2pmem in nvme target
  scatterlist: Modify SG copy functions to support io memory.
  nvmet: Be careful about using iomem accesses when dealing with p2pmem
  p2pmem: Support device removal
  p2pmem: Added char device user interface

Steve Wise (2):
  cxgb4: setup pcie memory window 4 and create p2pmem region
  p2pmem: Add debugfs "stats" file

 drivers/memory/Kconfig                          |   5 +
 drivers/memory/Makefile                         |   2 +
 drivers/memory/p2pmem.c                         | 697 ++++++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  97 +++-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   5 +
 drivers/nvme/target/configfs.c                  |  31 ++
 drivers/nvme/target/core.c                      |  18 +-
 drivers/nvme/target/fabrics-cmd.c               |  28 +-
 drivers/nvme/target/nvmet.h                     |   2 +
 drivers/nvme/target/rdma.c                      | 183 +++++--
 drivers/scsi/scsi_debug.c                       |   7 +-
 include/linux/p2pmem.h                          | 120 ++++
 include/linux/scatterlist.h                     |   7 +-
 lib/scatterlist.c                               |  64 ++-
 15 files changed, 1189 insertions(+), 80 deletions(-)
 create mode 100644 drivers/memory/p2pmem.c
 create mode 100644 include/linux/p2pmem.h

--
2.1.4
