linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
@ 2017-04-16 15:44 Dan Williams
  2017-04-16 16:47 ` Logan Gunthorpe
  2017-04-16 22:23 ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-16 15:44 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Bjorn Helgaas, Jason Gunthorpe,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Sat, Apr 15, 2017 at 10:36 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 15/04/17 04:17 PM, Benjamin Herrenschmidt wrote:
>> You can't. If the iommu is on, everything is remapped. Or do you mean
>> to have dma_map_* not do a remapping ?
>
> Well, yes, you'd have to change the code so that iomem pages do not get
> remapped and the raw BAR address is passed to the DMA engine. I said
> specifically we haven't done this at this time but it really doesn't
> seem like an unsolvable problem. It is something we will need to address
> before a proper patch set is posted though.
>
>> That's the problem again, same as before, for that to work, the
>> dma_map_* ops would have to do something special that depends on *both*
>> the source and target device.
>
> No, I don't think you have to do things different based on the source.
> Have the p2pmem device layer restrict allocating p2pmem based on the
> devices in use (similar to how the RFC code works now) and when the dma
> mapping code sees iomem pages it just needs to leave the address alone
> so it's used directly by the dma in question.
>
> It's much better to make the decision on which memory to use when you
> allocate it. If you wait until you map it, it would be a pain to fall
> back to system memory if it doesn't look like it will work. So, if when
> you allocate it, you know everything will work you just need the dma
> mapping layer to stay out of the way.

I think we very much want the dma mapping layer to be in the way.
It's the only sane semantic we have to communicate this translation.

>
>> The dma_ops today are architecture specific and have no way to
>> differenciate between normal and those special P2P DMA pages.
>
> Correct, unless Dan's idea works (which will need some investigation),
> we'd need a flag in struct page or some other similar method to
> determine that these are special iomem pages.
>
>>> Though if it does, I'd expect
>>> everything would still work you just wouldn't get the performance or
>>> traffic flow you are looking for. We've been testing with the software
>>> iommu which doesn't have this problem.
>>
>> So first, no, it's more than "you wouldn't get the performance". On
>> some systems it may also just not work. Also what do you mean by "the
>> SW iommu doesn't have this problem" ? It catches the fact that
>> addresses don't point to RAM and maps differently ?
>
> I haven't tested it but I can't imagine why an iommu would not correctly
> map the memory in the bar. But that's _way_ beside the point. We
> _really_ want to avoid that situation anyway. If the iommu maps the
> memory it defeats what we are trying to accomplish.
>
> I believe the sotfware iommu only uses bounce buffers if the DMA engine
> in use cannot address the memory. So in most cases, with modern
> hardware, it just passes the BAR's address to the DMA engine and
> everything works. The code posted in the RFC does in fact work without
> needing to do any of this fussing.
>
>>>> The problem is that the latter while seemingly easier, is also slower
>>>> and not supported by all platforms and architectures (for example,
>>>> POWER currently won't allow it, or rather only allows a store-only
>>>> subset of it under special circumstances).
>>>
>>> Yes, I think situations where we have to cross host bridges will remain
>>> unsupported by this work for a long time. There are two many cases where
>>> it just doesn't work or it performs too poorly to be useful.
>>
>> And the situation where you don't cross bridges is the one where you
>> need to also take into account the offsets.
>
> I think for the first incarnation we will just not support systems that
> have offsets. This makes things much easier and still supports all the
> use cases we are interested in.
>
>> So you are designing something that is built from scratch to only work
>> on a specific limited category of systems and is also incompatible with
>> virtualization.
>
> Yes, we are starting with support for specific use cases. Almost all
> technology starts that way. Dax has been in the kernel for years and
> only recently has someone submitted patches for it to support pmem on
> powerpc. This is not unusual. If you had forced the pmem developers to
> support all architectures in existence before allowing them upstream
> they couldn't possibly be as far as they are today.

The difference is that there was nothing fundamental in the core
design of pmem + DAX that prevented other archs from growing pmem
support. THP and memory hotplug existed on other architectures and
they just need to plug in their arch-specific enabling. p2p support
needs the same starting point of something more than one architecture
can plug into, and handling the bus address offset case needs to be
incorporated into the design.

pmem + dax did not change the meaning of what a dma_addr_t is, p2p does.

> Virtualization specifically would be a _lot_ more difficult than simply
> supporting offsets. The actual topology of the bus will probably be lost
> on the guest OS and it would therefor have a difficult time figuring out
> when it's acceptable to use p2pmem. I also have a difficult time seeing
> a use case for it and thus I have a hard time with the argument that we
> can't support use cases that do want it because use cases that don't
> want it (perhaps yet) won't work.
>
>> This is an interesting experiement to look at I suppose, but if you
>> ever want this upstream I would like at least for you to develop a
>> strategy to support the wider case, if not an actual implementation.
>
> I think there are plenty of avenues forward to support offsets, etc.
> It's just work. Nothing we'd be proposing would be incompatible with it.
> We just don't want to have to do it all upfront especially when no one
> really knows how well various architecture's hardware supports this or
> if anyone even wants to run it on systems such as those. (Keep in mind
> this is a pretty specific optimization that mostly helps systems
> designed in specific ways -- not a general "everybody gets faster" type
> situation.) Get the cases working we know will work, can easily support
> and people actually want.  Then expand it to support others as people
> come around with hardware to test and use cases for it.

I think you need to give other archs a chance to support this with a
design that considers the offset case as a first class citizen rather
than an afterthought.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16 15:44 [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory Dan Williams
@ 2017-04-16 16:47 ` Logan Gunthorpe
  2017-04-16 22:32   ` Benjamin Herrenschmidt
  2017-04-16 22:23 ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-16 16:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Benjamin Herrenschmidt, Bjorn Helgaas, Jason Gunthorpe,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 16/04/17 09:44 AM, Dan Williams wrote:
> I think we very much want the dma mapping layer to be in the way.
> It's the only sane semantic we have to communicate this translation.

Yes, I wasn't proposing bypassing that layer, per say. I just meant that
the layer would, in the end, have to return the address without any
translations.

> The difference is that there was nothing fundamental in the core
> design of pmem + DAX that prevented other archs from growing pmem
> support. THP and memory hotplug existed on other architectures and
> they just need to plug in their arch-specific enabling. p2p support
> needs the same starting point of something more than one architecture
> can plug into, and handling the bus address offset case needs to be
> incorporated into the design.

I don't think there's a difference there either. There'd have been
nothing fundamental in our core design that says offsets couldn't have
been added later.

> pmem + dax did not change the meaning of what a dma_addr_t is, p2p does.

I don't think p2p actually really changes the meaning of dma_addr_t
either. We are just putting addresses in there that weren't used
previously. Our RFC makes no changes to anything even remotely related
to dma_addr_t.

> I think you need to give other archs a chance to support this with a
> design that considers the offset case as a first class citizen rather
> than an afterthought.

I'll consider this. Given the fact I can use your existing
get_dev_pagemap infrastructure to look up the p2pmem device this
probably isn't as hard as I thought it would be anyway (we probably
don't even need a page flag). We'd just have lookup the dev_pagemap,
test if it's a p2pmem device, and if so, call a p2pmem_dma_map function
which could apply the offset or do any other arch specific logic (if
necessary).

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16 15:44 [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory Dan Williams
  2017-04-16 16:47 ` Logan Gunthorpe
@ 2017-04-16 22:23 ` Benjamin Herrenschmidt
  2017-04-18 16:45   ` Jason Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-16 22:23 UTC (permalink / raw)
  To: Dan Williams, Logan Gunthorpe
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Sun, 2017-04-16 at 08:44 -0700, Dan Williams wrote:
> The difference is that there was nothing fundamental in the core
> design of pmem + DAX that prevented other archs from growing pmem
> support.

Indeed. In fact we have work in progress support for pmem on power
using experimental HW.

> THP and memory hotplug existed on other architectures and
> they just need to plug in their arch-specific enabling. p2p support
> needs the same starting point of something more than one architecture
> can plug into, and handling the bus address offset case needs to be
> incorporated into the design.
> 
> pmem + dax did not change the meaning of what a dma_addr_t is, p2p does.

The more I think about it, the more I tend toward something along the
lines of having the arch DMA ops being able to quickly differentiate
between "normal" memory (which includes non-PCI pmem in some cases,
it's an architecture choice I suppose) and "special device" (page flag
? pfn bit ? ... there are options).

>From there, we keep our existing fast path for the normal case.

For the special case, we need to provide a fast lookup mechanism
(assuming we can't stash enough stuff in struct page or the pfn)
to get back to a struct of some sort that provides the necessary
information to resolve the translation.

This *could* be something like a struct p2mem device that carries
a special set of DMA ops, though we probably shouldn't make the generic
structure PCI specific.

This is a slightly slower path, but that "stub" structure allows the
special DMA ops to provide the necessary bus-specific knowledge, which
for PCI for example, can check whether the devices are on the same
segment, whether the switches are configured to allow p2p, etc...

What form should that fast lookup take ? It's not completely clear to
me at that point. We could start with a simple linear lookup I suppose
and improve in a second stage.

Of course this pipes into the old discussion about disconnecting
the DMA ops from struct page. If we keep struct page, any device that
wants to be a potential DMA target will need to do something "special"
to create those struct pages etc.. though we could make that a simple
pci helper that pops the necessary bits and pieces for a given BAR &
range.

If we don't need struct page, then it might be possible to hide it all
in the PCI infrastructure.

> > Virtualization specifically would be a _lot_ more difficult than simply
> > supporting offsets. The actual topology of the bus will probably be lost
> > on the guest OS and it would therefor have a difficult time figuring out
> > when it's acceptable to use p2pmem. I also have a difficult time seeing
> > a use case for it and thus I have a hard time with the argument that we
> > can't support use cases that do want it because use cases that don't
> > want it (perhaps yet) won't work.
> > 
> > > This is an interesting experiement to look at I suppose, but if you
> > > ever want this upstream I would like at least for you to develop a
> > > strategy to support the wider case, if not an actual implementation.
> > 
> > I think there are plenty of avenues forward to support offsets, etc.
> > It's just work. Nothing we'd be proposing would be incompatible with it.
> > We just don't want to have to do it all upfront especially when no one
> > really knows how well various architecture's hardware supports this or
> > if anyone even wants to run it on systems such as those. (Keep in mind
> > this is a pretty specific optimization that mostly helps systems
> > designed in specific ways -- not a general "everybody gets faster" type
> > situation.) Get the cases working we know will work, can easily support
> > and people actually want.  Then expand it to support others as people
> > come around with hardware to test and use cases for it.
> 
> I think you need to give other archs a chance to support this with a
> design that considers the offset case as a first class citizen rather
> than an afterthought.

Thanks :-) There's a reason why I'm insisting on this. We have constant
requests for this today. We have hacks in the GPU drivers to do it for
GPUs behind a switch, but those are just that, ad-hoc hacks in the
drivers. We have similar grossness around the corner with some CAPI
NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
to whack nVME devices.

I'm very interested in a more generic solution to deal with the problem
of P2P between devices. I'm happy to contribute with code to handle the
powerpc bits but we need to agree on the design first :)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16 16:47 ` Logan Gunthorpe
@ 2017-04-16 22:32   ` Benjamin Herrenschmidt
  2017-04-17  5:13     ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-16 22:32 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Sun, 2017-04-16 at 10:47 -0600, Logan Gunthorpe wrote:
> > I think you need to give other archs a chance to support this with a
> > design that considers the offset case as a first class citizen rather
> > than an afterthought.
> 
> I'll consider this. Given the fact I can use your existing
> get_dev_pagemap infrastructure to look up the p2pmem device this
> probably isn't as hard as I thought it would be anyway (we probably
> don't even need a page flag). We'd just have lookup the dev_pagemap,
> test if it's a p2pmem device, and if so, call a p2pmem_dma_map function
> which could apply the offset or do any other arch specific logic (if
> necessary).

I'm still not 100% why do you need a "p2mem device" mind you ...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16 22:32   ` Benjamin Herrenschmidt
@ 2017-04-17  5:13     ` Logan Gunthorpe
  2017-04-17  7:20       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-17  5:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 16/04/17 04:32 PM, Benjamin Herrenschmidt wrote:
>> I'll consider this. Given the fact I can use your existing
>> get_dev_pagemap infrastructure to look up the p2pmem device this
>> probably isn't as hard as I thought it would be anyway (we probably
>> don't even need a page flag). We'd just have lookup the dev_pagemap,
>> test if it's a p2pmem device, and if so, call a p2pmem_dma_map function
>> which could apply the offset or do any other arch specific logic (if
>> necessary).
> 
> I'm still not 100% why do you need a "p2mem device" mind you ...

Well, you don't "need" it but it is a design choice that I think makes a
lot of sense for the following reasons:

1) p2pmem is in fact a device on the pci bus. A pci driver will need to
set it up and create the device and thus it will have a natural parent
pci device. Instantiating a struct device for it means it will appear in
the device hierarchy and one can use that to reason about its position
in the topology.

2) In order to create the struct pages we use the ZONE_DEVICE
infrastructure which requires a struct device. (See
devm_memremap_pages.) This amazingly gets us the get_dev_pagemap
architecture which also uses a struct device. So by using a p2pmem
device we can go from struct page to struct device to p2pmem device
quickly and effortlessly.

3) You wouldn't want to use the pci's struct device because it doesn't
really describe what's going on. For example, there may be multiple
devices on the pci device in question: eg. an NVME card and some p2pmem.
Or it could be a NIC with some p2pmem. Or it could just be p2pmem by
itself. And the logic to figure out what memory is available and where
the address is will be non-standard so it's really straightforward to
have any pci driver just instantiate a p2pmem device.

It is probably worth you reading the RFC patches at this point to get a
better feel for this.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-17  5:13     ` Logan Gunthorpe
@ 2017-04-17  7:20       ` Benjamin Herrenschmidt
  2017-04-17 16:52         ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-17  7:20 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Sun, 2017-04-16 at 23:13 -0600, Logan Gunthorpe wrote:
> 
> > 
> > I'm still not 100% why do you need a "p2mem device" mind you ...
> 
> Well, you don't "need" it but it is a design choice that I think makes a
> lot of sense for the following reasons:
> 
> 1) p2pmem is in fact a device on the pci bus. A pci driver will need to
> set it up and create the device and thus it will have a natural parent
> pci device. Instantiating a struct device for it means it will appear in
> the device hierarchy and one can use that to reason about its position
> in the topology.

But is it ? For example take a GPU, does it, in your scheme, need an
additional "p2pmem" child ? Why can't the GPU driver just use some
helper to instantiate the necessary struct pages ? What does having an
actual "struct device" child buys you ?

> 2) In order to create the struct pages we use the ZONE_DEVICE
> infrastructure which requires a struct device. (See
> devm_memremap_pages.)

Yup, but you already have one in the actual pci_dev ... What is the
benefit of adding a second one ?

>  This amazingly gets us the get_dev_pagemap
> architecture which also uses a struct device. So by using a p2pmem
> device we can go from struct page to struct device to p2pmem device
> quickly and effortlessly.

Which isn't terribly useful in itself right ? What you care about is
the "enclosing" pci_dev no ? Or am I missing something ?

> 3) You wouldn't want to use the pci's struct device because it doesn't
> really describe what's going on. For example, there may be multiple
> devices on the pci device in question: eg. an NVME card and some p2pmem.

What is "some p2pmem" ?

> Or it could be a NIC with some p2pmem.

Again what is "some p2pmem" ?

That a device might have some memory-like buffer space is all well and
good but does it need to be specifically distinguished at the device
level ? It could be inherent to what the device is... for example again
take the GPU example, why would you call the FB memory "p2pmem" ? 

>  Or it could just be p2pmem by itself. And the logic to figure out what
>  memory is available and where
> the address is will be non-standard so it's really straightforward to
> have any pci driver just instantiate a p2pmem device.

Again I'm not sure why it needs to "instanciate a p2pmem" device. Maybe
it's the term "p2pmem" that offputs me. If p2pmem allowed to have a
standard way to lookup the various offsets etc... I mentioned earlier,
then yes, it would make sense to have it as a staging point. As-is, I
don't know. 

> It is probably worth you reading the RFC patches at this point to get a
> better feel for this.

Yup, I'll have another look a bit more in depth.

Cheers,
Ben.


> Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-17  7:20       ` Benjamin Herrenschmidt
@ 2017-04-17 16:52         ` Logan Gunthorpe
  2017-04-17 17:04           ` Dan Williams
                             ` (2 more replies)
  0 siblings, 3 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-17 16:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
> But is it ? For example take a GPU, does it, in your scheme, need an
> additional "p2pmem" child ? Why can't the GPU driver just use some
> helper to instantiate the necessary struct pages ? What does having an
> actual "struct device" child buys you ?

Yes, in this scheme, it needs an additional p2pmem child. Why is that an
issue? It certainly makes it a lot easier for the user to understand the
p2pmem memory in the system (through the sysfs tree) and reason about
the topology and when to use it. This is important.

> 
>> 2) In order to create the struct pages we use the ZONE_DEVICE
>> infrastructure which requires a struct device. (See
>> devm_memremap_pages.)
> 
> Yup, but you already have one in the actual pci_dev ... What is the
> benefit of adding a second one ?

But that would tie all of this very tightly to be pci only and may get
hard to differentiate if more users of ZONE_DEVICE crop up who happen to
be using a pci device. Having a specific class for this makes it very
clear how this memory would be handled. For example, although I haven't
looked into it, this could very well be a point of conflict with HMM. If
they were to use the pci device to populate the dev_pagemap then we
couldn't also use the pci device. I feel it's much better for users of
dev_pagemap to have their struct devices they own to avoid such conflicts.

> 
>>  This amazingly gets us the get_dev_pagemap
>> architecture which also uses a struct device. So by using a p2pmem
>> device we can go from struct page to struct device to p2pmem device
>> quickly and effortlessly.
> 
> Which isn't terribly useful in itself right ? What you care about is
> the "enclosing" pci_dev no ? Or am I missing something ?

Sure it is. What if we want to someday support p2pmem that's on another bus?

>> 3) You wouldn't want to use the pci's struct device because it doesn't
>> really describe what's going on. For example, there may be multiple
>> devices on the pci device in question: eg. an NVME card and some p2pmem.
> 
> What is "some p2pmem" ?
>> Or it could be a NIC with some p2pmem.
> 
> Again what is "some p2pmem" ?

Some device local memory intended for use as a DMA target from a
neighbour device or itself. On a PCI device, this would be a BAR, or a
portion of a BAR with memory behind it.

Keep in mind device classes tend to carve out common use cases and don't
have a one to one mapping with a physical pci card.

> That a device might have some memory-like buffer space is all well and
> good but does it need to be specifically distinguished at the device
> level ? It could be inherent to what the device is... for example again
> take the GPU example, why would you call the FB memory "p2pmem" ? 

Well if you are using it for p2p transactions why wouldn't you call it
p2pmem? There's no technical downside here except some vague argument
over naming. Once registered as p2pmem, that device will handle all the
dma map stuff for you and have a central obvious place to put code which
helps decide whether to use it or not based on topology.

I can certainly see an issue you'd have with the current RFC in that the
p2pmem device currently also handles memory allocation which a GPU would
 want to do itself. There are plenty of solutions to this though: we
could provide hooks for the parent device to override allocation or
something like that. However, the use cases I'm concerned with don't do
their own allocation so that is an important feature for them.

> Again I'm not sure why it needs to "instanciate a p2pmem" device. Maybe
> it's the term "p2pmem" that offputs me. If p2pmem allowed to have a
> standard way to lookup the various offsets etc... I mentioned earlier,
> then yes, it would make sense to have it as a staging point. As-is, I
> don't know. 

Well of course, at some point it would have a standard way to lookup
offsets and figure out what's necessary for a mapping. We wouldn't make
that separate from this, that would make no sense.

I also forgot:

4) We need someway in the kernel to configure drivers that use p2pmem.
That means it needs a unique name that the user can understand, lookup
and pass to other drivers. Then a way for those drivers to find it in
the system. A specific device class gets that for us in a very simple
fashion. We also don't want to have drivers like nvmet having to walk
every pci device to figure out where the p2p memory is and whether it
can use it.

IMO there are many clear benefits here and you haven't really offered an
alternative that provides the same features and potential for future use
cases.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-17 16:52         ` Logan Gunthorpe
@ 2017-04-17 17:04           ` Dan Williams
  2017-04-18  5:22             ` Logan Gunthorpe
  2017-04-17 18:04           ` Jerome Glisse
  2017-04-17 21:11           ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-17 17:04 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Bjorn Helgaas, Jason Gunthorpe,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Mon, Apr 17, 2017 at 9:52 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
>> But is it ? For example take a GPU, does it, in your scheme, need an
>> additional "p2pmem" child ? Why can't the GPU driver just use some
>> helper to instantiate the necessary struct pages ? What does having an
>> actual "struct device" child buys you ?
>
> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
> issue? It certainly makes it a lot easier for the user to understand the
> p2pmem memory in the system (through the sysfs tree) and reason about
> the topology and when to use it. This is important.

I think you want to go the other way in the hierarchy and find a
shared *parent* to land the p2pmem capability. Because that same agent
is going to be responsible handling address translation for the peers.

>>> 2) In order to create the struct pages we use the ZONE_DEVICE
>>> infrastructure which requires a struct device. (See
>>> devm_memremap_pages.)
>>
>> Yup, but you already have one in the actual pci_dev ... What is the
>> benefit of adding a second one ?
>
> But that would tie all of this very tightly to be pci only and may get
> hard to differentiate if more users of ZONE_DEVICE crop up who happen to
> be using a pci device. Having a specific class for this makes it very
> clear how this memory would be handled. For example, although I haven't
> looked into it, this could very well be a point of conflict with HMM. If
> they were to use the pci device to populate the dev_pagemap then we
> couldn't also use the pci device. I feel it's much better for users of
> dev_pagemap to have their struct devices they own to avoid such conflicts.

Peer-dma is always going to be a property of the bus and not the end
devices. Requiring each bus implementation to explicitly enable
peer-to-peer support is a feature not a bug.

>>>  This amazingly gets us the get_dev_pagemap
>>> architecture which also uses a struct device. So by using a p2pmem
>>> device we can go from struct page to struct device to p2pmem device
>>> quickly and effortlessly.
>>
>> Which isn't terribly useful in itself right ? What you care about is
>> the "enclosing" pci_dev no ? Or am I missing something ?
>
> Sure it is. What if we want to someday support p2pmem that's on another bus?

We shouldn't design for some future possible use case. Solve it for
pci and when / if another bus comes along then look at a more generic
abstraction.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-17 16:52         ` Logan Gunthorpe
  2017-04-17 17:04           ` Dan Williams
@ 2017-04-17 18:04           ` Jerome Glisse
  2017-04-18  6:14             ` Logan Gunthorpe
  2017-04-17 21:11           ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 105+ messages in thread
From: Jerome Glisse @ 2017-04-17 18:04 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel

On Mon, Apr 17, 2017 at 10:52:29AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
> > But is it ? For example take a GPU, does it, in your scheme, need an
> > additional "p2pmem" child ? Why can't the GPU driver just use some
> > helper to instantiate the necessary struct pages ? What does having an
> > actual "struct device" child buys you ?
> 
> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
> issue? It certainly makes it a lot easier for the user to understand the
> p2pmem memory in the system (through the sysfs tree) and reason about
> the topology and when to use it. This is important.

I disagree here. I would rather see Peer-to-Peer mapping as a form
of helper so that device driver can opt-in for multiple mecanisms
concurrently. Like HMM and p2p.

Also it seems you are having a static vision for p2p. For GPU and
network the use case is you move some buffer into the device memory
and then you create mapping for some network adapter while the buffer
is in device memory. But this is only temporary and buffer might
move to different device memory. So usecase is highly dynamic (well
mapping lifetime is still probably few second/minutes).

I see no reason for exposing sysfs tree to userspace for all this.
This isn't too dynamic, either 2 devices can access each others
memory, either they can't. This can be hidden through the device
kernel API. Again for GPU the idea is that it is always do-able
in the sense that when it is not you fallback to using system
memory.


> >> 2) In order to create the struct pages we use the ZONE_DEVICE
> >> infrastructure which requires a struct device. (See
> >> devm_memremap_pages.)
> > 
> > Yup, but you already have one in the actual pci_dev ... What is the
> > benefit of adding a second one ?
> 
> But that would tie all of this very tightly to be pci only and may get
> hard to differentiate if more users of ZONE_DEVICE crop up who happen to
> be using a pci device. Having a specific class for this makes it very
> clear how this memory would be handled. For example, although I haven't
> looked into it, this could very well be a point of conflict with HMM. If
> they were to use the pci device to populate the dev_pagemap then we
> couldn't also use the pci device. I feel it's much better for users of
> dev_pagemap to have their struct devices they own to avoid such conflicts.

Yes this could conflict and that's why i would rather see this as a set
of helper like HMM is doing. So device driver can opt-in HMM and p2pmem
at the same time.


> >>  This amazingly gets us the get_dev_pagemap
> >> architecture which also uses a struct device. So by using a p2pmem
> >> device we can go from struct page to struct device to p2pmem device
> >> quickly and effortlessly.
> > 
> > Which isn't terribly useful in itself right ? What you care about is
> > the "enclosing" pci_dev no ? Or am I missing something ?
> 
> Sure it is. What if we want to someday support p2pmem that's on another bus?
> 
> >> 3) You wouldn't want to use the pci's struct device because it doesn't
> >> really describe what's going on. For example, there may be multiple
> >> devices on the pci device in question: eg. an NVME card and some p2pmem.
> > 
> > What is "some p2pmem" ?
> >> Or it could be a NIC with some p2pmem.
> > 
> > Again what is "some p2pmem" ?
> 
> Some device local memory intended for use as a DMA target from a
> neighbour device or itself. On a PCI device, this would be a BAR, or a
> portion of a BAR with memory behind it.
> 
> Keep in mind device classes tend to carve out common use cases and don't
> have a one to one mapping with a physical pci card.
> 
> > That a device might have some memory-like buffer space is all well and
> > good but does it need to be specifically distinguished at the device
> > level ? It could be inherent to what the device is... for example again
> > take the GPU example, why would you call the FB memory "p2pmem" ? 
> 
> Well if you are using it for p2p transactions why wouldn't you call it
> p2pmem? There's no technical downside here except some vague argument
> over naming. Once registered as p2pmem, that device will handle all the
> dma map stuff for you and have a central obvious place to put code which
> helps decide whether to use it or not based on topology.
> 
> I can certainly see an issue you'd have with the current RFC in that the
> p2pmem device currently also handles memory allocation which a GPU would
>  want to do itself. There are plenty of solutions to this though: we
> could provide hooks for the parent device to override allocation or
> something like that. However, the use cases I'm concerned with don't do
> their own allocation so that is an important feature for them.

This seems to duplicate things that already exist in each individual
driver. If a device has memory than device driver already have some
form of memory management and most likely expose some API to userspace
to allow program to use that memory.

Peer to peer DMA mapping is orthogonal to memory management, it is
an optimization ie you have some buffer allocated through some device
driver specific IOCTL and now you want some other device to directly
access it. Having to first do another allocation in a different device
driver for that to happen seems overkill.

What you really want from device driver point of view is an helper to
first tell you if 2 device can access each other and second an helper
that allow the second device to import the other device memory to allow
direct access.


> > Again I'm not sure why it needs to "instanciate a p2pmem" device. Maybe
> > it's the term "p2pmem" that offputs me. If p2pmem allowed to have a
> > standard way to lookup the various offsets etc... I mentioned earlier,
> > then yes, it would make sense to have it as a staging point. As-is, I
> > don't know. 
> 
> Well of course, at some point it would have a standard way to lookup
> offsets and figure out what's necessary for a mapping. We wouldn't make
> that separate from this, that would make no sense.
> 
> I also forgot:
> 
> 4) We need someway in the kernel to configure drivers that use p2pmem.
> That means it needs a unique name that the user can understand, lookup
> and pass to other drivers. Then a way for those drivers to find it in
> the system. A specific device class gets that for us in a very simple
> fashion. We also don't want to have drivers like nvmet having to walk
> every pci device to figure out where the p2p memory is and whether it
> can use it.
> 
> IMO there are many clear benefits here and you haven't really offered an
> alternative that provides the same features and potential for future use
> cases.

Discovering possible peer is a onetime only thing and designing around
that is wrong in my view. There is already existing hierarchy in kernel
for that in the form of the bus hierarchy (i am thinking pci bus here).
So there is already existing way to discover this and you are just
duplicating informations here.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-17 16:52         ` Logan Gunthorpe
  2017-04-17 17:04           ` Dan Williams
  2017-04-17 18:04           ` Jerome Glisse
@ 2017-04-17 21:11           ` Benjamin Herrenschmidt
  2017-04-18  5:43             ` Logan Gunthorpe
  2 siblings, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-17 21:11 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Mon, 2017-04-17 at 10:52 -0600, Logan Gunthorpe wrote:
> 
> On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
> > But is it ? For example take a GPU, does it, in your scheme, need an
> > additional "p2pmem" child ? Why can't the GPU driver just use some
> > helper to instantiate the necessary struct pages ? What does having an
> > actual "struct device" child buys you ?
> 
> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
> issue? It certainly makes it a lot easier for the user to understand the
> p2pmem memory in the system (through the sysfs tree) and reason about
> the topology and when to use it. This is important.

Is it ? Again, you create a "concept" the user may have no idea about,
"p2pmem memory". So now any kind of memory buffer on a device can could
be use for p2p but also potentially a bunch of other things becomes
special and called "p2pmem" ...

> > > 2) In order to create the struct pages we use the ZONE_DEVICE
> > > infrastructure which requires a struct device. (See
> > > devm_memremap_pages.)
> > 
> > Yup, but you already have one in the actual pci_dev ... What is the
> > benefit of adding a second one ?
> 
> But that would tie all of this very tightly to be pci only and may get
> hard to differentiate if more users of ZONE_DEVICE crop up who happen to
> be using a pci device.

But what do you have in p2pmem that somebody benefits from. Again I
don't understand what that "p2pmem" device buys you in term of
functionality vs. having the device just instanciate the pages.

Now having some kind of way to override the dma_ops, yes I do get that,
and it could be that this "p2pmem" is typically the way to do it, but
at the moment you don't even have that. So I'm a bit at a loss here.
 
>  Having a specific class for this makes it very
> clear how this memory would be handled.

But it doesn't *have* to be. Again, take my GPU example. The fact that
a NIC might be able to DMA into it doesn't make it specifically "p2p
memory".

Essentially you are saying that any device that happens to have a piece
of mappable "memory" (or something that behaves like it) and can be
DMA'ed into should now have that "p2pmem" thing attached to it.

Now take an example where that becomes really awkward (it's also a real
example of something people want to do). I have a NIC and a GPU, the
NIC DMA's data to/from the GPU, but they also want to poke at each
other doorbell, the GPU to kick the NIC into action when data is ready
to send, the NIC to poke the GPU when data has been received.

Those doorbells are MMIO registers.

So now your "p2pmem" device needs to also be laid out on top of those
MMIO registers ? It's becoming weird.

See, basically, doing peer 2 peer between devices has 3 main challenges
today: The DMA API needing struct pages, the MMIO translation issues
and the IOMMU translation issues.

You seem to create that added device as some kind of "owner" for the
struct pages, solving #1, but leave #2 and #3 alone.

Now, as I said, it could very well be that having the devmap pointer
point to some specific device-type with a well known structure to
provide solutions for #2 and #3 such as dma_ops overrides, is indeed
the right way to solve these problems.

If we go down that path, though, rather than calling it p2pmem I would
call it something like dma_target which I find much clearer especially
since it doesn't have to be just memory.

For the sole case of creating struct page's however, I fail to see the
point.

>  For example, although I haven't
> looked into it, this could very well be a point of conflict with HMM. If
> they were to use the pci device to populate the dev_pagemap then we
> couldn't also use the pci device. I feel it's much better for users of
> dev_pagemap to have their struct devices they own to avoid such conflicts.

If we are going to create some sort of struct dma_target, HMM could
potentially just look for the parent if it needs the PCI device.

> > >  This amazingly gets us the get_dev_pagemap
> > > architecture which also uses a struct device. So by using a p2pmem
> > > device we can go from struct page to struct device to p2pmem device
> > > quickly and effortlessly.
> > 
> > Which isn't terribly useful in itself right ? What you care about is
> > the "enclosing" pci_dev no ? Or am I missing something ?
> 
> Sure it is. What if we want to someday support p2pmem that's on another bus?

But why not directly use that other bus' device in that case ?

> > > 3) You wouldn't want to use the pci's struct device because it doesn't
> > > really describe what's going on. For example, there may be multiple
> > > devices on the pci device in question: eg. an NVME card and some p2pmem.
> > 
> > What is "some p2pmem" ?
> > > Or it could be a NIC with some p2pmem.
> > 
> > Again what is "some p2pmem" ?
> 
> Some device local memory intended for use as a DMA target from a
> neighbour device or itself. On a PCI device, this would be a BAR, or a
> portion of a BAR with memory behind it.

So back to my base objections:

 - There is no reason why this has to just be memory. There are good
reasons to want to do peer DMA to MMIO registers (see above)

 - There is no reason why that memory on a device is specifically
dedicated to "peer to peer" and thus calling it "p2pmem" is something
I find actually confusing.

> Keep in mind device classes tend to carve out common use cases and don't
> have a one to one mapping with a physical pci card.
> 
> > That a device might have some memory-like buffer space is all well and
> > good but does it need to be specifically distinguished at the device
> > level ? It could be inherent to what the device is... for example again
> > take the GPU example, why would you call the FB memory "p2pmem" ? 
> 
> Well if you are using it for p2p transactions why wouldn't you call it
> p2pmem?

Im not only using it for that :)

>  There's no technical downside here except some vague argument
> over naming. Once registered as p2pmem, that device will handle all the
> dma map stuff for you and have a central obvious place to put code which
> helps decide whether to use it or not based on topology.

Except it doesn't handle any of the dma_map stuff today as far as I can
see.

> I can certainly see an issue you'd have with the current RFC in that the
> p2pmem device currently also handles memory allocation which a GPU would
>  want to do itself.

The memory allocation should be a completely orthogonal and separate
thing yes. You are conflating two completely different things now into
a single concept.

>  There are plenty of solutions to this though: we
> could provide hooks for the parent device to override allocation or
> something like that. However, the use cases I'm concerned with don't do
> their own allocation so that is an important feature for them.

No, the allocation should not even have links to the DMA peering
mechanism. This is completely orthogonal.

I feel more and more like your entire infrastructure is designed for a
special use case and conflates several problems of that specific use
case into one single "solution" rather than separating the various
problems and solving them independently.

> > Again I'm not sure why it needs to "instanciate a p2pmem" device. Maybe
> > it's the term "p2pmem" that offputs me. If p2pmem allowed to have a
> > standard way to lookup the various offsets etc... I mentioned earlier,
> > then yes, it would make sense to have it as a staging point. As-is, I
> > don't know. 
> 
> Well of course, at some point it would have a standard way to lookup
> offsets and figure out what's necessary for a mapping. We wouldn't make
> that separate from this, that would make no sense.
> 
> I also forgot:
> 
> 4) We need someway in the kernel to configure drivers that use p2pmem.
> That means it needs a unique name that the user can understand, lookup
> and pass to other drivers. Then a way for those drivers to find it in
> the system. A specific device class gets that for us in a very simple
> fashion. We also don't want to have drivers like nvmet having to walk
> every pci device to figure out where the p2p memory is and whether it
> can use it.
> 
> IMO there are many clear benefits here and you haven't really offered an
> alternative that provides the same features and potential for future use
> cases.
> 
> Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-17 17:04           ` Dan Williams
@ 2017-04-18  5:22             ` Logan Gunthorpe
  0 siblings, 0 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18  5:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Benjamin Herrenschmidt, Bjorn Helgaas, Jason Gunthorpe,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 17/04/17 11:04 AM, Dan Williams wrote:
>> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
>> issue? It certainly makes it a lot easier for the user to understand the
>> p2pmem memory in the system (through the sysfs tree) and reason about
>> the topology and when to use it. This is important.
> 
> I think you want to go the other way in the hierarchy and find a
> shared *parent* to land the p2pmem capability. Because that same agent
> is going to be responsible handling address translation for the peers.
>
> Peer-dma is always going to be a property of the bus and not the end
> devices. Requiring each bus implementation to explicitly enable
> peer-to-peer support is a feature not a bug.
> 
> We shouldn't design for some future possible use case. Solve it for
> pci and when / if another bus comes along then look at a more generic
> abstraction.

Thanks Dan, these are some good points. Wedding it closer to the PCI
code makes more sense to me now. I'd still think you'd want some struct
device though to appear in the device hierarchy and allow reasoning
about topology.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-17 21:11           ` Benjamin Herrenschmidt
@ 2017-04-18  5:43             ` Logan Gunthorpe
  2017-04-18  6:29               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18  5:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 17/04/17 03:11 PM, Benjamin Herrenschmidt wrote:
> Is it ? Again, you create a "concept" the user may have no idea about,
> "p2pmem memory". So now any kind of memory buffer on a device can could
> be use for p2p but also potentially a bunch of other things becomes
> special and called "p2pmem" ...

The user is going to have to have an idea about it if they are designing
systems to make use of it. I've said it before many times: this is an
optimization with significant trade-offs so the user does have to make
decisions regarding when to enable it.


> But what do you have in p2pmem that somebody benefits from. Again I
> don't understand what that "p2pmem" device buys you in term of
> functionality vs. having the device just instanciate the pages.

Well thanks for just taking a big shit on all of our work without even
reading the patches. Bravo.

> Now having some kind of way to override the dma_ops, yes I do get that,
> and it could be that this "p2pmem" is typically the way to do it, but
> at the moment you don't even have that. So I'm a bit at a loss here.

Yes, we've already said many times that this is something we will need
to add.

> But it doesn't *have* to be. Again, take my GPU example. The fact that
> a NIC might be able to DMA into it doesn't make it specifically "p2p
> memory".

Just because you use it for other things doesn't mean it can't also
provide the service of a "p2pmem" device.

> So now your "p2pmem" device needs to also be laid out on top of those
> MMIO registers ? It's becoming weird.

Yes, Max Gurtovoy has also expressed an interest in expanding this work
to cover things other than memory. He's suggested simply calling it a
p2p device, but until we figure out what exactly that all means we can't
really finalize a name.

> See, basically, doing peer 2 peer between devices has 3 main challenges
> today: The DMA API needing struct pages, the MMIO translation issues
> and the IOMMU translation issues.
> 
> You seem to create that added device as some kind of "owner" for the
> struct pages, solving #1, but leave #2 and #3 alone.

Well there are other challenges too. Like figuring out when it's
appropriate to use, tying together the device that provides the memory
with the driver tring to use it in DMA transactions, etc, etc. Our patch
set tackles these latter issues.

> If we go down that path, though, rather than calling it p2pmem I would
> call it something like dma_target which I find much clearer especially
> since it doesn't have to be just memory.

I'm not set on the name. My arguments have been specifically for the
existence of an independent struct device. But I'm not really interested
in getting into bike shedding arguments over what to call it at this
time when we don't even really know what it's going to end up doing in
the end.

> The memory allocation should be a completely orthogonal and separate
> thing yes. You are conflating two completely different things now into
> a single concept.

Well we need a uniform way for a driver trying to coordinate a p2p dma
to find and obtain memory from devices that supply it. We are not
dealing with GPUs that already have complicated allocators. We are
dealing with people adding memory to their devices for the _sole_
purpose of enabling p2p transfers. So having a common allocation setup
is seen as a benefit to us.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-17 18:04           ` Jerome Glisse
@ 2017-04-18  6:14             ` Logan Gunthorpe
  0 siblings, 0 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18  6:14 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel



On 17/04/17 12:04 PM, Jerome Glisse wrote:
> I disagree here. I would rather see Peer-to-Peer mapping as a form
> of helper so that device driver can opt-in for multiple mecanisms
> concurrently. Like HMM and p2p.

I'm not against moving some of the common stuff into a library. It
sounds like the problems p2pmem solves don't overlap much with the
problems of the GPU and moving the stuff we have in common somewhere
else seems sensible.

> Also it seems you are having a static vision for p2p. For GPU and
> network the use case is you move some buffer into the device memory
> and then you create mapping for some network adapter while the buffer
> is in device memory. But this is only temporary and buffer might
> move to different device memory. So usecase is highly dynamic (well
> mapping lifetime is still probably few second/minutes).

I feel like you will need to pin the memory while it's the target of a
DMA transaction. If some network peer is sending you data and you just
invalidated the memory it is headed to then you are just going to break
applications. But really this isn't our concern: the memory we are using
with this work will be static and not prone to disappearing.

> I see no reason for exposing sysfs tree to userspace for all this.
> This isn't too dynamic, either 2 devices can access each others
> memory, either they can't. This can be hidden through the device
> kernel API. Again for GPU the idea is that it is always do-able
> in the sense that when it is not you fallback to using system
> memory.

The user has to make a decision to use it or not. This is an
optimization with significant trade-offs that may differ significantly
based on system design.

> Yes this could conflict and that's why i would rather see this as a set
> of helper like HMM is doing. So device driver can opt-in HMM and p2pmem
> at the same time.

I don't understand how that addresses the conflict. We need to each be
using unique and identifiable struct devices in the ZONE_DEVICE
dev_pagemap so we don't apply p2p dma mappings to hmm memory and vice-versa.

> This seems to duplicate things that already exist in each individual
> driver. If a device has memory than device driver already have some
> form of memory management and most likely expose some API to userspace
> to allow program to use that memory.

> Peer to peer DMA mapping is orthogonal to memory management, it is
> an optimization ie you have some buffer allocated through some device
> driver specific IOCTL and now you want some other device to directly
> access it. Having to first do another allocation in a different device
> driver for that to happen seems overkill.

The devices we are working with are adding memory specifically for
enabling p2p applications. The memory is new and there are no allocators
for any of it yet.

Also note: we've gotten _significant_ push back against exposing any of
this memory to userspace. Letting the user unknowingly have to deal with
the issues of iomem is not anything anyone wants to see. Thus we are
dealing with in-kernel users only and they need a common interface to
get the memory from.

> What you really want from device driver point of view is an helper to
> first tell you if 2 device can access each other and second an helper
> that allow the second device to import the other device memory to allow
> direct access.

Well, actually it's a bit more complicated than that but essentially
correct: There can be N devices in the mix and quite likely another
driver completely separate from all N devices. (eg. for our main use
case we have N nvme cards being talked to through an RDMA NIC with it
all being coordinated by the nvme-target driver).


> Discovering possible peer is a onetime only thing and designing around
> that is wrong in my view. There is already existing hierarchy in kernel
> for that in the form of the bus hierarchy (i am thinking pci bus here).
> So there is already existing way to discover this and you are just
> duplicating informations here.

I really don't see the solution you are proposing here. Have the user
specify a pci device name and just have them guess which ones have
suitable memory? Or do they have to walk the entire pci tree to find
ones that have such memory? There was no "duplicate" information created
by our patch set.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18  5:43             ` Logan Gunthorpe
@ 2017-04-18  6:29               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-18  6:29 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Mon, 2017-04-17 at 23:43 -0600, Logan Gunthorpe wrote:
> 
> On 17/04/17 03:11 PM, Benjamin Herrenschmidt wrote:
> > Is it ? Again, you create a "concept" the user may have no idea about,
> > "p2pmem memory". So now any kind of memory buffer on a device can could
> > be use for p2p but also potentially a bunch of other things becomes
> > special and called "p2pmem" ...
> 
> The user is going to have to have an idea about it if they are designing
> systems to make use of it. I've said it before many times: this is an
> optimization with significant trade-offs so the user does have to make
> decisions regarding when to enable it.

Not necessarily. There are many cases where the "end user" won't have
any idea. In any case, I think we bring the story down to those two
points of conflating the allocator with the peer to peer DMA, and the
lack of generality in the approach to solve the peer to peer DMA
problem.

> > But what do you have in p2pmem that somebody benefits from. Again I
> > don't understand what that "p2pmem" device buys you in term of
> > functionality vs. having the device just instanciate the pages.
> 
> Well thanks for just taking a big shit on all of our work without even
> reading the patches. Bravo.

Now now now .... calm down. We are being civil here. I'm not shitting
on anything, I'm asking what seems to be a reasonable question in term
of benefits of the approach you have chosen. Using that sort of
language will not get you anywhere. 

> > Now having some kind of way to override the dma_ops, yes I do get that,
> > and it could be that this "p2pmem" is typically the way to do it, but
> > at the moment you don't even have that. So I'm a bit at a loss here.
> 
> Yes, we've already said many times that this is something we will need
> to add.
> 
> > But it doesn't *have* to be. Again, take my GPU example. The fact that
> > a NIC might be able to DMA into it doesn't make it specifically "p2p
> > memory".
> 
> Just because you use it for other things doesn't mean it can't also
> provide the service of a "p2pmem" device.

But there is no such thing as a "p2pmem" device.. that's what I'm
trying to tell you...

As both Jerome and I tried to explain, there are many reason why one
may want to do peer DMA into some device memory, that doesn't make that
memory some kind of "p2pmem". It's trying to stick a generic label onto
something that isn't.

That's why I'm suggesting we disconnect the two aspects. On one hand
the problem of handling p2p DMA, whether the target is some memory,
some MMIO registers, etc...

On the other hand, some generic "utility" that can optionally be used
by drivers to manage a pool of DMA memory in the device, essentially a
simple allocator.

The two things are completely orthogonal.

> > So now your "p2pmem" device needs to also be laid out on top of those
> > MMIO registers ? It's becoming weird.
> 
> Yes, Max Gurtovoy has also expressed an interest in expanding this work
> to cover things other than memory. He's suggested simply calling it a
> p2p device, but until we figure out what exactly that all means we can't
> really finalize a name.

Possibly. In any case, I think it should be separate from the
allocation.

> > See, basically, doing peer 2 peer between devices has 3 main challenges
> > today: The DMA API needing struct pages, the MMIO translation issues
> > and the IOMMU translation issues.
> > 
> > You seem to create that added device as some kind of "owner" for the
> > struct pages, solving #1, but leave #2 and #3 alone.
> 
> Well there are other challenges too. Like figuring out when it's
> appropriate to use, tying together the device that provides the memory
> with the driver tring to use it in DMA transactions, etc, etc. Our patch
> set tackles these latter issues.

But it tries to conflate the allocation, which is basically the fact
that this is some kind of "memory pool" with the problem of doing peer
DMA.

I'm advocating for separating the concepts.

> > If we go down that path, though, rather than calling it p2pmem I would
> > call it something like dma_target which I find much clearer especially
> > since it doesn't have to be just memory.
> 
> I'm not set on the name. My arguments have been specifically for the
> existence of an independent struct device. But I'm not really interested
> in getting into bike shedding arguments over what to call it at this
> time when we don't even really know what it's going to end up doing in
> the end.

It's not bike shedding. It's about taking out the allocator part and
making it clear that this isn't something to lay out on top of a pre-
decided chunk of "memory".

> > The memory allocation should be a completely orthogonal and separate
> > thing yes. You are conflating two completely different things now into
> > a single concept.
> 
> Well we need a uniform way for a driver trying to coordinate a p2p dma
> to find and obtain memory from devices that supply it.

Again, you are bringing everything down to your special case of "p2p
memory". That's where you lose me. This looks like a special case to me
and you are making the centre point of your design.

What we need is:

 - On one hand a way to expose device space (whether it's MMIO
registers, memory, something else ...) to the DMA ops so another device
can do standard dma_map_* to/from it. (Not dma_alloc_* those shouldn't
relate to p2p at all, they are intended for a driver own allocation for
the device it manages). This includes the creation of struct pages and
the mechanism to override/adjust the dma_ops etc.... along with all the
PCI specific gunk to figure out if we are on the same bus or not etc.

 - Some kind of generic utility you can use to manage a pool of
"memory" which seems to be what your special devices use or wantf or
use by peer DMA.

>  We are not
> > dealing with GPUs that already have complicated allocators.> 
We are
> dealing with people adding memory to their devices for the _sole_
> purpose of enabling p2p transfers. So having a common allocation setup
> > i
s seen as a benefit to us.

I'm not disagreeing. I'm saying that it is completely orthogonal to the
solving the the DMA peer issue. I'm simply objecting to conflating the
two.

Cheers,
Ben.

> Logan
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16 22:23 ` Benjamin Herrenschmidt
@ 2017-04-18 16:45   ` Jason Gunthorpe
  2017-04-18 17:27     ` Dan Williams
  2017-04-18 18:30     ` Logan Gunthorpe
  0 siblings, 2 replies; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 16:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Dan Williams, Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig,
	Sagi Grimberg, James E.J. Bottomley, Martin K. Petersen,
	Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse

On Mon, Apr 17, 2017 at 08:23:16AM +1000, Benjamin Herrenschmidt wrote:
 
> Thanks :-) There's a reason why I'm insisting on this. We have constant
> requests for this today. We have hacks in the GPU drivers to do it for
> GPUs behind a switch, but those are just that, ad-hoc hacks in the
> drivers. We have similar grossness around the corner with some CAPI
> NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
> to whack nVME devices.

A lot of people feel this way in the RDMA community too. We have had
vendors shipping out of tree code to enable P2P for RDMA with GPU
years and years now. :(

Attempts to get things in mainline have always run into the same sort
of road blocks you've identified in this thread..

FWIW, I read this discussion and it sounds closer to an agreement than
I've ever seen in the past.

>From Ben's comments, I would think that the 'first class' support that
is needed here is simply a function to return the 'struct device'
backing a CPU address range.

This is the minimal required information for the arch or IOMMU code
under the dma ops to figure out the fabric source/dest, compute the
traffic path, determine if P2P is even possible, what translation
hardware is crossed, and what DMA address should be used.

If there is going to be more core support for this stuff I think it
will be under the topic of more robustly describing the fabric to the
core and core helpers to extract data from the description: eg compute
the path, check if the path crosses translation, etc

But that isn't really related to P2P, and is probably better left to
the arch authors to figure out where they need to enhance the existing
topology data..

I think the key agreement to get out of Logan's series is that P2P DMA
means:
 - The BAR will be backed by struct pages
 - Passing the CPU __iomem address of the BAR to the DMA API is
   valid and, long term, dma ops providers are expected to fail
   or return the right DMA address
 - Mapping BAR memory into userspace and back to the kernel via
   get_user_pages works transparently, and with the DMA API above
 - The dma ops provider must be able to tell if source memory is bar
   mapped and recover the pci device backing the mapping.

At least this is what we'd like in RDMA :)

FWIW, RDMA probably wouldn't want to use a p2mem device either, we
already have APIs that map BAR memory to user space, and would like to
keep using them. A 'enable P2P for bar' helper function sounds better
to me.

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 16:45   ` Jason Gunthorpe
@ 2017-04-18 17:27     ` Dan Williams
  2017-04-18 18:00       ` Jason Gunthorpe
  2017-04-18 22:46       ` Benjamin Herrenschmidt
  2017-04-18 18:30     ` Logan Gunthorpe
  1 sibling, 2 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-18 17:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Logan Gunthorpe, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 9:45 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Apr 17, 2017 at 08:23:16AM +1000, Benjamin Herrenschmidt wrote:
>
>> Thanks :-) There's a reason why I'm insisting on this. We have constant
>> requests for this today. We have hacks in the GPU drivers to do it for
>> GPUs behind a switch, but those are just that, ad-hoc hacks in the
>> drivers. We have similar grossness around the corner with some CAPI
>> NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
>> to whack nVME devices.
>
> A lot of people feel this way in the RDMA community too. We have had
> vendors shipping out of tree code to enable P2P for RDMA with GPU
> years and years now. :(
>
> Attempts to get things in mainline have always run into the same sort
> of road blocks you've identified in this thread..
>
> FWIW, I read this discussion and it sounds closer to an agreement than
> I've ever seen in the past.
>
> From Ben's comments, I would think that the 'first class' support that
> is needed here is simply a function to return the 'struct device'
> backing a CPU address range.
>
> This is the minimal required information for the arch or IOMMU code
> under the dma ops to figure out the fabric source/dest, compute the
> traffic path, determine if P2P is even possible, what translation
> hardware is crossed, and what DMA address should be used.
>
> If there is going to be more core support for this stuff I think it
> will be under the topic of more robustly describing the fabric to the
> core and core helpers to extract data from the description: eg compute
> the path, check if the path crosses translation, etc
>
> But that isn't really related to P2P, and is probably better left to
> the arch authors to figure out where they need to enhance the existing
> topology data..
>
> I think the key agreement to get out of Logan's series is that P2P DMA
> means:
>  - The BAR will be backed by struct pages
>  - Passing the CPU __iomem address of the BAR to the DMA API is
>    valid and, long term, dma ops providers are expected to fail
>    or return the right DMA address
>  - Mapping BAR memory into userspace and back to the kernel via
>    get_user_pages works transparently, and with the DMA API above
>  - The dma ops provider must be able to tell if source memory is bar
>    mapped and recover the pci device backing the mapping.
>
> At least this is what we'd like in RDMA :)
>
> FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> already have APIs that map BAR memory to user space, and would like to
> keep using them. A 'enable P2P for bar' helper function sounds better
> to me.

...and I think it's not a helper function as much as asking the bus
provider "can these two device dma to each other". The "helper" is the
dma api redirecting through a software-iommu that handles bus address
translation differently than it would handle host memory dma mapping.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 17:27     ` Dan Williams
@ 2017-04-18 18:00       ` Jason Gunthorpe
  2017-04-18 18:34         ` Dan Williams
  2017-04-19  1:13         ` Benjamin Herrenschmidt
  2017-04-18 22:46       ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 18:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Benjamin Herrenschmidt, Logan Gunthorpe, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 10:27:47AM -0700, Dan Williams wrote:
> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> > already have APIs that map BAR memory to user space, and would like to
> > keep using them. A 'enable P2P for bar' helper function sounds better
> > to me.
> 
> ...and I think it's not a helper function as much as asking the bus
> provider "can these two device dma to each other".

What I mean I could write in a RDMA driver:

  /* Allow the memory in BAR 1 to be the target of P2P transactions */
  pci_enable_p2p_bar(dev, 1);

And not require anything else..

> The "helper" is the dma api redirecting through a software-iommu
> that handles bus address translation differently than it would
> handle host memory dma mapping.

Not sure, until we see what arches actually need to do here it is hard
to design common helpers.

Here are a few obvious things that arches will need to implement to
support this broadly:

- Virtualization might need to do a hypervisor call to get the right
  translation, or consult some hypervisor specific description table.

- Anything using IOMMUs for virtualization will need to setup IOMMU
  permissions to allow the P2P flow, this might require translation to
  an address cookie.

- Fail if the PCI devices are in different domains, or setup hardware to
  do completion bus/device/function translation.

- All platforms can succeed if the PCI devices are under the same
  'segment', but where segments begin is somewhat platform specific
  knowledge. (this is 'same switch' idea Logan has talked about)

So, we can eventually design helpers for various common scenarios, but
until we see what arch code actually needs to do it seems
premature. Much of this seems to involve interaction with some kind of
hardware, or consulation of some kind of currently platform specific
data, so I'm not sure what a software-iommu would be doing??

The main thing to agree on is that this code belongs under dma ops and
that arches have to support struct page mapped BAR addresses in their
dma ops inputs. Is that resonable?

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 16:45   ` Jason Gunthorpe
  2017-04-18 17:27     ` Dan Williams
@ 2017-04-18 18:30     ` Logan Gunthorpe
  2017-04-18 19:01       ` Jason Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18 18:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Benjamin Herrenschmidt
  Cc: Dan Williams, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 18/04/17 10:45 AM, Jason Gunthorpe wrote:
> From Ben's comments, I would think that the 'first class' support that
> is needed here is simply a function to return the 'struct device'
> backing a CPU address range.

Yes, and Dan's get_dev_pagemap suggestion gets us 90% of the way there.
It's just a disagreement as to what struct device is inside the pagemap.
Care needs to be taken to ensure that struct device doesn't conflict
with hmm and doesn't limit other potential future users of ZONE_DEVICE.

> If there is going to be more core support for this stuff I think it
> will be under the topic of more robustly describing the fabric to the
> core and core helpers to extract data from the description: eg compute
> the path, check if the path crosses translation, etc

Agreed, those helpers would be useful to everyone.

> I think the key agreement to get out of Logan's series is that P2P DMA
> means:
>  - The BAR will be backed by struct pages
>  - Passing the CPU __iomem address of the BAR to the DMA API is
>    valid and, long term, dma ops providers are expected to fail
>    or return the right DMA address

Well, yes but we have a _lot_ of work to do to make it safe to pass
around struct pages backed with __iomem. That's where our next focus
will be. I've already taken very initial steps toward this with my
scatterlist map patchset.

>  - Mapping BAR memory into userspace and back to the kernel via
>    get_user_pages works transparently, and with the DMA API above

Again, we've had a lot of push back for the memory to go to userspace at
all. It does work, but people expect userspace to screw it up in a lot
of ways. Among the people pushing back on that: Christoph Hellwig has
specifically said he wants to see this stay with in-kernel users only
until the apis can be worked out. This is one of the reasons we decided
to go with enabling nvme-fabrics as everything remains in the kernel.
And with that decision we needed a common in-kernel allocation
infrastructure: this is what p2pmem really is at this point.

>  - The dma ops provider must be able to tell if source memory is bar
>    mapped and recover the pci device backing the mapping.

Do you mean to say that every dma-ops provider needs to be taught about
p2p backed pages? I was hoping we could have dma_map_* just use special
p2p dma-ops if it was passed p2p pages (though there are some
complications to this too).

> At least this is what we'd like in RDMA :)
> 
> FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> already have APIs that map BAR memory to user space, and would like to
> keep using them. A 'enable P2P for bar' helper function sounds better
> to me.

Well, in the end that will likely come down to just devm_memremap_pages
with some (presently undecided) struct device that can be used to get
special p2p dma-ops for the bus.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 18:00       ` Jason Gunthorpe
@ 2017-04-18 18:34         ` Dan Williams
  2017-04-19  1:13         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-18 18:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Logan Gunthorpe, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 11:00 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Apr 18, 2017 at 10:27:47AM -0700, Dan Williams wrote:
>> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
>> > already have APIs that map BAR memory to user space, and would like to
>> > keep using them. A 'enable P2P for bar' helper function sounds better
>> > to me.
>>
>> ...and I think it's not a helper function as much as asking the bus
>> provider "can these two device dma to each other".
>
> What I mean I could write in a RDMA driver:
>
>   /* Allow the memory in BAR 1 to be the target of P2P transactions */
>   pci_enable_p2p_bar(dev, 1);
>
> And not require anything else..
>
>> The "helper" is the dma api redirecting through a software-iommu
>> that handles bus address translation differently than it would
>> handle host memory dma mapping.
>
> Not sure, until we see what arches actually need to do here it is hard
> to design common helpers.
>
> Here are a few obvious things that arches will need to implement to
> support this broadly:
>
> - Virtualization might need to do a hypervisor call to get the right
>   translation, or consult some hypervisor specific description table.
>
> - Anything using IOMMUs for virtualization will need to setup IOMMU
>   permissions to allow the P2P flow, this might require translation to
>   an address cookie.
>
> - Fail if the PCI devices are in different domains, or setup hardware to
>   do completion bus/device/function translation.
>
> - All platforms can succeed if the PCI devices are under the same
>   'segment', but where segments begin is somewhat platform specific
>   knowledge. (this is 'same switch' idea Logan has talked about)
>
> So, we can eventually design helpers for various common scenarios, but
> until we see what arch code actually needs to do it seems
> premature. Much of this seems to involve interaction with some kind of
> hardware, or consulation of some kind of currently platform specific
> data, so I'm not sure what a software-iommu would be doing??
>
> The main thing to agree on is that this code belongs under dma ops and
> that arches have to support struct page mapped BAR addresses in their
> dma ops inputs. Is that resonable?

I think we're saying the same thing by "software-iommu" and "custom
dma_ops", so yes.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 18:30     ` Logan Gunthorpe
@ 2017-04-18 19:01       ` Jason Gunthorpe
  2017-04-18 19:35         ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 19:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 12:30:59PM -0600, Logan Gunthorpe wrote:

> >  - The dma ops provider must be able to tell if source memory is bar
> >    mapped and recover the pci device backing the mapping.
> 
> Do you mean to say that every dma-ops provider needs to be taught about
> p2p backed pages? I was hoping we could have dma_map_* just use special
> p2p dma-ops if it was passed p2p pages (though there are some
> complications to this too).

I think that is how it will end up working out if this is the path..

Ultimately every dma_ops will need special code to support P2P with
the special hardware that ops is controlling, so it makes some sense
to start by pushing the check down there in the first place. This
advice is partially motivated by how dma_map_sg is just a small
wrapper around the function pointer call...

Something like:

foo_dma_map_sg(...)
{
  for (every page in sg)
      if (page is p2p)
           dma_addr[I] = p2p_same_segment_map_page(...);
}

Where p2p_same_segment_map_page checks if the two devices are on the
'same switch' and if so returns the address translated to match the
bus address programmed into the BAR or fails. We knows this case is
required to work by the PCI spec, so it makes sense to use it as the
first canned helper.

This also proves out the basic idea that the dma ops can recover the
pci device and perform an inspection of the traversed fabric path.

>From there every arch would have to expand the implementation to
support a wider range of things. Eg x86 with no iommu and no offset
could allow every address to be used based on a host bridge white
list.

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 19:01       ` Jason Gunthorpe
@ 2017-04-18 19:35         ` Logan Gunthorpe
  2017-04-18 19:48           ` Dan Williams
  2017-04-18 19:48           ` Jason Gunthorpe
  0 siblings, 2 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18 19:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
> Ultimately every dma_ops will need special code to support P2P with
> the special hardware that ops is controlling, so it makes some sense
> to start by pushing the check down there in the first place. This
> advice is partially motivated by how dma_map_sg is just a small
> wrapper around the function pointer call...

Yes, I noticed this problem too and that makes sense. It just means
every dma_ops will probably need to be modified to either support p2p
pages or fail on them. Though, the only real difficulty there is that it
will be a lot of work.

> Where p2p_same_segment_map_page checks if the two devices are on the
> 'same switch' and if so returns the address translated to match the
> bus address programmed into the BAR or fails. We knows this case is
> required to work by the PCI spec, so it makes sense to use it as the
> first canned helper.

I've also suggested that this check should probably be done (or perhaps
duplicated) before we even get to the map stage. In the case of
nvme-fabrics we'd probably want to let the user know when they try to
configure it or at least fall back to allocating regular memory instead.
It would be a difficult situation to have already copied a block of data
from a NIC to p2p memory only to have it be deemed unmappable on the
NVMe device it's destined for. (Or vice-versa.) This was another issue
p2pmem was attempting to solve.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 19:35         ` Logan Gunthorpe
@ 2017-04-18 19:48           ` Dan Williams
  2017-04-18 20:29             ` Jerome Glisse
  2017-04-18 21:03             ` Jason Gunthorpe
  2017-04-18 19:48           ` Jason Gunthorpe
  1 sibling, 2 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-18 19:48 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
>> Ultimately every dma_ops will need special code to support P2P with
>> the special hardware that ops is controlling, so it makes some sense
>> to start by pushing the check down there in the first place. This
>> advice is partially motivated by how dma_map_sg is just a small
>> wrapper around the function pointer call...
>
> Yes, I noticed this problem too and that makes sense. It just means
> every dma_ops will probably need to be modified to either support p2p
> pages or fail on them. Though, the only real difficulty there is that it
> will be a lot of work.

I don't think you need to go touch all dma_ops, I think you can just
arrange for devices that are going to do dma to get redirected to a
p2p aware provider of operations that overrides the system default
dma_ops. I.e. just touch get_dma_ops().

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 19:35         ` Logan Gunthorpe
  2017-04-18 19:48           ` Dan Williams
@ 2017-04-18 19:48           ` Jason Gunthorpe
  2017-04-18 20:06             ` Logan Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 19:48 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 01:35:32PM -0600, Logan Gunthorpe wrote:

> > Ultimately every dma_ops will need special code to support P2P with
> > the special hardware that ops is controlling, so it makes some sense
> > to start by pushing the check down there in the first place. This
> > advice is partially motivated by how dma_map_sg is just a small
> > wrapper around the function pointer call...
> 
> Yes, I noticed this problem too and that makes sense. It just means
> every dma_ops will probably need to be modified to either support p2p
> pages or fail on them. Though, the only real difficulty there is that it
> will be a lot of work.

I think this is why progress on this keeps getting stuck - every
solution is a lot of work.

> > Where p2p_same_segment_map_page checks if the two devices are on the
> > 'same switch' and if so returns the address translated to match the
> > bus address programmed into the BAR or fails. We knows this case is
> > required to work by the PCI spec, so it makes sense to use it as the
> > first canned helper.
> 
> I've also suggested that this check should probably be done (or perhaps
> duplicated) before we even get to the map stage.

Since the mechanics of the check is essentially unique to every
dma-ops I would not hoist it out of the map function without a really
good reason.

> In the case of nvme-fabrics we'd probably want to let the user know
> when they try to configure it or at least fall back to allocating
> regular memory instead.

You could try to do a dummy mapping / create a MR early on to detect
this.

FWIW, I wonder if from a RDMA perspective we have another
problem.. Should we allow P2P memory to be used with the local DMA
lkey? There are potential designs around virtualization that would not
allow that. Should we mandate that P2P memory be in its own MR?

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 19:48           ` Jason Gunthorpe
@ 2017-04-18 20:06             ` Logan Gunthorpe
  0 siblings, 0 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18 20:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 18/04/17 01:48 PM, Jason Gunthorpe wrote:
> I think this is why progress on this keeps getting stuck - every
> solution is a lot of work.

Yup! There's also a ton of work just to get the iomem safety issues
addressed. Let alone the dma mapping issues.

> You could try to do a dummy mapping / create a MR early on to detect
> this.

Ok, that could be a workable solution.

> FWIW, I wonder if from a RDMA perspective we have another
> problem.. Should we allow P2P memory to be used with the local DMA
> lkey? There are potential designs around virtualization that would not
> allow that. Should we mandate that P2P memory be in its own MR?

I can't say I understand these issues...

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 19:48           ` Dan Williams
@ 2017-04-18 20:29             ` Jerome Glisse
  2017-04-18 20:31               ` Dan Williams
  2017-04-18 21:03             ` Jason Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Jerome Glisse @ 2017-04-18 20:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel

> On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe <logang@deltatee.com>
> wrote:
> >
> >
> > On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
> >> Ultimately every dma_ops will need special code to support P2P with
> >> the special hardware that ops is controlling, so it makes some sense
> >> to start by pushing the check down there in the first place. This
> >> advice is partially motivated by how dma_map_sg is just a small
> >> wrapper around the function pointer call...
> >
> > Yes, I noticed this problem too and that makes sense. It just means
> > every dma_ops will probably need to be modified to either support p2p
> > pages or fail on them. Though, the only real difficulty there is that it
> > will be a lot of work.
> 
> I don't think you need to go touch all dma_ops, I think you can just
> arrange for devices that are going to do dma to get redirected to a
> p2p aware provider of operations that overrides the system default
> dma_ops. I.e. just touch get_dma_ops().

This would not work well for everyone, for instance on GPU we usualy
have buffer object with a mix of device memory and regular system
memory but call dma sg map once for the list.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 20:29             ` Jerome Glisse
@ 2017-04-18 20:31               ` Dan Williams
  2017-04-18 20:48                 ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-18 20:31 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Logan Gunthorpe, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel

On Tue, Apr 18, 2017 at 1:29 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>> On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe <logang@deltatee.com>
>> wrote:
>> >
>> >
>> > On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
>> >> Ultimately every dma_ops will need special code to support P2P with
>> >> the special hardware that ops is controlling, so it makes some sense
>> >> to start by pushing the check down there in the first place. This
>> >> advice is partially motivated by how dma_map_sg is just a small
>> >> wrapper around the function pointer call...
>> >
>> > Yes, I noticed this problem too and that makes sense. It just means
>> > every dma_ops will probably need to be modified to either support p2p
>> > pages or fail on them. Though, the only real difficulty there is that it
>> > will be a lot of work.
>>
>> I don't think you need to go touch all dma_ops, I think you can just
>> arrange for devices that are going to do dma to get redirected to a
>> p2p aware provider of operations that overrides the system default
>> dma_ops. I.e. just touch get_dma_ops().
>
> This would not work well for everyone, for instance on GPU we usualy
> have buffer object with a mix of device memory and regular system
> memory but call dma sg map once for the list.
>

...and that dma_map goes through get_dma_ops(), so I don't see the conflict?

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 20:31               ` Dan Williams
@ 2017-04-18 20:48                 ` Logan Gunthorpe
  2017-04-19  1:17                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18 20:48 UTC (permalink / raw)
  To: Dan Williams, Jerome Glisse
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel



On 18/04/17 02:31 PM, Dan Williams wrote:
> On Tue, Apr 18, 2017 at 1:29 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>>> On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe <logang@deltatee.com>
>>> wrote:
>>>>
>>>>
>>>> On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
>>>>> Ultimately every dma_ops will need special code to support P2P with
>>>>> the special hardware that ops is controlling, so it makes some sense
>>>>> to start by pushing the check down there in the first place. This
>>>>> advice is partially motivated by how dma_map_sg is just a small
>>>>> wrapper around the function pointer call...
>>>>
>>>> Yes, I noticed this problem too and that makes sense. It just means
>>>> every dma_ops will probably need to be modified to either support p2p
>>>> pages or fail on them. Though, the only real difficulty there is that it
>>>> will be a lot of work.
>>>
>>> I don't think you need to go touch all dma_ops, I think you can just
>>> arrange for devices that are going to do dma to get redirected to a
>>> p2p aware provider of operations that overrides the system default
>>> dma_ops. I.e. just touch get_dma_ops().
>>
>> This would not work well for everyone, for instance on GPU we usualy
>> have buffer object with a mix of device memory and regular system
>> memory but call dma sg map once for the list.
>>
> 
> ...and that dma_map goes through get_dma_ops(), so I don't see the conflict?

The main conflict is in dma_map_sg which only does get_dma_ops once but
the sg may contain memory of different types.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 19:48           ` Dan Williams
  2017-04-18 20:29             ` Jerome Glisse
@ 2017-04-18 21:03             ` Jason Gunthorpe
  2017-04-18 21:11               ` Dan Williams
                                 ` (2 more replies)
  1 sibling, 3 replies; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 21:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 12:48:35PM -0700, Dan Williams wrote:

> > Yes, I noticed this problem too and that makes sense. It just means
> > every dma_ops will probably need to be modified to either support p2p
> > pages or fail on them. Though, the only real difficulty there is that it
> > will be a lot of work.
> 
> I don't think you need to go touch all dma_ops, I think you can just
> arrange for devices that are going to do dma to get redirected to a
> p2p aware provider of operations that overrides the system default
> dma_ops. I.e. just touch get_dma_ops().

I don't follow, when does get_dma_ops() return a p2p aware provider?
It has no way to know if the DMA is going to involve p2p, get_dma_ops
is called with the device initiating the DMA.

So you'd always return the P2P shim on a system that has registered
P2P memory?

Even so, how does this shim work? dma_ops are not really intended to
be stacked. How would we make unmap work, for instance? What happens
when the underlying iommu dma ops actually natively understands p2p
and doesn't want the shim?

I think this opens an even bigger can of worms..

Lets find a strategy to safely push this into dma_ops.

What about something more incremental like this instead:
- dma_ops will set map_sg_p2p == map_sg when they are updated to
  support p2p, otherwise DMA on P2P pages will fail for those ops.
- When all ops support p2p we remove the if and ops->map_sg then
  just call map_sg_p2p
- For now the scatterlist maintains a bit when pages are added indicating if
  p2p memory might be present in the list.
- Unmap for p2p and non-p2p is the same, the underlying ops driver has
  to make it work.

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0977317c6835c2..505ed7d502053d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -103,6 +103,9 @@ struct dma_map_ops {
 	int (*map_sg)(struct device *dev, struct scatterlist *sg,
 		      int nents, enum dma_data_direction dir,
 		      unsigned long attrs);
+	int (*map_sg_p2p)(struct device *dev, struct scatterlist *sg,
+			  int nents, enum dma_data_direction dir,
+			  unsigned long attrs);
 	void (*unmap_sg)(struct device *dev,
 			 struct scatterlist *sg, int nents,
 			 enum dma_data_direction dir,
@@ -244,7 +247,15 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	for_each_sg(sg, s, nents, i)
 		kmemcheck_mark_initialized(sg_virt(s), s->length);
 	BUG_ON(!valid_dma_direction(dir));
-	ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
+	if (sg_has_p2p(sg)) {
+		if (ops->map_sg_p2p)
+			ents = ops->map_sg_p2p(dev, sg, nents, dir, attrs);
+		else
+			return 0;
+	} else
+		ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
 	BUG_ON(ents < 0);
 	debug_dma_map_sg(dev, sg, nents, ents, dir);
 

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 21:03             ` Jason Gunthorpe
@ 2017-04-18 21:11               ` Dan Williams
  2017-04-18 21:22                 ` Jason Gunthorpe
  2017-04-18 21:31               ` Logan Gunthorpe
  2017-04-19  1:20               ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-18 21:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 2:03 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Apr 18, 2017 at 12:48:35PM -0700, Dan Williams wrote:
>
>> > Yes, I noticed this problem too and that makes sense. It just means
>> > every dma_ops will probably need to be modified to either support p2p
>> > pages or fail on them. Though, the only real difficulty there is that it
>> > will be a lot of work.
>>
>> I don't think you need to go touch all dma_ops, I think you can just
>> arrange for devices that are going to do dma to get redirected to a
>> p2p aware provider of operations that overrides the system default
>> dma_ops. I.e. just touch get_dma_ops().
>
> I don't follow, when does get_dma_ops() return a p2p aware provider?
> It has no way to know if the DMA is going to involve p2p, get_dma_ops
> is called with the device initiating the DMA.
>
> So you'd always return the P2P shim on a system that has registered
> P2P memory?
>
> Even so, how does this shim work? dma_ops are not really intended to
> be stacked. How would we make unmap work, for instance? What happens
> when the underlying iommu dma ops actually natively understands p2p
> and doesn't want the shim?
>
> I think this opens an even bigger can of worms..

No, I don't think it does. You'd only shim when the target page is
backed by a device, not host memory, and you can figure this out by a
is_zone_device_page()-style lookup.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 21:11               ` Dan Williams
@ 2017-04-18 21:22                 ` Jason Gunthorpe
  2017-04-18 21:36                   ` Dan Williams
  2017-04-19  1:21                   ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 21:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > I think this opens an even bigger can of worms..
> 
> No, I don't think it does. You'd only shim when the target page is
> backed by a device, not host memory, and you can figure this out by a
> is_zone_device_page()-style lookup.

The bigger can of worms is how do you meaningfully stack dma_ops.

What does the p2p provider do when it detects a p2p page?

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 21:03             ` Jason Gunthorpe
  2017-04-18 21:11               ` Dan Williams
@ 2017-04-18 21:31               ` Logan Gunthorpe
  2017-04-18 22:24                 ` Jason Gunthorpe
  2017-04-19  1:20               ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18 21:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Benjamin Herrenschmidt, Bjorn Helgaas, Christoph Hellwig,
	Sagi Grimberg, James E.J. Bottomley, Martin K. Petersen,
	Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse



On 18/04/17 03:03 PM, Jason Gunthorpe wrote:
> What about something more incremental like this instead:
> - dma_ops will set map_sg_p2p == map_sg when they are updated to
>   support p2p, otherwise DMA on P2P pages will fail for those ops.
> - When all ops support p2p we remove the if and ops->map_sg then
>   just call map_sg_p2p
> - For now the scatterlist maintains a bit when pages are added indicating if
>   p2p memory might be present in the list.
> - Unmap for p2p and non-p2p is the same, the underlying ops driver has
>   to make it work.
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 0977317c6835c2..505ed7d502053d 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -103,6 +103,9 @@ struct dma_map_ops {
>  	int (*map_sg)(struct device *dev, struct scatterlist *sg,
>  		      int nents, enum dma_data_direction dir,
>  		      unsigned long attrs);
> +	int (*map_sg_p2p)(struct device *dev, struct scatterlist *sg,
> +			  int nents, enum dma_data_direction dir,
> +			  unsigned long attrs);
>  	void (*unmap_sg)(struct device *dev,
>  			 struct scatterlist *sg, int nents,
>  			 enum dma_data_direction dir,
> @@ -244,7 +247,15 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>  	for_each_sg(sg, s, nents, i)
>  		kmemcheck_mark_initialized(sg_virt(s), s->length);
>  	BUG_ON(!valid_dma_direction(dir));
> -	ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +
> +	if (sg_has_p2p(sg)) {
> +		if (ops->map_sg_p2p)
> +			ents = ops->map_sg_p2p(dev, sg, nents, dir, attrs);
> +		else
> +			return 0;
> +	} else
> +		ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +
>  	BUG_ON(ents < 0);
>  	debug_dma_map_sg(dev, sg, nents, ents, dir);

I could get behind this. Though a couple of points:

1) It means that sg_has_p2p has to walk the entire sg and check every
page. Then map_sg_p2p/map_sg has to walk it again and repeat the check
then do some operation per page. If anyone is concerned about the
dma_map performance this could be an issue.

2) Without knowing exactly what the arch specific code may need to do
it's hard to say that this is exactly the right approach. If every
dma_ops provider has to do exactly this on every page it may lead to a
lot of duplicate code:

foreach_sg page:
   if (pci_page_is_p2p(page)) {
        dma_addr = pci_p2p_map_page(page)
	if (!dma_addr)
		return 0;
        continue
   }
   ...

The only thing I'm presently aware of is the segment check and applying
the offset to the physical address -- neither of which has much to do
with the specific dma_ops providers. It _may_ be that this needs to be
bus specific and not arch specific which I think is what Dan may be
getting at. So it may make sense to just have a pci_map_sg_p2p() which
takes a dma_ops struct it would use for any page that isn't a p2p page.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 21:22                 ` Jason Gunthorpe
@ 2017-04-18 21:36                   ` Dan Williams
  2017-04-18 22:15                     ` Logan Gunthorpe
  2017-04-19  1:21                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-18 21:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 2:22 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
>> > I think this opens an even bigger can of worms..
>>
>> No, I don't think it does. You'd only shim when the target page is
>> backed by a device, not host memory, and you can figure this out by a
>> is_zone_device_page()-style lookup.
>
> The bigger can of worms is how do you meaningfully stack dma_ops.

This goes back to my original comment to make this capability a
function of the pci bridge itself. The kernel has an implementation of
a dynamically created bridge device that injects its own dma_ops for
the devices behind the bridge. See vmd_setup_dma_ops() in
drivers/pci/host/vmd.c.

> What does the p2p provider do when it detects a p2p page?

Check to see if the arch requires this offset translation that Ben
brought up and if not provide the physical address as the patches are
doing now.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 21:36                   ` Dan Williams
@ 2017-04-18 22:15                     ` Logan Gunthorpe
  2017-04-18 22:28                       ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18 22:15 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Bjorn Helgaas, Christoph Hellwig,
	Sagi Grimberg, James E.J. Bottomley, Martin K. Petersen,
	Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse



On 18/04/17 03:36 PM, Dan Williams wrote:
> On Tue, Apr 18, 2017 at 2:22 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
>> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
>>>> I think this opens an even bigger can of worms..
>>>
>>> No, I don't think it does. You'd only shim when the target page is
>>> backed by a device, not host memory, and you can figure this out by a
>>> is_zone_device_page()-style lookup.
>>
>> The bigger can of worms is how do you meaningfully stack dma_ops.
> 
> This goes back to my original comment to make this capability a
> function of the pci bridge itself. The kernel has an implementation of
> a dynamically created bridge device that injects its own dma_ops for
> the devices behind the bridge. See vmd_setup_dma_ops() in
> drivers/pci/host/vmd.c.

Well the issue I think Jason is pointing out is that the ops don't
stack. The map_* function in the injected dma_ops needs to be able to
call the original map_* for any page that is not p2p memory. This is
especially annoying in the map_sg function which may need to call a
different op based on the contents of the sgl. (And please correct me if
I'm not seeing how this can be done in the vmd example.)

Also, what happens if p2p pages end up getting passed to a device that
doesn't have the injected dma_ops?

However, the concept of replacing the dma_ops for all devices behind a
supporting bridge is interesting and may be a good piece of the final
solution.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 21:31               ` Logan Gunthorpe
@ 2017-04-18 22:24                 ` Jason Gunthorpe
  2017-04-18 23:03                   ` Logan Gunthorpe
  2017-04-19  1:23                   ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 22:24 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 03:31:58PM -0600, Logan Gunthorpe wrote:

> 1) It means that sg_has_p2p has to walk the entire sg and check every
> page. Then map_sg_p2p/map_sg has to walk it again and repeat the check
> then do some operation per page. If anyone is concerned about the
> dma_map performance this could be an issue.

dma_map performance is a concern, this is why I suggest this as an
interm solution until all dma_ops are migrated. Ideally sg_has_p2p
would be a fast path that checked some kind of flags bit set during
sg_assign_page... 

This would probably all have to be protected with CONFIG_P2P until it
becomes performance neutral. People without an iommu are not going to
want to walk the sg list at all..

> 2) Without knowing exactly what the arch specific code may need to do
> it's hard to say that this is exactly the right approach. If every
> dma_ops provider has to do exactly this on every page it may lead to a
> lot of duplicate code:

I think someone would have to start to look at it to make a
determination..

I suspect the main server oriented iommu dma op will want to have
proper p2p support anyhow and will probably have their unique control
flow..

> The only thing I'm presently aware of is the segment check and applying
> the offset to the physical address

Well, I called the function p2p_same_segment_map_page() in my last
suggestion for a reason - that is all the helper does.

The intention would be for real iommu drivers to call that helper for
the one simple case and if it fails then use their own routines to
figure out if cross-segment P2P is possible and configure the iommu as
needed.

> bus specific and not arch specific which I think is what Dan may be
> getting at. So it may make sense to just have a pci_map_sg_p2p() which
> takes a dma_ops struct it would use for any page that isn't a p2p page.

Like I keep saying, dma_ops are not really designed to be stacked.

Try and write a stacked map_sg function like you describe and you will
see how horrible it quickly becomes.

Setting up an iommu is very expensive, so we need to batch it for the
entire sg list. Thus a trivial implementation to iterate over all sg
list entries is not desired.

So first a sg list without p2p memory would have to be created, pass
to the lower level ops, then brought back. Remember, the returned sg
list will have a different number of entries than the original. Now
another complex loop is needed to split/merge back in the p2p sg
elements to get a return result.

Finally, we have to undo all of this when doing unmap.

Basically, all this list processing is a huge overhead compared to
just putting a helper call in the existing sg iteration loop of the
actual op.  Particularly if the actual op is a no-op like no-mmu x86
would use.

Since dma mapping is a performance path we must be careful not to
create intrinsic inefficiencies with otherwise nice layering :)

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:15                     ` Logan Gunthorpe
@ 2017-04-18 22:28                       ` Dan Williams
  2017-04-18 22:42                         ` Jason Gunthorpe
  2017-04-18 22:48                         ` Logan Gunthorpe
  0 siblings, 2 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-18 22:28 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 3:15 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 18/04/17 03:36 PM, Dan Williams wrote:
>> On Tue, Apr 18, 2017 at 2:22 PM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:
>>> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
>>>>> I think this opens an even bigger can of worms..
>>>>
>>>> No, I don't think it does. You'd only shim when the target page is
>>>> backed by a device, not host memory, and you can figure this out by a
>>>> is_zone_device_page()-style lookup.
>>>
>>> The bigger can of worms is how do you meaningfully stack dma_ops.
>>
>> This goes back to my original comment to make this capability a
>> function of the pci bridge itself. The kernel has an implementation of
>> a dynamically created bridge device that injects its own dma_ops for
>> the devices behind the bridge. See vmd_setup_dma_ops() in
>> drivers/pci/host/vmd.c.
>
> Well the issue I think Jason is pointing out is that the ops don't
> stack. The map_* function in the injected dma_ops needs to be able to
> call the original map_* for any page that is not p2p memory. This is
> especially annoying in the map_sg function which may need to call a
> different op based on the contents of the sgl. (And please correct me if
> I'm not seeing how this can be done in the vmd example.)

Unlike the pci bus address offset case which I think is fundamental to
support since shipping archs do this today, I think it is ok to say
p2p is restricted to a single sgl that gets to talk to host memory or
a single device. That said, what's wrong with a p2p aware map_sg
implementation calling up to the host memory map_sg implementation on
a per sgl basis?

> Also, what happens if p2p pages end up getting passed to a device that
> doesn't have the injected dma_ops?

This goes back to limiting p2p to a single pci host bridge. If the p2p
capability is coordinated with the bridge rather than between the
individual devices then we have a central point to catch this case.

...of course this is all hand wavy until someone writes the code and
proves otherwise.

> However, the concept of replacing the dma_ops for all devices behind a
> supporting bridge is interesting and may be a good piece of the final
> solution.

It's at least a proof point for injecting special behavior for devices
behind a (virtual) pci bridge without needing to go touch a bunch of
drivers.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:28                       ` Dan Williams
@ 2017-04-18 22:42                         ` Jason Gunthorpe
  2017-04-18 22:51                           ` Dan Williams
  2017-04-18 22:48                         ` Logan Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 22:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 03:28:17PM -0700, Dan Williams wrote:

> Unlike the pci bus address offset case which I think is fundamental to
> support since shipping archs do this toda

But we can support this by modifying those arch's unique dma_ops
directly.

Eg as I explained, my p2p_same_segment_map_page() helper concept would
do the offset adjustment for same-segement DMA.

If PPC calls that in their IOMMU drivers then they will have proper
support for this basic p2p, and the right framework to move on to more
advanced cases of p2p.

This really seems like much less trouble than trying to wrapper all
the arch's dma ops, and doesn't have the wonky restrictions.

> I think it is ok to say p2p is restricted to a single sgl that gets
> to talk to host memory or a single device.

RDMA and GPU would be sad with this restriction...

> That said, what's wrong with a p2p aware map_sg implementation
> calling up to the host memory map_sg implementation on a per sgl
> basis?

Setting up the iommu is fairly expensive, so getting rid of the
batching would kill performance..

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 17:27     ` Dan Williams
  2017-04-18 18:00       ` Jason Gunthorpe
@ 2017-04-18 22:46       ` Benjamin Herrenschmidt
  2017-04-18 22:52         ` Dan Williams
  1 sibling, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-18 22:46 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe
  Cc: Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Tue, 2017-04-18 at 10:27 -0700, Dan Williams wrote:
> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> > already have APIs that map BAR memory to user space, and would like to
> > keep using them. A 'enable P2P for bar' helper function sounds better
> > to me.
> 
> ...and I think it's not a helper function as much as asking the bus
> provider "can these two device dma to each other". The "helper" is the
> dma api redirecting through a software-iommu that handles bus address
> translation differently than it would handle host memory dma mapping.

Do we even need tat function ? The dma_ops have a dma_supported()
call...

If we have those override ops built into the "dma_target" object,
then these things can make that decision knowing both the source
and target device.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:28                       ` Dan Williams
  2017-04-18 22:42                         ` Jason Gunthorpe
@ 2017-04-18 22:48                         ` Logan Gunthorpe
  2017-04-18 22:50                           ` Dan Williams
  1 sibling, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18 22:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 18/04/17 04:28 PM, Dan Williams wrote:
> Unlike the pci bus address offset case which I think is fundamental to
> support since shipping archs do this today, I think it is ok to say
> p2p is restricted to a single sgl that gets to talk to host memory or
> a single device. That said, what's wrong with a p2p aware map_sg
> implementation calling up to the host memory map_sg implementation on
> a per sgl basis?

I think Ben said they need mixed sgls and that is where this gets messy.
I think I'd prefer this too given trying to enforce all sgs in a list to
be one type or another could be quite difficult given the state of the
scatterlist code.

>> Also, what happens if p2p pages end up getting passed to a device that
>> doesn't have the injected dma_ops?
> 
> This goes back to limiting p2p to a single pci host bridge. If the p2p
> capability is coordinated with the bridge rather than between the
> individual devices then we have a central point to catch this case.

Not really relevant. If these pages get to userspace (as people seem
keen on doing) or a less than careful kernel driver they could easily
get into the dma_map calls of devices that aren't even pci related (via
an O_DIRECT operation on an incorrect file or something). The common
code must reject these and can't rely on an injected dma op.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:48                         ` Logan Gunthorpe
@ 2017-04-18 22:50                           ` Dan Williams
  2017-04-18 22:56                             ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-18 22:50 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 3:48 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 18/04/17 04:28 PM, Dan Williams wrote:
>> Unlike the pci bus address offset case which I think is fundamental to
>> support since shipping archs do this today, I think it is ok to say
>> p2p is restricted to a single sgl that gets to talk to host memory or
>> a single device. That said, what's wrong with a p2p aware map_sg
>> implementation calling up to the host memory map_sg implementation on
>> a per sgl basis?
>
> I think Ben said they need mixed sgls and that is where this gets messy.
> I think I'd prefer this too given trying to enforce all sgs in a list to
> be one type or another could be quite difficult given the state of the
> scatterlist code.
>
>>> Also, what happens if p2p pages end up getting passed to a device that
>>> doesn't have the injected dma_ops?
>>
>> This goes back to limiting p2p to a single pci host bridge. If the p2p
>> capability is coordinated with the bridge rather than between the
>> individual devices then we have a central point to catch this case.
>
> Not really relevant. If these pages get to userspace (as people seem
> keen on doing) or a less than careful kernel driver they could easily
> get into the dma_map calls of devices that aren't even pci related (via
> an O_DIRECT operation on an incorrect file or something). The common
> code must reject these and can't rely on an injected dma op.

No, we can't do that at get_user_pages() time, it will always need to
be up to the device driver to fail dma that it can't perform.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:42                         ` Jason Gunthorpe
@ 2017-04-18 22:51                           ` Dan Williams
  2017-04-18 23:21                             ` Jason Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-18 22:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 3:42 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Apr 18, 2017 at 03:28:17PM -0700, Dan Williams wrote:
>
>> Unlike the pci bus address offset case which I think is fundamental to
>> support since shipping archs do this toda
>
> But we can support this by modifying those arch's unique dma_ops
> directly.
>
> Eg as I explained, my p2p_same_segment_map_page() helper concept would
> do the offset adjustment for same-segement DMA.
>
> If PPC calls that in their IOMMU drivers then they will have proper
> support for this basic p2p, and the right framework to move on to more
> advanced cases of p2p.
>
> This really seems like much less trouble than trying to wrapper all
> the arch's dma ops, and doesn't have the wonky restrictions.

I don't think the root bus iommu drivers have any business knowing or
caring about dma happening between devices lower in the hierarchy.

>> I think it is ok to say p2p is restricted to a single sgl that gets
>> to talk to host memory or a single device.
>
> RDMA and GPU would be sad with this restriction...
>
>> That said, what's wrong with a p2p aware map_sg implementation
>> calling up to the host memory map_sg implementation on a per sgl
>> basis?
>
> Setting up the iommu is fairly expensive, so getting rid of the
> batching would kill performance..

When we're crossing device and host memory boundaries how much
batching is possible? As far as I can see you'll always be splitting
the sgl on these dma mapping boundaries.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:46       ` Benjamin Herrenschmidt
@ 2017-04-18 22:52         ` Dan Williams
  0 siblings, 0 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-18 22:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jason Gunthorpe, Logan Gunthorpe, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 3:46 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2017-04-18 at 10:27 -0700, Dan Williams wrote:
>> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
>> > already have APIs that map BAR memory to user space, and would like to
>> > keep using them. A 'enable P2P for bar' helper function sounds better
>> > to me.
>>
>> ...and I think it's not a helper function as much as asking the bus
>> provider "can these two device dma to each other". The "helper" is the
>> dma api redirecting through a software-iommu that handles bus address
>> translation differently than it would handle host memory dma mapping.
>
> Do we even need tat function ? The dma_ops have a dma_supported()
> call...
>
> If we have those override ops built into the "dma_target" object,
> then these things can make that decision knowing both the source
> and target device.
>

Yes.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:50                           ` Dan Williams
@ 2017-04-18 22:56                             ` Logan Gunthorpe
  2017-04-18 23:02                               ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18 22:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 18/04/17 04:50 PM, Dan Williams wrote:
> On Tue, Apr 18, 2017 at 3:48 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>
>> On 18/04/17 04:28 PM, Dan Williams wrote:
>>> Unlike the pci bus address offset case which I think is fundamental to
>>> support since shipping archs do this today, I think it is ok to say
>>> p2p is restricted to a single sgl that gets to talk to host memory or
>>> a single device. That said, what's wrong with a p2p aware map_sg
>>> implementation calling up to the host memory map_sg implementation on
>>> a per sgl basis?
>>
>> I think Ben said they need mixed sgls and that is where this gets messy.
>> I think I'd prefer this too given trying to enforce all sgs in a list to
>> be one type or another could be quite difficult given the state of the
>> scatterlist code.
>>
>>>> Also, what happens if p2p pages end up getting passed to a device that
>>>> doesn't have the injected dma_ops?
>>>
>>> This goes back to limiting p2p to a single pci host bridge. If the p2p
>>> capability is coordinated with the bridge rather than between the
>>> individual devices then we have a central point to catch this case.
>>
>> Not really relevant. If these pages get to userspace (as people seem
>> keen on doing) or a less than careful kernel driver they could easily
>> get into the dma_map calls of devices that aren't even pci related (via
>> an O_DIRECT operation on an incorrect file or something). The common
>> code must reject these and can't rely on an injected dma op.
> 
> No, we can't do that at get_user_pages() time, it will always need to
> be up to the device driver to fail dma that it can't perform.

I'm not sure I follow -- are you agreeing with me? The dma_map_* needs
to fail for any dma it cannot perform. Which means either all dma_ops
providers need to be p2p aware or this logic has to be in dma_map_*
itself. My point being: you can't rely on an injected dma_op for some
devices to handle the fail case globally.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:56                             ` Logan Gunthorpe
@ 2017-04-18 23:02                               ` Dan Williams
  0 siblings, 0 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-18 23:02 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 3:56 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 18/04/17 04:50 PM, Dan Williams wrote:
>> On Tue, Apr 18, 2017 at 3:48 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>>
>>>
>>> On 18/04/17 04:28 PM, Dan Williams wrote:
>>>> Unlike the pci bus address offset case which I think is fundamental to
>>>> support since shipping archs do this today, I think it is ok to say
>>>> p2p is restricted to a single sgl that gets to talk to host memory or
>>>> a single device. That said, what's wrong with a p2p aware map_sg
>>>> implementation calling up to the host memory map_sg implementation on
>>>> a per sgl basis?
>>>
>>> I think Ben said they need mixed sgls and that is where this gets messy.
>>> I think I'd prefer this too given trying to enforce all sgs in a list to
>>> be one type or another could be quite difficult given the state of the
>>> scatterlist code.
>>>
>>>>> Also, what happens if p2p pages end up getting passed to a device that
>>>>> doesn't have the injected dma_ops?
>>>>
>>>> This goes back to limiting p2p to a single pci host bridge. If the p2p
>>>> capability is coordinated with the bridge rather than between the
>>>> individual devices then we have a central point to catch this case.
>>>
>>> Not really relevant. If these pages get to userspace (as people seem
>>> keen on doing) or a less than careful kernel driver they could easily
>>> get into the dma_map calls of devices that aren't even pci related (via
>>> an O_DIRECT operation on an incorrect file or something). The common
>>> code must reject these and can't rely on an injected dma op.
>>
>> No, we can't do that at get_user_pages() time, it will always need to
>> be up to the device driver to fail dma that it can't perform.
>
> I'm not sure I follow -- are you agreeing with me? The dma_map_* needs
> to fail for any dma it cannot perform. Which means either all dma_ops
> providers need to be p2p aware or this logic has to be in dma_map_*
> itself. My point being: you can't rely on an injected dma_op for some
> devices to handle the fail case globally.

Ah, I see what you're saying now. Yes, we do need something that
guarantees any dma mapping implementation that gets a struct page that
it does now know how to translate properly fails the request.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:24                 ` Jason Gunthorpe
@ 2017-04-18 23:03                   ` Logan Gunthorpe
  2017-04-19  1:23                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-18 23:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 18/04/17 04:24 PM, Jason Gunthorpe wrote:
> Try and write a stacked map_sg function like you describe and you will
> see how horrible it quickly becomes.

Yes, unfortunately, I have to agree with this statement completely.

> Since dma mapping is a performance path we must be careful not to
> create intrinsic inefficiencies with otherwise nice layering :)

Yeah, I'm also personally thinking your proposal is the way to go as
well. Dan's injected ops suggestion is interesting but I can't see how
it solves the issue completely. Your proposal is the only one that seems
to be complete to me. It just has a few minor pain points which I've
already described but are likely manageable and less than the pain
stacked dma_ops creates.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:51                           ` Dan Williams
@ 2017-04-18 23:21                             ` Jason Gunthorpe
  2017-04-19  1:25                               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-18 23:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Tue, Apr 18, 2017 at 03:51:27PM -0700, Dan Williams wrote:
> > This really seems like much less trouble than trying to wrapper all
> > the arch's dma ops, and doesn't have the wonky restrictions.
> 
> I don't think the root bus iommu drivers have any business knowing or
> caring about dma happening between devices lower in the hierarchy.

Maybe not, but performance requires some odd choices in this code.. :(

> > Setting up the iommu is fairly expensive, so getting rid of the
> > batching would kill performance..
> 
> When we're crossing device and host memory boundaries how much
> batching is possible? As far as I can see you'll always be splitting
> the sgl on these dma mapping boundaries.

Splitting the sgl is different from iommu batching.

As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
the middle.

The optimum behavior is to allocate a 1MB-4K iommu range and fill it
with the CPU memory. Then return a SGL with three entires, two
pointing into the range and one to the p2p.

It is creating each range which tends to be expensive, so creating two
ranges (or worse, if every SGL created a range it would be 255) is
very undesired.

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 18:00       ` Jason Gunthorpe
  2017-04-18 18:34         ` Dan Williams
@ 2017-04-19  1:13         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-19  1:13 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Tue, 2017-04-18 at 12:00 -0600, Jason Gunthorpe wrote:
> - All platforms can succeed if the PCI devices are under the same
>   'segment', but where segments begin is somewhat platform specific
>   knowledge. (this is 'same switch' idea Logan has talked about)

We also need to be careful whether P2P is enabled in the switch
or not.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 20:48                 ` Logan Gunthorpe
@ 2017-04-19  1:17                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-19  1:17 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams, Jerome Glisse
  Cc: Jason Gunthorpe, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel

On Tue, 2017-04-18 at 14:48 -0600, Logan Gunthorpe wrote:
> > ...and that dma_map goes through get_dma_ops(), so I don't see the conflict?
> 
> The main conflict is in dma_map_sg which only does get_dma_ops once but
> the sg may contain memory of different types.

We can handle that in our "overriden" dma ops.

It's a bit tricky but it *could* break it down into segments and
forward portions back to the original dma ops.

Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 21:03             ` Jason Gunthorpe
  2017-04-18 21:11               ` Dan Williams
  2017-04-18 21:31               ` Logan Gunthorpe
@ 2017-04-19  1:20               ` Benjamin Herrenschmidt
  2017-04-19 15:55                 ` Jason Gunthorpe
  2 siblings, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-19  1:20 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Tue, 2017-04-18 at 15:03 -0600, Jason Gunthorpe wrote:
> I don't follow, when does get_dma_ops() return a p2p aware provider?
> It has no way to know if the DMA is going to involve p2p, get_dma_ops
> is called with the device initiating the DMA.
> 
> So you'd always return the P2P shim on a system that has registered
> P2P memory?
> 
> Even so, how does this shim work? dma_ops are not really intended to
> be stacked. How would we make unmap work, for instance? What happens
> when the underlying iommu dma ops actually natively understands p2p
> and doesn't want the shim?

Good point. We only know on a per-page basis ... ugh.

So we really need to change the arch main dma_ops. I'm not opposed to
that. What we then need to do is have that main arch dma_map_sg,
when it encounters a "device" page, call into a helper attached to
the devmap to handle *that page*, providing sufficient context.

That helper wouldn't perform the actual iommu mapping. It would simply
return something along the lines of:

 - "use that alternate bus address and don't map in the iommu"
 - "use that alternate bus address and do map in the iommu"
 - "proceed as normal"
 - "fail"

What do you think ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 21:22                 ` Jason Gunthorpe
  2017-04-18 21:36                   ` Dan Williams
@ 2017-04-19  1:21                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-19  1:21 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Tue, 2017-04-18 at 15:22 -0600, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > > I think this opens an even bigger can of worms..
> > 
> > No, I don't think it does. You'd only shim when the target page is
> > backed by a device, not host memory, and you can figure this out by
> > a
> > is_zone_device_page()-style lookup.
> 
> The bigger can of worms is how do you meaningfully stack dma_ops.
> 
> What does the p2p provider do when it detects a p2p page?

Yeah I think we don't really want to stack dma_ops... thinking more
about it.

As I just wrote, it looks like we might need a more specialised hook
in the devmap to be used by the main dma_op, on a per-page basis.

Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 22:24                 ` Jason Gunthorpe
  2017-04-18 23:03                   ` Logan Gunthorpe
@ 2017-04-19  1:23                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-19  1:23 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: Dan Williams, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Tue, 2017-04-18 at 16:24 -0600, Jason Gunthorpe wrote:
> Basically, all this list processing is a huge overhead compared to
> just putting a helper call in the existing sg iteration loop of the
> actual op.  Particularly if the actual op is a no-op like no-mmu x86
> would use.

Yes, I'm leaning toward that approach too.

The helper itself could hang off the devmap though.

> Since dma mapping is a performance path we must be careful not to
> create intrinsic inefficiencies with otherwise nice layering :)
> 
> Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-18 23:21                             ` Jason Gunthorpe
@ 2017-04-19  1:25                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-19  1:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Tue, 2017-04-18 at 17:21 -0600, Jason Gunthorpe wrote:
> Splitting the sgl is different from iommu batching.
> 
> As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
> the middle.
> 
> The optimum behavior is to allocate a 1MB-4K iommu range and fill it
> with the CPU memory. Then return a SGL with three entires, two
> pointing into the range and one to the p2p.
> 
> It is creating each range which tends to be expensive, so creating
> two
> ranges (or worse, if every SGL created a range it would be 255) is
> very undesired.

I think it's easier to get us started to just use a helper and
stick it in the existing sglist processing loop of the architecture.

As we noticed, stacking dma_ops is actually non-trivial and opens quite
the can of worms.

As Jerome mentioned, you can end up with IOs ops containing an sglist
that is a collection of memory and GPU pages for example.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19  1:20               ` Benjamin Herrenschmidt
@ 2017-04-19 15:55                 ` Jason Gunthorpe
  2017-04-19 16:48                   ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-19 15:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Dan Williams, Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig,
	Sagi Grimberg, James E.J. Bottomley, Martin K. Petersen,
	Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse

On Wed, Apr 19, 2017 at 11:20:06AM +1000, Benjamin Herrenschmidt wrote:

> That helper wouldn't perform the actual iommu mapping. It would simply
> return something along the lines of:
> 
>  - "use that alternate bus address and don't map in the iommu"

I was thinking only this one would be supported with a core code
helper..

>  - "use that alternate bus address and do map in the iommu"
>  - "proceed as normal"

.. because I don't have an idea how a core code helper could figure
out what the platform needs for the above two ..

I think many of the iommu platforms will need to construct their own
bus address in the iommu, and we already have easy access to the CPU
address.

Presumably once a transcation passes through the iommu it needs to be
using the CPU address for the target bar, otherwise things are going
to be ambiguous..

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 15:55                 ` Jason Gunthorpe
@ 2017-04-19 16:48                   ` Logan Gunthorpe
  2017-04-19 17:01                     ` Dan Williams
  2017-04-19 17:14                     ` Jason Gunthorpe
  0 siblings, 2 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-19 16:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Benjamin Herrenschmidt
  Cc: Dan Williams, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
> I was thinking only this one would be supported with a core code
> helper..

Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
type flag to the dev_pagemap structure which would be very useful to us.
We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
Then, potentially, we could add a dma_map callback to the structure
(possibly unioned with an hmm field). The dev_ops providers would then
just need to do something like this (enclosed in a helper):

if (is_zone_device_page(page)) {
	pgmap = get_dev_pagemap(page_to_pfn(page));
	if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P ||
	    !pgmap->dma_map)
		return 0;
	
	dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
	put_dev_pagemap(pgmap);
	if (!dma_addr)
		return 0;
	...
}

The pci_enable_p2p_bar function would then just need to call
devm_memremap_pages with the dma_map callback set to a function that
does the segment check and the offset calculation.

Thoughts?

@Jerome: my feedback to you would be that your patch assumes all users
of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
useful if it was generic. My suggestion would be to have the caller
allocate the dev_pagemap structure, populate it and pass it into
devm_memremap_pages. Given that pretty much everything in that structure
are already arguments to that function, I feel like this makes sense.
This should also help to unify hmm_devmem_pages_create and
devm_memremap_pages which look very similar to each other.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 16:48                   ` Logan Gunthorpe
@ 2017-04-19 17:01                     ` Dan Williams
  2017-04-19 17:32                       ` Jerome Glisse
  2017-04-19 17:14                     ` Jason Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-19 17:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Wed, Apr 19, 2017 at 9:48 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
>> I was thinking only this one would be supported with a core code
>> helper..
>
> Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
> type flag to the dev_pagemap structure which would be very useful to us.
> We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
> Then, potentially, we could add a dma_map callback to the structure
> (possibly unioned with an hmm field). The dev_ops providers would then
> just need to do something like this (enclosed in a helper):
>
> if (is_zone_device_page(page)) {
>         pgmap = get_dev_pagemap(page_to_pfn(page));
>         if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P ||
>             !pgmap->dma_map)
>                 return 0;
>
>         dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
>         put_dev_pagemap(pgmap);
>         if (!dma_addr)
>                 return 0;
>         ...
> }
>
> The pci_enable_p2p_bar function would then just need to call
> devm_memremap_pages with the dma_map callback set to a function that
> does the segment check and the offset calculation.
>
> Thoughts?
>
> @Jerome: my feedback to you would be that your patch assumes all users
> of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
> useful if it was generic. My suggestion would be to have the caller
> allocate the dev_pagemap structure, populate it and pass it into
> devm_memremap_pages. Given that pretty much everything in that structure
> are already arguments to that function, I feel like this makes sense.
> This should also help to unify hmm_devmem_pages_create and
> devm_memremap_pages which look very similar to each other.

I like that change. Also the types should describe the memory relative
to its relationship to struct page, not whether it is persistent or
not. I would consider volatile and persistent memory that is attached
to the cpu memory controller and i/o coherent as the same type of
memory. DMA incoherent ranges like P2P and HMM should get their own
types.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 16:48                   ` Logan Gunthorpe
  2017-04-19 17:01                     ` Dan Williams
@ 2017-04-19 17:14                     ` Jason Gunthorpe
  2017-04-19 18:01                       ` Logan Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-19 17:14 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Wed, Apr 19, 2017 at 10:48:51AM -0600, Logan Gunthorpe wrote:
> The pci_enable_p2p_bar function would then just need to call
> devm_memremap_pages with the dma_map callback set to a function that
> does the segment check and the offset calculation.

I don't see a use for the dma_map function pointer at this point..

It doesn't make alot of sense for the completor of the DMA to provide
a mapping op, the mapping process is *path* specific, not specific to
a completer/initiator.

So, I would suggest more like this:

static inline struct device *get_p2p_src(struct page *page)
{
        struct device *res;
	struct dev_pagemap *pgmap;

	if (!is_zone_device_page(page))
	     return NULL;
	
        pgmap = get_dev_pagemap(page_to_pfn(page), NULL);
        if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P)
	        /* For now ZONE_DEVICE memory that is not P2P is
 		   assumed to be configured for DMA the same as CPU
		   memory. */
                return ERR_PTR(-EINVAL);
	res = pgmap->dev;
	device_get(res);
	put_dev_pagemap(pgmap);
	return res;
}

dma_addr_t pci_p2p_same_segment(struct device *initator,
                                struct device *completer,
				struct page *page)
{
   if (! PCI initiator & completer)
       return ERROR;
   if (!same segment initiator & completer)
       return ERROR;

   // Translate page directly to the value programmed into the BAR
   return (Completer's PCI BAR base address) + (offset of page within BAR);
}

// dma_sg_map

for (each sgl) {
    struct page *page = sg_page(s);
    struct device *p2p_src = get_p2p_src(page);

    if (IS_ERR(p2p_src))
        // fail dma_sg

    if (p2p_src) {
        bool needs_iommu = false;

        pa = pci_p2p_same_segment(dev, p2p_src, page);
	if (pa == ERROR)
	    pa = arch_p2p_cross_segment(dev, p2psrc, page, &needs_iommui);

        device_put(p2p_src);

        if (pa == ERROR)
	    // fail

	if (!needs_iommu) {      
	    // Insert PA directly into the result SGL
	    sg++;
	    continue;
	}
    }
    else
        // CPU memory
        pa = page_to_phys(page);


To me it looks like the code duplication across the iommu stuff comes
from just duplicating the basic iommu algorithm in every driver.

To clean that up I think someone would need to hoist the overall sgl
loop and use more ops callbacks eg allocate_iommu_range,
assign_page_to_rage, dealloc_range, etc. This is a problem p2p makes
worse, but isn't directly causing :\

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 17:01                     ` Dan Williams
@ 2017-04-19 17:32                       ` Jerome Glisse
  2017-04-19 17:41                         ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: Jerome Glisse @ 2017-04-19 17:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel

On Wed, Apr 19, 2017 at 10:01:23AM -0700, Dan Williams wrote:
> On Wed, Apr 19, 2017 at 9:48 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
> >
> >
> > On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
> >> I was thinking only this one would be supported with a core code
> >> helper..
> >
> > Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
> > type flag to the dev_pagemap structure which would be very useful to us.
> > We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
> > Then, potentially, we could add a dma_map callback to the structure
> > (possibly unioned with an hmm field). The dev_ops providers would then
> > just need to do something like this (enclosed in a helper):
> >
> > if (is_zone_device_page(page)) {
> >         pgmap = get_dev_pagemap(page_to_pfn(page));
> >         if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P ||
> >             !pgmap->dma_map)
> >                 return 0;
> >
> >         dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
> >         put_dev_pagemap(pgmap);
> >         if (!dma_addr)
> >                 return 0;
> >         ...
> > }
> >
> > The pci_enable_p2p_bar function would then just need to call
> > devm_memremap_pages with the dma_map callback set to a function that
> > does the segment check and the offset calculation.
> >
> > Thoughts?
> >
> > @Jerome: my feedback to you would be that your patch assumes all users
> > of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
> > useful if it was generic. My suggestion would be to have the caller
> > allocate the dev_pagemap structure, populate it and pass it into
> > devm_memremap_pages. Given that pretty much everything in that structure
> > are already arguments to that function, I feel like this makes sense.
> > This should also help to unify hmm_devmem_pages_create and
> > devm_memremap_pages which look very similar to each other.
> 
> I like that change. Also the types should describe the memory relative
> to its relationship to struct page, not whether it is persistent or
> not. I would consider volatile and persistent memory that is attached
> to the cpu memory controller and i/o coherent as the same type of
> memory. DMA incoherent ranges like P2P and HMM should get their own
> types.

Dan you asked me to not use devm_memremap_pages() because you didn't
want to have HMM memory in the pgmap_radix, did you change opinion
on that ? :)

Note i won't make any change now on that front but if it make sense
i am happy to do it as separate patchset on top of HMM.

Also i don't want p2pmem to be an exclusive or with HMM, we will want
GPU to do peer to peer DMA and thus HMM ZONE_DEVICE pages to support
this too.

I do believe it is easier to special case in ZONE_DEVICE into existing
dma_ops for each architecture. For x86 i think there is only 3 different
set of dma_ops to modify. For other arch i guess there is even less.

But in all the case i think p2pmem should stay out of memory management
business. If some set of device do not have memory management it is
better to propose helper to do that as part of the subsystem to which
those devices belong. Just wanted to reiterate that point.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 17:32                       ` Jerome Glisse
@ 2017-04-19 17:41                         ` Dan Williams
  2017-04-19 18:11                           ` Logan Gunthorpe
  2017-04-20 20:43                           ` Stephen  Bates
  0 siblings, 2 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-19 17:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Logan Gunthorpe, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel

On Wed, Apr 19, 2017 at 10:32 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Wed, Apr 19, 2017 at 10:01:23AM -0700, Dan Williams wrote:
>> On Wed, Apr 19, 2017 at 9:48 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>> >
>> >
>> > On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
>> >> I was thinking only this one would be supported with a core code
>> >> helper..
>> >
>> > Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
>> > type flag to the dev_pagemap structure which would be very useful to us.
>> > We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
>> > Then, potentially, we could add a dma_map callback to the structure
>> > (possibly unioned with an hmm field). The dev_ops providers would then
>> > just need to do something like this (enclosed in a helper):
>> >
>> > if (is_zone_device_page(page)) {
>> >         pgmap = get_dev_pagemap(page_to_pfn(page));
>> >         if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P ||
>> >             !pgmap->dma_map)
>> >                 return 0;
>> >
>> >         dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
>> >         put_dev_pagemap(pgmap);
>> >         if (!dma_addr)
>> >                 return 0;
>> >         ...
>> > }
>> >
>> > The pci_enable_p2p_bar function would then just need to call
>> > devm_memremap_pages with the dma_map callback set to a function that
>> > does the segment check and the offset calculation.
>> >
>> > Thoughts?
>> >
>> > @Jerome: my feedback to you would be that your patch assumes all users
>> > of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
>> > useful if it was generic. My suggestion would be to have the caller
>> > allocate the dev_pagemap structure, populate it and pass it into
>> > devm_memremap_pages. Given that pretty much everything in that structure
>> > are already arguments to that function, I feel like this makes sense.
>> > This should also help to unify hmm_devmem_pages_create and
>> > devm_memremap_pages which look very similar to each other.
>>
>> I like that change. Also the types should describe the memory relative
>> to its relationship to struct page, not whether it is persistent or
>> not. I would consider volatile and persistent memory that is attached
>> to the cpu memory controller and i/o coherent as the same type of
>> memory. DMA incoherent ranges like P2P and HMM should get their own
>> types.
>
> Dan you asked me to not use devm_memremap_pages() because you didn't
> want to have HMM memory in the pgmap_radix, did you change opinion
> on that ? :)

No, not quite ;-). I still don't think we should require the non-HMM
to pass NULL for all the HMM arguments. What I like about Logan's
proposal is to have a separate create and register steps dev_pagemap.
That way call paths that don't care about HMM specifics can just turn
around and register the vanilla dev_pagemap.

> Note i won't make any change now on that front but if it make sense
> i am happy to do it as separate patchset on top of HMM.
>
> Also i don't want p2pmem to be an exclusive or with HMM, we will want
> GPU to do peer to peer DMA and thus HMM ZONE_DEVICE pages to support
> this too.

Yes, this makes sense I think we really just want to distinguish host
memory or not in terms of the dev_pagemap type.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 17:14                     ` Jason Gunthorpe
@ 2017-04-19 18:01                       ` Logan Gunthorpe
  2017-04-19 18:32                         ` Jason Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-19 18:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 19/04/17 11:14 AM, Jason Gunthorpe wrote:
> I don't see a use for the dma_map function pointer at this point..

Yes, it is kind of like designing for the future. I just find it a
little odd calling the pci functions in the iommu.


> It doesn't make alot of sense for the completor of the DMA to provide
> a mapping op, the mapping process is *path* specific, not specific to
> a completer/initiator.

I'm just spit balling here but if HMM wanted to use unaddressable memory
as a DMA target, it could set that function to create a window ine gpu
memory, then call the pci_p2p_same_segment and return the result as the
dma address.


> dma_addr_t pci_p2p_same_segment(struct device *initator,
>                                 struct device *completer,
> 				struct page *page)

I'm not sure I like the name pci_p2p_same_segment. It reads as though
it's only checking if the devices are not the same segment. It also may
be that, in the future, it supports devices on different segments. I'd
call it more like pci_p2p_dma_map.


> for (each sgl) {

Thanks, this code fleshes things out nicely


> To me it looks like the code duplication across the iommu stuff comes
> from just duplicating the basic iommu algorithm in every driver.

Yes, this is true.

> To clean that up I think someone would need to hoist the overall sgl
> loop and use more ops callbacks eg allocate_iommu_range,
> assign_page_to_rage, dealloc_range, etc. This is a problem p2p makes
> worse, but isn't directly causing :\

Yup.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 17:41                         ` Dan Williams
@ 2017-04-19 18:11                           ` Logan Gunthorpe
  2017-04-19 18:19                             ` Logan Gunthorpe
  2017-04-20 20:43                           ` Stephen  Bates
  1 sibling, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-19 18:11 UTC (permalink / raw)
  To: Dan Williams, Jerome Glisse
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel



On 19/04/17 11:41 AM, Dan Williams wrote:
> No, not quite ;-). I still don't think we should require the non-HMM
> to pass NULL for all the HMM arguments. What I like about Logan's
> proposal is to have a separate create and register steps dev_pagemap.
> That way call paths that don't care about HMM specifics can just turn
> around and register the vanilla dev_pagemap.

Would you necessarily even need a create step? I was thinking more along
the lines that struct dev_pagemap _could_ just be a member in another
structure. The caller would set the attributes they needed and pass it
to devm_memremap. (Similar to how we commonly do things with struct
device, et al). Potentially, that could also get rid of the need for the
*data pointer HMM is using to get back the struct hmm_devmem seeing
container_of could be used instead.

>> Note i won't make any change now on that front but if it make sense
>> i am happy to do it as separate patchset on top of HMM.
>>
>> Also i don't want p2pmem to be an exclusive or with HMM, we will want
>> GPU to do peer to peer DMA and thus HMM ZONE_DEVICE pages to support
>> this too.
> 
> Yes, this makes sense I think we really just want to distinguish host
> memory or not in terms of the dev_pagemap type.

That depends though. If unaddressable memory needs different steps to
get to the DMA address (ie if it has to setup a gpu window) then we may
still need a way to distinguish the two types of non-host memory.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 18:11                           ` Logan Gunthorpe
@ 2017-04-19 18:19                             ` Logan Gunthorpe
  2017-04-19 18:30                               ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-19 18:19 UTC (permalink / raw)
  To: Dan Williams, Jerome Glisse
  Cc: Jens Axboe, Keith Busch, James E.J. Bottomley,
	Martin K. Petersen, linux-rdma, Benjamin Herrenschmidt,
	Steve Wise, linux-kernel, linux-nvme, Jason Gunthorpe,
	Bjorn Helgaas, linux-pci, linux-nvdimm, Max Gurtovoy, linux-scsi,
	Christoph Hellwig



On 19/04/17 12:11 PM, Logan Gunthorpe wrote:
> 
> 
> On 19/04/17 11:41 AM, Dan Williams wrote:
>> No, not quite ;-). I still don't think we should require the non-HMM
>> to pass NULL for all the HMM arguments. What I like about Logan's
>> proposal is to have a separate create and register steps dev_pagemap.
>> That way call paths that don't care about HMM specifics can just turn
>> around and register the vanilla dev_pagemap.
> 
> Would you necessarily even need a create step? I was thinking more along
> the lines that struct dev_pagemap _could_ just be a member in another
> structure. The caller would set the attributes they needed and pass it
> to devm_memremap. (Similar to how we commonly do things with struct
> device, et al). Potentially, that could also get rid of the need for the
> *data pointer HMM is using to get back the struct hmm_devmem seeing
> container_of could be used instead.

Also, now that I've thought about it a little more, it _may_ be that
many or all of the hmm specific fields in dev_pagemap could move to a
containing struct too...

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 18:19                             ` Logan Gunthorpe
@ 2017-04-19 18:30                               ` Dan Williams
  2017-04-19 18:41                                 ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-19 18:30 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jerome Glisse, Jens Axboe, Keith Busch, James E.J. Bottomley,
	Martin K. Petersen, linux-rdma, Benjamin Herrenschmidt,
	Steve Wise, linux-kernel, linux-nvme, Jason Gunthorpe,
	Bjorn Helgaas, linux-pci, linux-nvdimm, Max Gurtovoy, linux-scsi,
	Christoph Hellwig

On Wed, Apr 19, 2017 at 11:19 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 19/04/17 12:11 PM, Logan Gunthorpe wrote:
>>
>>
>> On 19/04/17 11:41 AM, Dan Williams wrote:
>>> No, not quite ;-). I still don't think we should require the non-HMM
>>> to pass NULL for all the HMM arguments. What I like about Logan's
>>> proposal is to have a separate create and register steps dev_pagemap.
>>> That way call paths that don't care about HMM specifics can just turn
>>> around and register the vanilla dev_pagemap.
>>
>> Would you necessarily even need a create step? I was thinking more along
>> the lines that struct dev_pagemap _could_ just be a member in another
>> structure. The caller would set the attributes they needed and pass it
>> to devm_memremap. (Similar to how we commonly do things with struct
>> device, et al). Potentially, that could also get rid of the need for the
>> *data pointer HMM is using to get back the struct hmm_devmem seeing
>> container_of could be used instead.
>
> Also, now that I've thought about it a little more, it _may_ be that
> many or all of the hmm specific fields in dev_pagemap could move to a
> containing struct too...

Yes, that's already how we handle struct page_map, it's an internal
implementation detail of devm_memremap_pages().

Letting others users do the container_of() arrangement means that
struct page_map needs to become public and move into struct
dev_pagemap directly.

...I think that encapsulation loss is worth it for the gain of clearly
separating the HMM-case from the base case.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 18:01                       ` Logan Gunthorpe
@ 2017-04-19 18:32                         ` Jason Gunthorpe
  2017-04-19 19:02                           ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-19 18:32 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:

> I'm just spit balling here but if HMM wanted to use unaddressable memory
> as a DMA target, it could set that function to create a window ine gpu
> memory, then call the pci_p2p_same_segment and return the result as the
> dma address.

Not entirely, it would have to call through the whole process
including the arch_p2p_cross_segment()..

Maybe we can start down the road of using ops for more iommu steps
with something like this as the helper:

dma_addr_t dma_map_pa(struct device *initiator, struct page *page,
                      void *data)
{
         struct device *completer = get_p2p_completer(page);
	 dma_addr_t pa;

         if (IS_ERR(completer))
	       return SYSTEM_MEMORY;
	       // Or maybe ?
	       return init_ops->dma_map_pa(..);

	 // Try the generic method
	 pa = pci_p2p_same_segment(dev, p2p_src, page);
	 if (pa != ERROR)
	       goto out;

         // Try the arch specific helper
	 const struct dma_map_ops *comp_ops = get_dma_ops(completer);
	 const struct dma_map_ops *init_ops = get_dma_ops(initiator);

         /* FUTURE: Let something translate a HMM page into a DMA'ble
   	    page, eg by mapping it into a GPU window. Maybe this
	    callback lives in devmap ? */
	 page = comp_ops->translate_dma_page(completer, page);

         /* New dma_map_op is the same as arch_p2p_cross_segment in
	    prior version. Any arch specific data needed to program
	    the iommu flows through data */
	 pa = init_ops->p2p_cross_segment_map(completer, inititator, page, data);

out:
	 device_put(completer);
	 return pa;
}

// map_sg op:
for (each sgl) {
    struct page *page = sg_page(s);
    struct arch_iommu_data data = {}; // pass through to ops->p2p_cross_segment
    dma_addr_t pa;

    pa = dma_map_pa(dev, page, &data)
    if (pa == ERROR)
        // fail

    if (!data.needs_iommu) {
        // Insert PA directly into the result SGL
        sg++;
        continue;
    }

    // Else pa & data describe how to setup the iommu
}

> > dma_addr_t pci_p2p_same_segment(struct device *initator,
> >                                 struct device *completer,
> > 				struct page *page)
> 
> I'm not sure I like the name pci_p2p_same_segment. It reads as though
> it's only checking if the devices are not the same segment.

Well, that is exactly what it is doing. If it succeeds then the caller
knows the DMA will not flow outside the segment and no iommu setup/etc
is required.

That function cannot be expanded to include generic cross-segment
traffic, a new function would be needed..

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 18:30                               ` Dan Williams
@ 2017-04-19 18:41                                 ` Logan Gunthorpe
  2017-04-19 18:44                                   ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-19 18:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jerome Glisse, Jens Axboe, Keith Busch, James E.J. Bottomley,
	Martin K. Petersen, linux-rdma, Benjamin Herrenschmidt,
	Steve Wise, linux-kernel, linux-nvme, Jason Gunthorpe,
	Bjorn Helgaas, linux-pci, linux-nvdimm, Max Gurtovoy, linux-scsi,
	Christoph Hellwig



On 19/04/17 12:30 PM, Dan Williams wrote:
> Letting others users do the container_of() arrangement means that
> struct page_map needs to become public and move into struct
> dev_pagemap directly.

Ah, yes, I got a bit turned around by that and failed to notice that
page_map and dev_pagemap are different. Why is it that dev_pagemap
contains pretty much the exact same information as page_map? The only
thing gained that I can see is that the struct resource gains const
protection...

> ...I think that encapsulation loss is worth it for the gain of clearly
> separating the HMM-case from the base case.

Agreed.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 18:41                                 ` Logan Gunthorpe
@ 2017-04-19 18:44                                   ` Dan Williams
  0 siblings, 0 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-19 18:44 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jerome Glisse, Jens Axboe, Keith Busch, James E.J. Bottomley,
	Martin K. Petersen, linux-rdma, Benjamin Herrenschmidt,
	Steve Wise, linux-kernel, linux-nvme, Jason Gunthorpe,
	Bjorn Helgaas, linux-pci, linux-nvdimm, Max Gurtovoy, linux-scsi,
	Christoph Hellwig

On Wed, Apr 19, 2017 at 11:41 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 19/04/17 12:30 PM, Dan Williams wrote:
>> Letting others users do the container_of() arrangement means that
>> struct page_map needs to become public and move into struct
>> dev_pagemap directly.
>
> Ah, yes, I got a bit turned around by that and failed to notice that
> page_map and dev_pagemap are different. Why is it that dev_pagemap
> contains pretty much the exact same information as page_map? The only
> thing gained that I can see is that the struct resource gains const
> protection...
>
>> ...I think that encapsulation loss is worth it for the gain of clearly
>> separating the HMM-case from the base case.
>
> Agreed.
>

Yeah, I forgot that dev_pagemap grew those fields, so we don't have
any real encapsulation today. So this would just be a pure cleanup to
kill struct page_map.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 18:32                         ` Jason Gunthorpe
@ 2017-04-19 19:02                           ` Logan Gunthorpe
  2017-04-19 19:31                             ` Jason Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-19 19:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 19/04/17 12:32 PM, Jason Gunthorpe wrote:
> On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:
> Not entirely, it would have to call through the whole process
> including the arch_p2p_cross_segment()..

Hmm, yes. Though it's still not clear what, if anything,
arch_p2p_cross_segment would be doing. In my experience, if you are
going between host bridges, the CPU address (or PCI address -- I'm not
sure which seeing they are the same on my system) would still work fine
-- it just _may_ be a bad idea because of performance.

Similarly if you are crossing via a QPI bus or similar, I'd expect the
CPU address to work fine. But here the performance is even less likely
to be any good.


>          // Try the arch specific helper
> 	 const struct dma_map_ops *comp_ops = get_dma_ops(completer);
> 	 const struct dma_map_ops *init_ops = get_dma_ops(initiator);

So, in this case, what device does the completer point to? The PCI
device or a more specific GPU device? If it's the former, who's
responsible for setting the new dma_ops? Typically the dma_ops are arch
specific but now you'd be adding ones that are tied to hmm or the gpu.

>> I'm not sure I like the name pci_p2p_same_segment. It reads as though
>> it's only checking if the devices are not the same segment.
> 
> Well, that is exactly what it is doing. If it succeeds then the caller
> knows the DMA will not flow outside the segment and no iommu setup/etc
> is required.

It appears to me like it's calculating the DMA address, and the check is
just a side requirement. It reads as though it's only doing the check.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 19:02                           ` Logan Gunthorpe
@ 2017-04-19 19:31                             ` Jason Gunthorpe
  2017-04-19 19:41                               ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-19 19:31 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Wed, Apr 19, 2017 at 01:02:49PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 19/04/17 12:32 PM, Jason Gunthorpe wrote:
> > On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:
> > Not entirely, it would have to call through the whole process
> > including the arch_p2p_cross_segment()..
> 
> Hmm, yes. Though it's still not clear what, if anything,
> arch_p2p_cross_segment would be doing.

Sets up the iommu for arches that place a iommu between the pci root
port and other pci root ports.

> In my experience, if you are going between host bridges, the CPU
> address (or PCI address -- I'm not sure which seeing they are the
> same on my system) would still work fine

Try it with VT-D turned on. It shouldn't work or there is a notable
security hole in your platform..

> > 	 const struct dma_map_ops *comp_ops = get_dma_ops(completer);
> > 	 const struct dma_map_ops *init_ops = get_dma_ops(initiator);
>
> So, in this case, what device does the completer point to? The PCI
> device or a more specific GPU device? If it's the former, who's
> responsible for setting the new dma_ops? Typically the dma_ops are arch
> specific but now you'd be adding ones that are tied to hmm or the gpu.

Donno, that is for GPU folks to figure out :)

But.. it could point to a GPU and the GPU struct device could have a
proxy dma_ops like Dan pointed out.

> >> I'm not sure I like the name pci_p2p_same_segment. It reads as though
> >> it's only checking if the devices are not the same segment.
> > 
> > Well, that is exactly what it is doing. If it succeeds then the caller
> > knows the DMA will not flow outside the segment and no iommu setup/etc
> > is required.
> 
> It appears to me like it's calculating the DMA address, and the check is
> just a side requirement. It reads as though it's only doing the check.

pci_p2p_same_segment_get_pa() then?

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 19:31                             ` Jason Gunthorpe
@ 2017-04-19 19:41                               ` Logan Gunthorpe
  2017-04-19 20:48                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-19 19:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 19/04/17 01:31 PM, Jason Gunthorpe wrote:
> Try it with VT-D turned on. It shouldn't work or there is a notable
> security hole in your platform..

Ah, ok.

>>> 	 const struct dma_map_ops *comp_ops = get_dma_ops(completer);
>>> 	 const struct dma_map_ops *init_ops = get_dma_ops(initiator);
>>
>> So, in this case, what device does the completer point to? The PCI
>> device or a more specific GPU device? If it's the former, who's
>> responsible for setting the new dma_ops? Typically the dma_ops are arch
>> specific but now you'd be adding ones that are tied to hmm or the gpu.
> 
> Donno, that is for GPU folks to figure out :)
> 
> But.. it could point to a GPU and the GPU struct device could have a
> proxy dma_ops like Dan pointed out.

Seems a bit awkward to me that in order for the intended use case, you
have to proxy the dma_ops. I'd probably still suggest throwing a couple
ops for things like this in the dev_pagemap.

>> It appears to me like it's calculating the DMA address, and the check is
>> just a side requirement. It reads as though it's only doing the check.
> 
> pci_p2p_same_segment_get_pa() then?

Ok, I think that's a bit clearer.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 19:41                               ` Logan Gunthorpe
@ 2017-04-19 20:48                                 ` Jason Gunthorpe
  2017-04-19 22:55                                   ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-19 20:48 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Wed, Apr 19, 2017 at 01:41:49PM -0600, Logan Gunthorpe wrote:

> > But.. it could point to a GPU and the GPU struct device could have a
> > proxy dma_ops like Dan pointed out.
> 
> Seems a bit awkward to me that in order for the intended use case, you
> have to proxy the dma_ops. I'd probably still suggest throwing a couple
> ops for things like this in the dev_pagemap.

Another option is adding a new 'struct completer_dma_ops *' to struct
device for this use case.

Seems like a waste to expand dev_pagemap when we only need a unique
value per struct device?

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 20:48                                 ` Jason Gunthorpe
@ 2017-04-19 22:55                                   ` Logan Gunthorpe
  2017-04-20  0:07                                     ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-19 22:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Dan Williams, Bjorn Helgaas,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse



On 19/04/17 02:48 PM, Jason Gunthorpe wrote:
> On Wed, Apr 19, 2017 at 01:41:49PM -0600, Logan Gunthorpe wrote:
> 
>>> But.. it could point to a GPU and the GPU struct device could have a
>>> proxy dma_ops like Dan pointed out.
>>
>> Seems a bit awkward to me that in order for the intended use case, you
>> have to proxy the dma_ops. I'd probably still suggest throwing a couple
>> ops for things like this in the dev_pagemap.
> 
> Another option is adding a new 'struct completer_dma_ops *' to struct
> device for this use case.
> 
> Seems like a waste to expand dev_pagemap when we only need a unique
> value per struct device?

I feel like expanding dev_pagemap has a much lower impact than expanding
struct device... dev_pagemap is only one instance per zone device region
so expanding it shouldn't be a huge issue. Expanding struct device means
every device struct in the system gets bigger.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 22:55                                   ` Logan Gunthorpe
@ 2017-04-20  0:07                                     ` Dan Williams
  0 siblings, 0 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-20  0:07 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Jens Axboe, James E.J. Bottomley,
	Martin K. Petersen, linux-rdma, Benjamin Herrenschmidt,
	Steve Wise, linux-kernel, linux-nvme, Keith Busch, Jerome Glisse,
	Bjorn Helgaas, linux-pci, linux-nvdimm, Max Gurtovoy, linux-scsi,
	Christoph Hellwig

On Wed, Apr 19, 2017 at 3:55 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 19/04/17 02:48 PM, Jason Gunthorpe wrote:
>> On Wed, Apr 19, 2017 at 01:41:49PM -0600, Logan Gunthorpe wrote:
>>
>>>> But.. it could point to a GPU and the GPU struct device could have a
>>>> proxy dma_ops like Dan pointed out.
>>>
>>> Seems a bit awkward to me that in order for the intended use case, you
>>> have to proxy the dma_ops. I'd probably still suggest throwing a couple
>>> ops for things like this in the dev_pagemap.
>>
>> Another option is adding a new 'struct completer_dma_ops *' to struct
>> device for this use case.
>>
>> Seems like a waste to expand dev_pagemap when we only need a unique
>> value per struct device?
>
> I feel like expanding dev_pagemap has a much lower impact than expanding
> struct device... dev_pagemap is only one instance per zone device region
> so expanding it shouldn't be a huge issue. Expanding struct device means
> every device struct in the system gets bigger.

Especially since we expect a very small subset of devices will ever support p2p.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-19 17:41                         ` Dan Williams
  2017-04-19 18:11                           ` Logan Gunthorpe
@ 2017-04-20 20:43                           ` Stephen  Bates
  2017-04-20 20:47                             ` Dan Williams
  1 sibling, 1 reply; 105+ messages in thread
From: Stephen  Bates @ 2017-04-20 20:43 UTC (permalink / raw)
  To: Dan Williams, Jerome Glisse
  Cc: Logan Gunthorpe, Jason Gunthorpe, Benjamin Herrenschmidt,
	Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel


> Yes, this makes sense I think we really just want to distinguish host
> memory or not in terms of the dev_pagemap type.

I would like to see mutually exclusive flags for host memory (or not) and persistence (or not).

Stephen

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-20 20:43                           ` Stephen  Bates
@ 2017-04-20 20:47                             ` Dan Williams
  2017-04-20 23:07                               ` Stephen  Bates
  0 siblings, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-20 20:47 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jerome Glisse, Logan Gunthorpe, Jason Gunthorpe,
	Benjamin Herrenschmidt, Bjorn Helgaas, Christoph Hellwig,
	Sagi Grimberg, James E.J. Bottomley, Martin K. Petersen,
	Jens Axboe, Steve Wise, Max Gurtovoy, Keith Busch, linux-pci,
	linux-scsi, linux-nvme, linux-rdma, linux-nvdimm, linux-kernel

On Thu, Apr 20, 2017 at 1:43 PM, Stephen  Bates <sbates@raithlin.com> wrote:
>
>> Yes, this makes sense I think we really just want to distinguish host
>> memory or not in terms of the dev_pagemap type.
>
> I would like to see mutually exclusive flags for host memory (or not) and persistence (or not).
>

Why persistence? It has zero meaning to the mm.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-20 20:47                             ` Dan Williams
@ 2017-04-20 23:07                               ` Stephen  Bates
  2017-04-21  4:59                                 ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: Stephen  Bates @ 2017-04-20 23:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jerome Glisse, Logan Gunthorpe, Jason Gunthorpe,
	Benjamin Herrenschmidt, Bjorn Helgaas, Christoph Hellwig,
	Sagi Grimberg, James E.J. Bottomley, Martin K. Petersen,
	Jens Axboe, Steve Wise, Max Gurtovoy, Keith Busch, linux-pci,
	linux-scsi, linux-nvme, linux-rdma, linux-nvdimm, linux-kernel

>> Yes, this makes sense I think we really just want to distinguish host
>> memory or not in terms of the dev_pagemap type.
>
>> I would like to see mutually exclusive flags for host memory (or not) and persistence (or not).
>>
>
> Why persistence? It has zero meaning to the mm.

I like the idea of having properties of the memory in one place. While mm might not use persistence today it may make use certain things that persistence implies (like finite endurance and/or higher write latency) in the future. Also the persistence of the memory must have issues for mm security? Again not addressed today but useful in the future.

In addition I am not sure where else would be an appropriate place to put something like a persistence property flag. I know the NVDIMM section of the kernel uses things like NFIT to describe properties of the memory but we don’t yet (to my knowledge) have something similar for IO memory.

Stephen

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-20 23:07                               ` Stephen  Bates
@ 2017-04-21  4:59                                 ` Dan Williams
  0 siblings, 0 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-21  4:59 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jens Axboe, Keith Busch, James E.J. Bottomley,
	Martin K. Petersen, linux-rdma, Benjamin Herrenschmidt,
	Steve Wise, linux-kernel, linux-nvme, Jason Gunthorpe,
	Jerome Glisse, Bjorn Helgaas, linux-pci, linux-nvdimm,
	Max Gurtovoy, linux-scsi, Christoph Hellwig

On Thu, Apr 20, 2017 at 4:07 PM, Stephen  Bates <sbates@raithlin.com> wrote:
>>> Yes, this makes sense I think we really just want to distinguish host
>>> memory or not in terms of the dev_pagemap type.
>>
>>> I would like to see mutually exclusive flags for host memory (or not) and persistence (or not).
>>>
>>
>> Why persistence? It has zero meaning to the mm.
>
> I like the idea of having properties of the memory in one place.

We do have memory type data in the global iomem_resource tree, see
IORES_DESC_PERSISTENT_MEMORY.

> While mm might not use persistence today it may make use certain things that
> persistence implies (like finite endurance and/or higher write latency) in the future.

A persistence flag does not convey endurance or latency information.

> Also the persistence of the memory must have issues for mm security?

Not for the mm, data at rest security might be a property of the
device, but that's not the mm's concern.

>Again not addressed today but useful in the future.

Maybe, but to me "Useful for the future" == "don't add it to the
kernel until that future arrives".

> In addition I am not sure where else would be an appropriate place to put something like a persistence property flag. I know the NVDIMM section of the kernel uses things like NFIT to describe properties of the memory but we don’t yet (to my knowledge) have something similar for IO memory.

Do the IORES_DESC flags give you what you need?

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-24  7:36                                 ` Knut Omang
  2017-04-24 16:14                                   ` Logan Gunthorpe
@ 2017-04-25 21:23                                   ` Stephen  Bates
  1 sibling, 0 replies; 105+ messages in thread
From: Stephen  Bates @ 2017-04-25 21:23 UTC (permalink / raw)
  To: Knut Omang, Benjamin Herrenschmidt, Logan Gunthorpe, Dan Williams
  Cc: Jens Axboe, Keith Busch, James E.J. Bottomley, Sagi Grimberg,
	Martin K. Petersen, linux-rdma, linux-pci, Steve Wise,
	linux-kernel, linux-nvme, Jason Gunthorpe, Jerome Glisse,
	Bjorn Helgaas, linux-scsi, linux-nvdimm, Max Gurtovoy,
	Christoph Hellwig


> My first reflex when reading this thread was to think that this whole domain
> lends it self excellently to testing via Qemu. Could it be that doing this in 
> the opposite direction might be a safer approach in the long run even though 
> (significant) more work up-front?

While the idea of QEMU for this work is attractive it will be a long time before QEMU is in a position to support this development. 

Another approach is to propose a common development platform for p2pmem work using a platform we know is going to work. This an extreme version of the whitelisting approach that was discussed on this thread. We can list a very specific set of hardware (motherboard, PCIe end-points and (possibly) PCIe switch enclosure) that has been shown to work that others can copy for their development purposes.

p2pmem.io perhaps ;-)?

Stephen

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-25 17:03                                       ` Logan Gunthorpe
@ 2017-04-25 21:23                                         ` Stephen  Bates
  0 siblings, 0 replies; 105+ messages in thread
From: Stephen  Bates @ 2017-04-25 21:23 UTC (permalink / raw)
  To: Logan Gunthorpe, Knut Omang, Benjamin Herrenschmidt, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

>> Yes, that's why I used 'significant'. One good thing is that given resources 
>> it can easily be done in parallel with other development, and will give additional
>> insight of some form.
>
>Yup, well if someone wants to start working on an emulated RDMA device
>that actually simulates proper DMA transfers that would be great!

Give that each RDMA vendor’s devices expose a different MMIO I don’t expect this to happen anytime soon.

> Yes, the nvme device in qemu has a CMB buffer which is a good choice to
> test with but we don't have code to use it for p2p transfers in the
>kernel so it is a bit awkward.

Note the CMB code is not in upstream QEMU, it’s in Keith’s fork [1]. I will see if I can push this upstream.

Stephen

[1] git://git.infradead.org/users/kbusch/qemu-nvme.git

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-25  6:30                                     ` Knut Omang
@ 2017-04-25 17:03                                       ` Logan Gunthorpe
  2017-04-25 21:23                                         ` Stephen  Bates
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-25 17:03 UTC (permalink / raw)
  To: Knut Omang, Benjamin Herrenschmidt, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 25/04/17 12:30 AM, Knut Omang wrote:
> Yes, that's why I used 'significant'. One good thing is that given resources 
> it can easily be done in parallel with other development, and will give additional
> insight of some form.

Yup, well if someone wants to start working on an emulated RDMA device
that actually simulates proper DMA transfers that would be great!

>> I also imagine it would be quite difficult to develop those models
>> given the array of hardware that needs to be supported and the deep
>> functional knowledge required to figure out appropriate restrictions.
> 
> From my naive perspective it seems it need not even be a full model to get some benefits,
> just low level functionality tests with some instances of a
> device that offers some MMIO space 'playground'.

Yes, the nvme device in qemu has a CMB buffer which is a good choice to
test with but we don't have code to use it for p2p transfers in the
kernel so it is a bit awkward. We also posted [1] to qemu which was also
handy to test with using a small out of tree module. But I don't see a
lot of value in it if you completely ignore any hardware quirks.

Logan

[1] https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg04331.html

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-24 16:14                                   ` Logan Gunthorpe
@ 2017-04-25  6:30                                     ` Knut Omang
  2017-04-25 17:03                                       ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Knut Omang @ 2017-04-25  6:30 UTC (permalink / raw)
  To: Logan Gunthorpe, Benjamin Herrenschmidt, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Mon, 2017-04-24 at 10:14 -0600, Logan Gunthorpe wrote:
> 
> On 24/04/17 01:36 AM, Knut Omang wrote:
> > My first reflex when reading this thread was to think that this whole domain
> > lends it self excellently to testing via Qemu. Could it be that doing this in 
> > the opposite direction might be a safer approach in the long run even though 
> > (significant) more work up-front?
> 
> That's an interesting idea. We did do some very limited testing on qemu
> with one iteration of our work. However, it's difficult because there is
> no support for any RDMA devices which are a part of our primary use
> case. 

Yes, that's why I used 'significant'. One good thing is that given resources 
it can easily be done in parallel with other development, and will give additional
insight of some form.

> I also imagine it would be quite difficult to develop those models
> given the array of hardware that needs to be supported and the deep
> functional knowledge required to figure out appropriate restrictions.

>From my naive perspective it seems it need not even be a full model to get some benefits,
just low level functionality tests with some instances of a
device that offers some MMIO space 'playground'.

Or maybe you can leverage some of the already implemented emulated devices in Qemu?

Knut

> 
> Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-24  7:36                                 ` Knut Omang
@ 2017-04-24 16:14                                   ` Logan Gunthorpe
  2017-04-25  6:30                                     ` Knut Omang
  2017-04-25 21:23                                   ` Stephen  Bates
  1 sibling, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-24 16:14 UTC (permalink / raw)
  To: Knut Omang, Benjamin Herrenschmidt, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 24/04/17 01:36 AM, Knut Omang wrote:
> My first reflex when reading this thread was to think that this whole domain
> lends it self excellently to testing via Qemu. Could it be that doing this in 
> the opposite direction might be a safer approach in the long run even though 
> (significant) more work up-front?

That's an interesting idea. We did do some very limited testing on qemu
with one iteration of our work. However, it's difficult because there is
no support for any RDMA devices which are a part of our primary use
case. I also imagine it would be quite difficult to develop those models
given the array of hardware that needs to be supported and the deep
functional knowledge required to figure out appropriate restrictions.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16 22:31                               ` Benjamin Herrenschmidt
@ 2017-04-24  7:36                                 ` Knut Omang
  2017-04-24 16:14                                   ` Logan Gunthorpe
  2017-04-25 21:23                                   ` Stephen  Bates
  0 siblings, 2 replies; 105+ messages in thread
From: Knut Omang @ 2017-04-24  7:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Logan Gunthorpe, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Mon, 2017-04-17 at 08:31 +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2017-04-16 at 10:34 -0600, Logan Gunthorpe wrote:
> > 
> > On 16/04/17 09:53 AM, Dan Williams wrote:
> > > ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> > > context about the physical address in question. I'm thinking you can
> > > hang bus address translation data off of that structure. This seems
> > > vaguely similar to what HMM is doing.
> > 
> > Thanks! I didn't realize you had the infrastructure to look up a device
> > from a pfn/page. That would really come in handy for us.
> 
> It does indeed. I won't be able to play with that much for a few weeks
> (see my other email) so if you're going to tackle this while I'm away,
> can you work with Jerome to make sure you don't conflict with HMM ?
> 
> I really want a way for HMM to be able to layout struct pages over the
> GPU BARs rather than in "allocated free space" for the case where the
> BAR is big enough to cover all of the GPU memory.
> 
> In general, I'd like a simple & generic way for any driver to ask the
> core to layout DMA'ble struct pages over BAR space. I an not convinced
> this requires a "p2mem device" to be created on top of this though but
> that's a different discussion.
> 
> Of course the actual ability to perform the DMA mapping will be subject
> to various restrictions that will have to be implemented in the actual
> "dma_ops override" backend. We can have generic code to handle the case
> where devices reside on the same domain, which can deal with switch
> configuration etc... we will need to have iommu specific code to handle
> the case going through the fabric. 
> 
> Virtualization is a separate can of worms due to how qemu completely
> fakes the MMIO space, we can look into that later.

My first reflex when reading this thread was to think that this whole domain
lends it self excellently to testing via Qemu. Could it be that doing this in 
the opposite direction might be a safer approach in the long run even though 
(significant) more work up-front?

Eg. start by fixing/providing/documenting suitable model(s) 
for testing this in Qemu, then implement the patch set based 
on those models?

Thanks,
Knut

> 
> Cheers,
> Ben.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16 16:34                             ` Logan Gunthorpe
@ 2017-04-16 22:31                               ` Benjamin Herrenschmidt
  2017-04-24  7:36                                 ` Knut Omang
  0 siblings, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-16 22:31 UTC (permalink / raw)
  To: Logan Gunthorpe, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Sun, 2017-04-16 at 10:34 -0600, Logan Gunthorpe wrote:
> 
> On 16/04/17 09:53 AM, Dan Williams wrote:
> > ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> > context about the physical address in question. I'm thinking you can
> > hang bus address translation data off of that structure. This seems
> > vaguely similar to what HMM is doing.
> 
> Thanks! I didn't realize you had the infrastructure to look up a device
> from a pfn/page. That would really come in handy for us.

It does indeed. I won't be able to play with that much for a few weeks
(see my other email) so if you're going to tackle this while I'm away,
can you work with Jerome to make sure you don't conflict with HMM ?

I really want a way for HMM to be able to layout struct pages over the
GPU BARs rather than in "allocated free space" for the case where the
BAR is big enough to cover all of the GPU memory.

In general, I'd like a simple & generic way for any driver to ask the
core to layout DMA'ble struct pages over BAR space. I an not convinced
this requires a "p2mem device" to be created on top of this though but
that's a different discussion.

Of course the actual ability to perform the DMA mapping will be subject
to various restrictions that will have to be implemented in the actual
"dma_ops override" backend. We can have generic code to handle the case
where devices reside on the same domain, which can deal with switch
configuration etc... we will need to have iommu specific code to handle
the case going through the fabric. 

Virtualization is a separate can of worms due to how qemu completely
fakes the MMIO space, we can look into that later.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16 15:53                           ` Dan Williams
  2017-04-16 16:34                             ` Logan Gunthorpe
@ 2017-04-16 22:26                             ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-16 22:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Bjorn Helgaas, Jason Gunthorpe,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Sun, 2017-04-16 at 08:53 -0700, Dan Williams wrote:
> > Just thinking out loud ... I don't have a firm idea or a design. But
> > peer to peer is definitely a problem we need to tackle generically, the
> > demand for it keeps coming up.
> 
> ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> context about the physical address in question. I'm thinking you can
> hang bus address translation data off of that structure. This seems
> vaguely similar to what HMM is doing.

Ok, that's interesting. That would be a way to handle the "lookup" I
was mentioning in the email I sent a few minutes ago.

We would probably need to put some "structure" to that context.

I'm very short on time to look into the details of this for at least
a month (I'm taking about 3 weeks off for personal reasons next week),
but I'm happy to dive more into this when I'm back and sort out with
Jerome how to make it all co-habitate nicely with HMM.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16 15:53                           ` Dan Williams
@ 2017-04-16 16:34                             ` Logan Gunthorpe
  2017-04-16 22:31                               ` Benjamin Herrenschmidt
  2017-04-16 22:26                             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-16 16:34 UTC (permalink / raw)
  To: Dan Williams, Benjamin Herrenschmidt
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 16/04/17 09:53 AM, Dan Williams wrote:
> ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> context about the physical address in question. I'm thinking you can
> hang bus address translation data off of that structure. This seems
> vaguely similar to what HMM is doing.

Thanks! I didn't realize you had the infrastructure to look up a device
from a pfn/page. That would really come in handy for us.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16  3:01                         ` Benjamin Herrenschmidt
  2017-04-16  4:46                           ` Logan Gunthorpe
@ 2017-04-16 15:53                           ` Dan Williams
  2017-04-16 16:34                             ` Logan Gunthorpe
  2017-04-16 22:26                             ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 105+ messages in thread
From: Dan Williams @ 2017-04-16 15:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Logan Gunthorpe, Bjorn Helgaas, Jason Gunthorpe,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Sat, Apr 15, 2017 at 8:01 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Sat, 2017-04-15 at 15:09 -0700, Dan Williams wrote:
>> I'm wondering, since this is limited to support behind a single
>> switch, if you could have a software-iommu hanging off that switch
>> device object that knows how to catch and translate the non-zero
>> offset bus address case. We have something like this with VMD driver,
>> and I toyed with a soft pci bridge when trying to support AHCI+NVME
>> bar remapping. When the dma api looks up the iommu for its device it
>> hits this soft-iommu and that driver checks if the page is host memory
>> or device memory to do the dma translation. You wouldn't need a bit in
>> struct page, just a lookup to the hosting struct dev_pagemap in the
>> is_zone_device_page() case and that can point you to p2p details.
>
> I was thinking about a hook in the arch DMA ops but that kind of
> wrapper might work instead indeed. However I'm not sure what's the best
> way to "instantiate" it.
>
> The main issue is that the DMA ops are a function of the initiator,
> not the target (since the target is supposed to be memory) so things
> are a bit awkward.
>
> One (user ?) would have to know that a given device "intends" to DMA
> directly to another device.
>
> This is awkward because in the ideal scenario, this isn't something the
> device knows. For example, one could want to have an existing NIC DMA
> directly to/from NVME pages or GPU pages.
>
> The NIC itself doesn't know the characteristic of these pages, but
> *something* needs to insert itself in the DMA ops of that bridge to
> make it possible.
>
> That's why I wonder if it's the struct page of the target that should
> be "marked" in such a way that the arch dma'ops can immediately catch
> that they belong to a device and might require "wrapped" operations.
>
> Are ZONE_DEVICE pages identifiable based on the struct page alone ? (a
> flag ?)

Yes, is_zone_device_page(). However I think we're getting to the point
with pmem, hmm, cdm, and now p2p where ZONE_DEVICE is losing specific
meaning and we need to have explicit type checks like is_hmm_page()
is_p2p_page() that internally check is_zone_device_page() plus some
other specific type.

> That would allow us to keep a fast path for normal memory targets, but
> also have some kind of way to handle the special cases of such peer 2
> peer (or also handle other type of peer to peer that don't necessarily
> involve PCI address wrangling but could require additional iommu bits).
>
> Just thinking out loud ... I don't have a firm idea or a design. But
> peer to peer is definitely a problem we need to tackle generically, the
> demand for it keeps coming up.

ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
context about the physical address in question. I'm thinking you can
hang bus address translation data off of that structure. This seems
vaguely similar to what HMM is doing.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-15 22:17                       ` Benjamin Herrenschmidt
@ 2017-04-16  5:36                         ` Logan Gunthorpe
  0 siblings, 0 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-16  5:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Bjorn Helgaas
  Cc: Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse



On 15/04/17 04:17 PM, Benjamin Herrenschmidt wrote:
> You can't. If the iommu is on, everything is remapped. Or do you mean
> to have dma_map_* not do a remapping ?

Well, yes, you'd have to change the code so that iomem pages do not get
remapped and the raw BAR address is passed to the DMA engine. I said
specifically we haven't done this at this time but it really doesn't
seem like an unsolvable problem. It is something we will need to address
before a proper patch set is posted though.

> That's the problem again, same as before, for that to work, the
> dma_map_* ops would have to do something special that depends on *both*
> the source and target device.

No, I don't think you have to do things different based on the source.
Have the p2pmem device layer restrict allocating p2pmem based on the
devices in use (similar to how the RFC code works now) and when the dma
mapping code sees iomem pages it just needs to leave the address alone
so it's used directly by the dma in question.

It's much better to make the decision on which memory to use when you
allocate it. If you wait until you map it, it would be a pain to fall
back to system memory if it doesn't look like it will work. So, if when
you allocate it, you know everything will work you just need the dma
mapping layer to stay out of the way.

> The dma_ops today are architecture specific and have no way to
> differenciate between normal and those special P2P DMA pages.

Correct, unless Dan's idea works (which will need some investigation),
we'd need a flag in struct page or some other similar method to
determine that these are special iomem pages.

>> Though if it does, I'd expect
>> everything would still work you just wouldn't get the performance or
>> traffic flow you are looking for. We've been testing with the software
>> iommu which doesn't have this problem.
> 
> So first, no, it's more than "you wouldn't get the performance". On
> some systems it may also just not work. Also what do you mean by "the
> SW iommu doesn't have this problem" ? It catches the fact that
> addresses don't point to RAM and maps differently ?

I haven't tested it but I can't imagine why an iommu would not correctly
map the memory in the bar. But that's _way_ beside the point. We
_really_ want to avoid that situation anyway. If the iommu maps the
memory it defeats what we are trying to accomplish.

I believe the sotfware iommu only uses bounce buffers if the DMA engine
in use cannot address the memory. So in most cases, with modern
hardware, it just passes the BAR's address to the DMA engine and
everything works. The code posted in the RFC does in fact work without
needing to do any of this fussing.

>>> The problem is that the latter while seemingly easier, is also slower
>>> and not supported by all platforms and architectures (for example,
>>> POWER currently won't allow it, or rather only allows a store-only
>>> subset of it under special circumstances).
>>
>> Yes, I think situations where we have to cross host bridges will remain
>> unsupported by this work for a long time. There are two many cases where
>> it just doesn't work or it performs too poorly to be useful.
> 
> And the situation where you don't cross bridges is the one where you
> need to also take into account the offsets.

I think for the first incarnation we will just not support systems that
have offsets. This makes things much easier and still supports all the
use cases we are interested in.

> So you are designing something that is built from scratch to only work
> on a specific limited category of systems and is also incompatible with
> virtualization.

Yes, we are starting with support for specific use cases. Almost all
technology starts that way. Dax has been in the kernel for years and
only recently has someone submitted patches for it to support pmem on
powerpc. This is not unusual. If you had forced the pmem developers to
support all architectures in existence before allowing them upstream
they couldn't possibly be as far as they are today.

Virtualization specifically would be a _lot_ more difficult than simply
supporting offsets. The actual topology of the bus will probably be lost
on the guest OS and it would therefor have a difficult time figuring out
when it's acceptable to use p2pmem. I also have a difficult time seeing
a use case for it and thus I have a hard time with the argument that we
can't support use cases that do want it because use cases that don't
want it (perhaps yet) won't work.

> This is an interesting experiement to look at I suppose, but if you
> ever want this upstream I would like at least for you to develop a
> strategy to support the wider case, if not an actual implementation.

I think there are plenty of avenues forward to support offsets, etc.
It's just work. Nothing we'd be proposing would be incompatible with it.
We just don't want to have to do it all upfront especially when no one
really knows how well various architecture's hardware supports this or
if anyone even wants to run it on systems such as those. (Keep in mind
this is a pretty specific optimization that mostly helps systems
designed in specific ways -- not a general "everybody gets faster" type
situation.) Get the cases working we know will work, can easily support
and people actually want.  Then expand it to support others as people
come around with hardware to test and use cases for it.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-16  3:01                         ` Benjamin Herrenschmidt
@ 2017-04-16  4:46                           ` Logan Gunthorpe
  2017-04-16 15:53                           ` Dan Williams
  1 sibling, 0 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-16  4:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Dan Williams
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 15/04/17 09:01 PM, Benjamin Herrenschmidt wrote:
> Are ZONE_DEVICE pages identifiable based on the struct page alone ? (a
> flag ?)

Well you can't use ZONE_DEVICE as an indicator. They may be regular RAM,
(eg. pmem). It would need a separate flag indicating it is backed by iomem.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-15 22:09                       ` Dan Williams
@ 2017-04-16  3:01                         ` Benjamin Herrenschmidt
  2017-04-16  4:46                           ` Logan Gunthorpe
  2017-04-16 15:53                           ` Dan Williams
  0 siblings, 2 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-16  3:01 UTC (permalink / raw)
  To: Dan Williams, Logan Gunthorpe
  Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Sat, 2017-04-15 at 15:09 -0700, Dan Williams wrote:
> I'm wondering, since this is limited to support behind a single
> switch, if you could have a software-iommu hanging off that switch
> device object that knows how to catch and translate the non-zero
> offset bus address case. We have something like this with VMD driver,
> and I toyed with a soft pci bridge when trying to support AHCI+NVME
> bar remapping. When the dma api looks up the iommu for its device it
> hits this soft-iommu and that driver checks if the page is host memory
> or device memory to do the dma translation. You wouldn't need a bit in
> struct page, just a lookup to the hosting struct dev_pagemap in the
> is_zone_device_page() case and that can point you to p2p details.

I was thinking about a hook in the arch DMA ops but that kind of
wrapper might work instead indeed. However I'm not sure what's the best
way to "instantiate" it.

The main issue is that the DMA ops are a function of the initiator,
not the target (since the target is supposed to be memory) so things
are a bit awkward.

One (user ?) would have to know that a given device "intends" to DMA
directly to another device.

This is awkward because in the ideal scenario, this isn't something the
device knows. For example, one could want to have an existing NIC DMA
directly to/from NVME pages or GPU pages.

The NIC itself doesn't know the characteristic of these pages, but
*something* needs to insert itself in the DMA ops of that bridge to
make it possible.

That's why I wonder if it's the struct page of the target that should
be "marked" in such a way that the arch dma'ops can immediately catch
that they belong to a device and might require "wrapped" operations.

Are ZONE_DEVICE pages identifiable based on the struct page alone ? (a
flag ?)

That would allow us to keep a fast path for normal memory targets, but
also have some kind of way to handle the special cases of such peer 2
peer (or also handle other type of peer to peer that don't necessarily
involve PCI address wrangling but could require additional iommu bits).

Just thinking out loud ... I don't have a firm idea or a design. But
peer to peer is definitely a problem we need to tackle generically, the
demand for it keeps coming up.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-15 17:41                     ` Logan Gunthorpe
  2017-04-15 22:09                       ` Dan Williams
@ 2017-04-15 22:17                       ` Benjamin Herrenschmidt
  2017-04-16  5:36                         ` Logan Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-15 22:17 UTC (permalink / raw)
  To: Logan Gunthorpe, Bjorn Helgaas
  Cc: Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse

On Sat, 2017-04-15 at 11:41 -0600, Logan Gunthorpe wrote:
> Thanks, Benjamin, for the summary of some of the issues.
> 
> On 14/04/17 04:07 PM, Benjamin Herrenschmidt wrote
> > So I assume the p2p code provides a way to address that too via special
> > dma_ops ? Or wrappers ?
> 
> Not at this time. We will probably need a way to ensure the iommus do
> not attempt to remap these addresses.

You can't. If the iommu is on, everything is remapped. Or do you mean
to have dma_map_* not do a remapping ?

That's the problem again, same as before, for that to work, the
dma_map_* ops would have to do something special that depends on *both*
the source and target device.

The current DMA infrastructure doesn't have anything like that. It's a
rather fundamental issue to your design that you need to address.

The dma_ops today are architecture specific and have no way to
differenciate between normal and those special P2P DMA pages.

> Though if it does, I'd expect
> everything would still work you just wouldn't get the performance or
> traffic flow you are looking for. We've been testing with the software
> iommu which doesn't have this problem.

So first, no, it's more than "you wouldn't get the performance". On
some systems it may also just not work. Also what do you mean by "the
SW iommu doesn't have this problem" ? It catches the fact that
addresses don't point to RAM and maps differently ?

> > The problem is that the latter while seemingly easier, is also slower
> > and not supported by all platforms and architectures (for example,
> > POWER currently won't allow it, or rather only allows a store-only
> > subset of it under special circumstances).
> 
> Yes, I think situations where we have to cross host bridges will remain
> unsupported by this work for a long time. There are two many cases where
> it just doesn't work or it performs too poorly to be useful.

And the situation where you don't cross bridges is the one where you
need to also take into account the offsets.

*both* cases mean that you need somewhat to intervene at the dma_ops
level to handle this. Which means having a way to identify your special
struct pages or PFNs to allow the arch to add a special case to the
dma_ops.

> > I don't fully understand how p2pmem "solves" that by creating struct
> > pages. The offset problem is one issue. But there's the iommu issue as
> > well, the driver cannot just use the normal dma_map ops.
> 
> We are not using a proper iommu and we are dealing with systems that
> have zero offset. This case is also easily supported. I expect fixing
> the iommus to not map these addresses would also be reasonably achievable.

So you are designing something that is built from scratch to only work
on a specific limited category of systems and is also incompatible with
virtualization.

This is an interesting experiement to look at I suppose, but if you
ever want this upstream I would like at least for you to develop a
strategy to support the wider case, if not an actual implementation.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-15 17:41                     ` Logan Gunthorpe
@ 2017-04-15 22:09                       ` Dan Williams
  2017-04-16  3:01                         ` Benjamin Herrenschmidt
  2017-04-15 22:17                       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 105+ messages in thread
From: Dan Williams @ 2017-04-15 22:09 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Bjorn Helgaas, Jason Gunthorpe,
	Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Sat, Apr 15, 2017 at 10:41 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Thanks, Benjamin, for the summary of some of the issues.
>
> On 14/04/17 04:07 PM, Benjamin Herrenschmidt wrote
>> So I assume the p2p code provides a way to address that too via special
>> dma_ops ? Or wrappers ?
>
> Not at this time. We will probably need a way to ensure the iommus do
> not attempt to remap these addresses. Though if it does, I'd expect
> everything would still work you just wouldn't get the performance or
> traffic flow you are looking for. We've been testing with the software
> iommu which doesn't have this problem.
>
>> The problem is that the latter while seemingly easier, is also slower
>> and not supported by all platforms and architectures (for example,
>> POWER currently won't allow it, or rather only allows a store-only
>> subset of it under special circumstances).
>
> Yes, I think situations where we have to cross host bridges will remain
> unsupported by this work for a long time. There are two many cases where
> it just doesn't work or it performs too poorly to be useful.
>
>> I don't fully understand how p2pmem "solves" that by creating struct
>> pages. The offset problem is one issue. But there's the iommu issue as
>> well, the driver cannot just use the normal dma_map ops.
>
> We are not using a proper iommu and we are dealing with systems that
> have zero offset. This case is also easily supported. I expect fixing
> the iommus to not map these addresses would also be reasonably achievable.

I'm wondering, since this is limited to support behind a single
switch, if you could have a software-iommu hanging off that switch
device object that knows how to catch and translate the non-zero
offset bus address case. We have something like this with VMD driver,
and I toyed with a soft pci bridge when trying to support AHCI+NVME
bar remapping. When the dma api looks up the iommu for its device it
hits this soft-iommu and that driver checks if the page is host memory
or device memory to do the dma translation. You wouldn't need a bit in
struct page, just a lookup to the hosting struct dev_pagemap in the
is_zone_device_page() case and that can point you to p2p details.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-14 22:07                   ` Benjamin Herrenschmidt
@ 2017-04-15 17:41                     ` Logan Gunthorpe
  2017-04-15 22:09                       ` Dan Williams
  2017-04-15 22:17                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-15 17:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Bjorn Helgaas
  Cc: Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse

Thanks, Benjamin, for the summary of some of the issues.

On 14/04/17 04:07 PM, Benjamin Herrenschmidt wrote
> So I assume the p2p code provides a way to address that too via special
> dma_ops ? Or wrappers ?

Not at this time. We will probably need a way to ensure the iommus do
not attempt to remap these addresses. Though if it does, I'd expect
everything would still work you just wouldn't get the performance or
traffic flow you are looking for. We've been testing with the software
iommu which doesn't have this problem.

> The problem is that the latter while seemingly easier, is also slower
> and not supported by all platforms and architectures (for example,
> POWER currently won't allow it, or rather only allows a store-only
> subset of it under special circumstances).

Yes, I think situations where we have to cross host bridges will remain
unsupported by this work for a long time. There are two many cases where
it just doesn't work or it performs too poorly to be useful.

> I don't fully understand how p2pmem "solves" that by creating struct
> pages. The offset problem is one issue. But there's the iommu issue as
> well, the driver cannot just use the normal dma_map ops.

We are not using a proper iommu and we are dealing with systems that
have zero offset. This case is also easily supported. I expect fixing
the iommus to not map these addresses would also be reasonably achievable.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-14 19:04                 ` Bjorn Helgaas
@ 2017-04-14 22:07                   ` Benjamin Herrenschmidt
  2017-04-15 17:41                     ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-14 22:07 UTC (permalink / raw)
  To: Bjorn Helgaas, Logan Gunthorpe
  Cc: Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse

On Fri, 2017-04-14 at 14:04 -0500, Bjorn Helgaas wrote:
> I'm a little hesitant about excluding offset support, so I'd like to
> hear more about this.
> 
> Is the issue related to PCI BARs that are not completely addressable
> by the CPU?  If so, that sounds like a first-class issue that should
> be resolved up front because I don't think the PCI core in general
> would deal well with that.
> 
> If all the PCI memory of interest is in fact addressable by the CPU,
> I would think it would be pretty straightforward to support offsets
> --
> everywhere you currently use a PCI bus address, you just use the
> corresponding CPU physical address instead.

It's not *that* easy sadly. The reason is that normal dma map APIs
assume the "target" of the DMAs are system memory, there is no way to
pass it "another device", in a way that allows it to understand the
offsets if needed.

That said, dma_map will already be problematic when doing p2p behind
the same bridge due to the fact that the iommu is not in the way so you
can't use the arch standard ops there.

So I assume the p2p code provides a way to address that too via special
dma_ops ? Or wrappers ?

Basically there are two very different ways you can do p2p, either
behind the same host bridge or accross two host bridges:

 - Behind the same host bridge, you don't go through the iommu, which
means that even if your target looks like a struct page, you can't just
use dma_map_* on it because what you'll get back from that is an iommu
token, not a sibling BAR offset. Additionally, if you use the target
struct resource address, you will need to offset it to get back to the
actual BAR value on the PCIe bus.

 - Behind different host bridges, then you go through the iommu and the
host remapping. IE. that's actually the easy case. You can probably
just use the normal iommu path and normal dma mapping ops, you just
need to have your struct page representing the target device BAR *in
CPU space* this time. So no offsetting required either.

The problem is that the latter while seemingly easier, is also slower
and not supported by all platforms and architectures (for example,
POWER currently won't allow it, or rather only allows a store-only
subset of it under special circumstances).

So what people practically want to do is to have 2 devices behind a
switch DMA'ing to/from each other.

But that brings the 2 problems above.

I don't fully understand how p2pmem "solves" that by creating struct
pages. The offset problem is one issue. But there's the iommu issue as
well, the driver cannot just use the normal dma_map ops.

I haven't had a chance to look at the details of the patches but it's
not clear from the description in patch 0 how that is solved.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-14 17:30               ` Logan Gunthorpe
@ 2017-04-14 19:04                 ` Bjorn Helgaas
  2017-04-14 22:07                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 105+ messages in thread
From: Bjorn Helgaas @ 2017-04-14 19:04 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Jason Gunthorpe, Christoph Hellwig,
	Sagi Grimberg, James E.J. Bottomley, Martin K. Petersen,
	Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy,
	Dan Williams, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Fri, Apr 14, 2017 at 11:30:14AM -0600, Logan Gunthorpe wrote:
> On 14/04/17 05:37 AM, Benjamin Herrenschmidt wrote:
> > I object to designing a subsystem that by design cannot work on whole
> > categories of architectures out there.
> 
> Hardly. That's extreme. We'd design a subsystem that works for the easy
> cases and needs more work to support the offset cases. It would not be
> designed in such a way that it could _never_ support those
> architectures. It would simply be such that it only permits use by the
> cases that are known to work. Then those cases could be expanded as time
> goes on and people work on adding more support.

I'm a little hesitant about excluding offset support, so I'd like to
hear more about this.

Is the issue related to PCI BARs that are not completely addressable
by the CPU?  If so, that sounds like a first-class issue that should
be resolved up front because I don't think the PCI core in general
would deal well with that.

If all the PCI memory of interest is in fact addressable by the CPU,
I would think it would be pretty straightforward to support offsets --
everywhere you currently use a PCI bus address, you just use the
corresponding CPU physical address instead.

> There's tons of stuff that needs to be done to get this upstream. I'd
> rather not require it to work for every possible architecture from the
> start. The testing alone would be impossible. Many subsystems start by
> working for x86 first and then adding support in other architectures
> later. (Often with that work done by the people who care about those
> systems and actually have the hardware to test with.)

I don't think exhaustive testing is that big a deal.  PCI offset
support is generic, so you shouldn't need any arch-specific code to
deal with it.  I'd rather have consistent support across the board,
even though some arches might not be tested.  I think that's simpler
and better than adding checks to disable functionality on some arches
merely on the grounds that it hasn't been tested there.

Bjorn

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-14 11:37             ` Benjamin Herrenschmidt
@ 2017-04-14 17:30               ` Logan Gunthorpe
  2017-04-14 19:04                 ` Bjorn Helgaas
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-14 17:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Jason Gunthorpe, Bjorn Helgaas
  Cc: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse



On 14/04/17 05:37 AM, Benjamin Herrenschmidt wrote:
> I object to designing a subsystem that by design cannot work on whole
> categories of architectures out there.

Hardly. That's extreme. We'd design a subsystem that works for the easy
cases and needs more work to support the offset cases. It would not be
designed in such a way that it could _never_ support those
architectures. It would simply be such that it only permits use by the
cases that are known to work. Then those cases could be expanded as time
goes on and people work on adding more support.

There's tons of stuff that needs to be done to get this upstream. I'd
rather not require it to work for every possible architecture from the
start. The testing alone would be impossible. Many subsystems start by
working for x86 first and then adding support in other architectures
later. (Often with that work done by the people who care about those
systems and actually have the hardware to test with.)

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-14 11:37               ` Benjamin Herrenschmidt
@ 2017-04-14 11:39                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-14 11:39 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe, Bjorn Helgaas
  Cc: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Fri, 2017-04-14 at 21:37 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote:
> > 
> > On 13/04/17 10:16 PM, Jason Gunthorpe wrote:
> > > I'd suggest just detecting if there is any translation in bus
> > > addresses anywhere and just hard disabling P2P on such systems.
> > 
> > That's a fantastic suggestion. It simplifies things significantly.
> > Unless there are any significant objections I think I will plan on
> > doing
> > that.
> 
> I object.

Note: It would also make your stuff fundamentally incompatible with
KVM guest pass-through since KVM plays with remapping BARs all over
the place.

Ben.

> > > On modern hardware with 64 bit BARs there is very little reason
> > > to
> > > have translation, so I think this is a legacy feature.
> > 
> > Yes, p2pmem users are likely to be designing systems around it (ie
> > JBOFs) and not trying to shoehorn it onto legacy architectures.
> > 
> > At the very least, it makes sense to leave it out and if someone
> > comes
> > along who cares they can put in the effort to support the address
> > translation.
> > 
> > Thanks,
> > 
> > Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-14  4:40             ` Logan Gunthorpe
@ 2017-04-14 11:37               ` Benjamin Herrenschmidt
  2017-04-14 11:39                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-14 11:37 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe, Bjorn Helgaas
  Cc: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, linux-pci, linux-scsi,
	linux-nvme, linux-rdma, linux-nvdimm, linux-kernel,
	Jerome Glisse

On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote:
> 
> On 13/04/17 10:16 PM, Jason Gunthorpe wrote:
> > I'd suggest just detecting if there is any translation in bus
> > addresses anywhere and just hard disabling P2P on such systems.
> 
> That's a fantastic suggestion. It simplifies things significantly.
> Unless there are any significant objections I think I will plan on
> doing
> that.

I object.

> > On modern hardware with 64 bit BARs there is very little reason to
> > have translation, so I think this is a legacy feature.
> 
> Yes, p2pmem users are likely to be designing systems around it (ie
> JBOFs) and not trying to shoehorn it onto legacy architectures.
> 
> At the very least, it makes sense to leave it out and if someone
> comes
> along who cares they can put in the effort to support the address
> translation.
> 
> Thanks,
> 
> Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-14  4:16           ` Jason Gunthorpe
  2017-04-14  4:40             ` Logan Gunthorpe
@ 2017-04-14 11:37             ` Benjamin Herrenschmidt
  2017-04-14 17:30               ` Logan Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-14 11:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Bjorn Helgaas
  Cc: Logan Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse

On Thu, 2017-04-13 at 22:16 -0600, Jason Gunthorpe wrote:
> > Any caller of pci_add_resource_offset() uses CPU addresses different from
> > the PCI bus addresses (unless the offset is zero, of course).  All ACPI
> > platforms also support this translation (see "translation_offset"), though
> > in most x86 systems the offset is zero.  I'm aware of one x86 system that
> > was tested with a non-zero offset but I don't think it was shipped that
> > way.
> 
> I'd suggest just detecting if there is any translation in bus
> addresses anywhere and just hard disabling P2P on such systems.

I object to designing a subsystem that by design cannot work on whole
categories of architectures out there.

> On modern hardware with 64 bit BARs there is very little reason to
> have translation, so I think this is a legacy feature.

No it's not.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-14  4:16           ` Jason Gunthorpe
@ 2017-04-14  4:40             ` Logan Gunthorpe
  2017-04-14 11:37               ` Benjamin Herrenschmidt
  2017-04-14 11:37             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-14  4:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse



On 13/04/17 10:16 PM, Jason Gunthorpe wrote:
> I'd suggest just detecting if there is any translation in bus
> addresses anywhere and just hard disabling P2P on such systems.

That's a fantastic suggestion. It simplifies things significantly.
Unless there are any significant objections I think I will plan on doing
that.

> On modern hardware with 64 bit BARs there is very little reason to
> have translation, so I think this is a legacy feature.

Yes, p2pmem users are likely to be designing systems around it (ie
JBOFs) and not trying to shoehorn it onto legacy architectures.

At the very least, it makes sense to leave it out and if someone comes
along who cares they can put in the effort to support the address
translation.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-13 23:26         ` Bjorn Helgaas
@ 2017-04-14  4:16           ` Jason Gunthorpe
  2017-04-14  4:40             ` Logan Gunthorpe
  2017-04-14 11:37             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 105+ messages in thread
From: Jason Gunthorpe @ 2017-04-14  4:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Logan Gunthorpe, Benjamin Herrenschmidt, Christoph Hellwig,
	Sagi Grimberg, James E.J. Bottomley, Martin K. Petersen,
	Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy,
	Dan Williams, Keith Busch, linux-pci, linux-scsi, linux-nvme,
	linux-rdma, linux-nvdimm, linux-kernel, Jerome Glisse

On Thu, Apr 13, 2017 at 06:26:31PM -0500, Bjorn Helgaas wrote:
> > Ah, thanks for the tip! On my system, this translation returns the same
> > address so it was not necessary. And, yes, that means this would have to
> > find its way into the dma mapping routine somehow. This means we'll
> > eventually need a way to look-up the p2pmem device from the struct page.
> > Which means we will likely need a new flag bit in the struct page or
> > something. The big difficulty I see is testing. Do you know what
> > architectures or in what circumstances are these translations used?
> 
> Any caller of pci_add_resource_offset() uses CPU addresses different from
> the PCI bus addresses (unless the offset is zero, of course).  All ACPI
> platforms also support this translation (see "translation_offset"), though
> in most x86 systems the offset is zero.  I'm aware of one x86 system that
> was tested with a non-zero offset but I don't think it was shipped that
> way.

I'd suggest just detecting if there is any translation in bus
addresses anywhere and just hard disabling P2P on such systems.

On modern hardware with 64 bit BARs there is very little reason to
have translation, so I think this is a legacy feature.

Jason

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-13 21:22       ` Logan Gunthorpe
  2017-04-13 22:37         ` Benjamin Herrenschmidt
@ 2017-04-13 23:26         ` Bjorn Helgaas
  2017-04-14  4:16           ` Jason Gunthorpe
  1 sibling, 1 reply; 105+ messages in thread
From: Bjorn Helgaas @ 2017-04-13 23:26 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Benjamin Herrenschmidt, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe, linux-pci, linux-scsi, linux-nvme, linux-rdma,
	linux-nvdimm, linux-kernel, Jerome Glisse

On Thu, Apr 13, 2017 at 03:22:06PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 12/04/17 03:55 PM, Benjamin Herrenschmidt wrote:
> > Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
> > will perform the conversion between the struct resource content (CPU
> > physical address) and the actual PCI bus side address.
> 
> Ah, thanks for the tip! On my system, this translation returns the same
> address so it was not necessary. And, yes, that means this would have to
> find its way into the dma mapping routine somehow. This means we'll
> eventually need a way to look-up the p2pmem device from the struct page.
> Which means we will likely need a new flag bit in the struct page or
> something. The big difficulty I see is testing. Do you know what
> architectures or in what circumstances are these translations used?

Any caller of pci_add_resource_offset() uses CPU addresses different from
the PCI bus addresses (unless the offset is zero, of course).  All ACPI
platforms also support this translation (see "translation_offset"), though
in most x86 systems the offset is zero.  I'm aware of one x86 system that
was tested with a non-zero offset but I don't think it was shipped that
way.

Bjorn

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-13 21:22       ` Logan Gunthorpe
@ 2017-04-13 22:37         ` Benjamin Herrenschmidt
  2017-04-13 23:26         ` Bjorn Helgaas
  1 sibling, 0 replies; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-13 22:37 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse

On Thu, 2017-04-13 at 15:22 -0600, Logan Gunthorpe wrote:
> 
> On 12/04/17 03:55 PM, Benjamin Herrenschmidt wrote:
> > Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
> > will perform the conversion between the struct resource content (CPU
> > physical address) and the actual PCI bus side address.
> 
> Ah, thanks for the tip! On my system, this translation returns the same
> address so it was not necessary. And, yes, that means this would have to
> find its way into the dma mapping routine somehow. This means we'll
> eventually need a way to look-up the p2pmem device from the struct page.
> Which means we will likely need a new flag bit in the struct page or
> something. The big difficulty I see is testing. Do you know what
> architectures or in what circumstances are these translations used?

I think a bunch of non-x86 architectures but I don't know which ones
outside of powerpc.

> > When behind the same switch you need to use PCI addresses. If one tries
> > later to do P2P between host bridges (via the CPU fabric) things get
> > more complex and one will have to use either CPU addresses or something
> > else alltogether (probably would have to teach the arch DMA mapping
> > routines to work with those struct pages you create and return the
> > right thing).
> 
> Probably for starters we'd want to explicitly deny cases between host
> bridges and add that later if someone wants to do the testing.

Cheers,
Ben.

> Thanks,
> 
> Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-12 21:55     ` Benjamin Herrenschmidt
@ 2017-04-13 21:22       ` Logan Gunthorpe
  2017-04-13 22:37         ` Benjamin Herrenschmidt
  2017-04-13 23:26         ` Bjorn Helgaas
  0 siblings, 2 replies; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-13 21:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse



On 12/04/17 03:55 PM, Benjamin Herrenschmidt wrote:
> Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
> will perform the conversion between the struct resource content (CPU
> physical address) and the actual PCI bus side address.

Ah, thanks for the tip! On my system, this translation returns the same
address so it was not necessary. And, yes, that means this would have to
find its way into the dma mapping routine somehow. This means we'll
eventually need a way to look-up the p2pmem device from the struct page.
Which means we will likely need a new flag bit in the struct page or
something. The big difficulty I see is testing. Do you know what
architectures or in what circumstances are these translations used?

> When behind the same switch you need to use PCI addresses. If one tries
> later to do P2P between host bridges (via the CPU fabric) things get
> more complex and one will have to use either CPU addresses or something
> else alltogether (probably would have to teach the arch DMA mapping
> routines to work with those struct pages you create and return the
> right thing).

Probably for starters we'd want to explicitly deny cases between host
bridges and add that later if someone wants to do the testing.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-12 17:09   ` Logan Gunthorpe
@ 2017-04-12 21:55     ` Benjamin Herrenschmidt
  2017-04-13 21:22       ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-12 21:55 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Jerome Glisse

On Wed, 2017-04-12 at 11:09 -0600, Logan Gunthorpe wrote:
> 
> > Do you handle funky address translation too ? IE. the fact that the PCI
> > addresses aren't the same as the CPU physical addresses for a BAR ?
> 
> No, we use the CPU physical address of the BAR. If it's not mapped that
> way we can't use it.

Ok, you need to fix that or a bunch of architectures won't work. 

Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
will perform the conversion between the struct resource content (CPU
physical address) and the actual PCI bus side address.

When behind the same switch you need to use PCI addresses. If one tries
later to do P2P between host bridges (via the CPU fabric) things get
more complex and one will have to use either CPU addresses or something
else alltogether (probably would have to teach the arch DMA mapping
routines to work with those struct pages you create and return the
right thing).

> > > This will mean many setups that could likely
> > > work well will not be supported so that we can be more confident it
> > > will work and not place any responsibility on the user to understand
> > > their topology. (We've chosen to go this route based on feedback we
> > > received at LSF).
> > > 
> > > In order to enable this functionality we introduce a new p2pmem device
> > > which can be instantiated by PCI drivers. The device will register some
> > > PCI memory as ZONE_DEVICE and provide an genalloc based allocator for
> > > users of these devices to get buffers.
> > 
> > I don't completely understand this. This is actual memory on the PCI
> > bus ? Where does it come from ? Or are you just trying to create struct
> > pages that cover your PCIe DMA target ?
> 
> Yes, the memory is on the PCI bus in a BAR. For now we have a special
> PCI card for this, but in the future it would likely be the CMB in an
> NVMe card. These patches create struct pages to map these BAR addresses
> using ZONE_DEVICE.

Ok.

So ideally we'd want things like dma_map_* to be able to be fed those
struct pages and do the right thing which is ... tricky, especially
with the address translation I mentioned since the address will be
different whether the initiator is on the same host bridge as the
target or not.

> > So correct me if I'm wrong, you are trying to create struct page's that
> > map a PCIe BAR right ? I'm trying to understand how that interacts with
> > what Jerome is doing for HMM.
> 
> Yes, well we are using ZONE_DEVICE in the exact same way as the dax code
> is. These patches use the existing API with no modifications. As I
> understand it, HMM was using ZONE_DEVICE in a way that was quite
> different to how it was originally designed.

Sort-of. I don't see why there would be a conflict with the struct
pages use though. Jerome can you chime in ? Jerome: It looks like they
are just laying out struct page over a BAR which is the same thing I
think you should do when the BAR is "large enough" for the GPU memory.

The case where HMM uses "holes" in the address space for its struct
page is somewhat orthogonal but I also see no conflict here.

> > The reason is that the HMM currently creates the struct pages with
> > "fake" PFNs pointing to a hole in the address space rather than
> > covering the actual PCIe memory of the GPU. He does that to deal with
> > the fact that some GPUs have a smaller aperture on PCIe than their
> > total memory.
> 
> I'm aware of what HMM is trying to do and although I'm not familiar with
> the intimate details, I saw it as fairly orthogonal to what we are
> attempting to do.

Right.

> > However, I have asked him to only apply that policy if the aperture is
> > indeed smaller, and if not, create struct pages that directly cover the
> > PCIe BAR of the GPU instead, which will work better on systems or
> > architecture that don't have a "pinhole" window limitation.
> > However he was under the impression that this was going to collide with
> > what you guys are doing, so I'm trying to understand how. 
> 
> I'm not sure I understand how either. However, I suspect if you collide
> with these patches then you'd also be breaking dax too.

Possibly but as I said, I don't see why so I'll let Jerome chime in
since he was under the impression that there was a conflict here :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-04-12  5:22 ` Benjamin Herrenschmidt
@ 2017-04-12 17:09   ` Logan Gunthorpe
  2017-04-12 21:55     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-04-12 17:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel



On 11/04/17 11:22 PM, Benjamin Herrenschmidt wrote:
> Another issue of course is that not all systems support P2P
> between host bridges :-) (Though almost all switches can enable it).

Yes, I'm either going to just let the user enable and test or limit it
to switches only to start. However, currently our bigger issue is
working on a way to not violate iomem safety.

> Ok. I suppose that's a reasonable starting point. Do I haven't looked
> at the patches in detail yet but it would be nice if that policy was in
> a well isolated component so it can potentially be affected by
> arch/platform code.

The policy is isolated in the new p2pmem driver. There's no reason the
policy couldn't become arbitrarily complex with specific arch
exceptions. It's just people would have to do the work to create those
exceptions.

> Do you handle funky address translation too ? IE. the fact that the PCI
> addresses aren't the same as the CPU physical addresses for a BAR ?

No, we use the CPU physical address of the BAR. If it's not mapped that
way we can't use it.

>> This will mean many setups that could likely
>> work well will not be supported so that we can be more confident it
>> will work and not place any responsibility on the user to understand
>> their topology. (We've chosen to go this route based on feedback we
>> received at LSF).
>>
>> In order to enable this functionality we introduce a new p2pmem device
>> which can be instantiated by PCI drivers. The device will register some
>> PCI memory as ZONE_DEVICE and provide an genalloc based allocator for
>> users of these devices to get buffers.
> 
> I don't completely understand this. This is actual memory on the PCI
> bus ? Where does it come from ? Or are you just trying to create struct
> pages that cover your PCIe DMA target ?

Yes, the memory is on the PCI bus in a BAR. For now we have a special
PCI card for this, but in the future it would likely be the CMB in an
NVMe card. These patches create struct pages to map these BAR addresses
using ZONE_DEVICE.


> So correct me if I'm wrong, you are trying to create struct page's that
> map a PCIe BAR right ? I'm trying to understand how that interacts with
> what Jerome is doing for HMM.

Yes, well we are using ZONE_DEVICE in the exact same way as the dax code
is. These patches use the existing API with no modifications. As I
understand it, HMM was using ZONE_DEVICE in a way that was quite
different to how it was originally designed.

> The reason is that the HMM currently creates the struct pages with
> "fake" PFNs pointing to a hole in the address space rather than
> covering the actual PCIe memory of the GPU. He does that to deal with
> the fact that some GPUs have a smaller aperture on PCIe than their
> total memory.

I'm aware of what HMM is trying to do and although I'm not familiar with
the intimate details, I saw it as fairly orthogonal to what we are
attempting to do.

> However, I have asked him to only apply that policy if the aperture is
> indeed smaller, and if not, create struct pages that directly cover the
> PCIe BAR of the GPU instead, which will work better on systems or
> architecture that don't have a "pinhole" window limitation.
> However he was under the impression that this was going to collide with
> what you guys are doing, so I'm trying to understand how. 

I'm not sure I understand how either. However, I suspect if you collide
with these patches then you'd also be breaking dax too.

Logan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
  2017-03-30 22:12 Logan Gunthorpe
@ 2017-04-12  5:22 ` Benjamin Herrenschmidt
  2017-04-12 17:09   ` Logan Gunthorpe
  0 siblings, 1 reply; 105+ messages in thread
From: Benjamin Herrenschmidt @ 2017-04-12  5:22 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig, Sagi Grimberg,
	James E.J. Bottomley, Martin K. Petersen, Jens Axboe, Steve Wise,
	Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch,
	Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel

On Thu, 2017-03-30 at 16:12 -0600, Logan Gunthorpe wrote:
> Hello,
> 
> As discussed at LSF/MM we'd like to present our work to enable
> copy offload support in NVMe fabrics RDMA targets. We'd appreciate
> some review and feedback from the community on our direction.
> This series is not intended to go upstream at this point.
> 
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, presently, the trade-off
> is currently a reduction in overall throughput. (Largely due to hardware
> issues that would certainly improve in the future).

Another issue of course is that not all systems support P2P
between host bridges :-) (Though almost all switches can enable it).

> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch.

Ok. I suppose that's a reasonable starting point. Do I haven't looked
at the patches in detail yet but it would be nice if that policy was in
a well isolated component so it can potentially be affected by
arch/platform code.

Do you handle funky address translation too ? IE. the fact that the PCI
addresses aren't the same as the CPU physical addresses for a BAR ?

> This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We've chosen to go this route based on feedback we
> received at LSF).
> 
> In order to enable this functionality we introduce a new p2pmem device
> which can be instantiated by PCI drivers. The device will register some
> PCI memory as ZONE_DEVICE and provide an genalloc based allocator for
> users of these devices to get buffers.

I don't completely understand this. This is actual memory on the PCI
bus ? Where does it come from ? Or are you just trying to create struct
pages that cover your PCIe DMA target ?
 
> We give an example of enabling
> p2p memory with the cxgb4 driver, however currently these devices have
> some hardware issues that prevent their use so we will likely be
> dropping this patch in the future. Ideally, we'd want to enable this
> functionality with NVME CMB buffers, however we don't have any hardware
> with this feature at this time.

So correct me if I'm wrong, you are trying to create struct page's that
map a PCIe BAR right ? I'm trying to understand how that interacts with
what Jerome is doing for HMM.

The reason is that the HMM currently creates the struct pages with
"fake" PFNs pointing to a hole in the address space rather than
covering the actual PCIe memory of the GPU. He does that to deal with
the fact that some GPUs have a smaller aperture on PCIe than their
total memory.

However, I have asked him to only apply that policy if the aperture is
indeed smaller, and if not, create struct pages that directly cover the
PCIe BAR of the GPU instead, which will work better on systems or
architecture that don't have a "pinhole" window limitation.

However he was under the impression that this was going to collide with
what you guys are doing, so I'm trying to understand how. 

> In nvmet-rdma, we attempt to get an appropriate p2pmem device at
> queue creation time and if a suitable one is found we will use it for
> all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs
> attribute is also created which is required to be set before any p2pmem
> is attempted.
> 
> This patchset also includes a more controversial patch which provides an
> interface for userspace to obtain p2pmem buffers through an mmap call on
> a cdev. This enables userspace to fairly easily use p2pmem with RDMA and
> O_DIRECT interfaces. However, the user would be entirely responsible for
> knowing what their doing and inspecting sysfs to understand the pci
> topology and only using it in sane situations.
> 
> Thanks,
> 
> Logan
> 
> 
> Logan Gunthorpe (6):
>   Introduce Peer-to-Peer memory (p2pmem) device
>   nvmet: Use p2pmem in nvme target
>   scatterlist: Modify SG copy functions to support io memory.
>   nvmet: Be careful about using iomem accesses when dealing with p2pmem
>   p2pmem: Support device removal
>   p2pmem: Added char device user interface
> 
> Steve Wise (2):
>   cxgb4: setup pcie memory window 4 and create p2pmem region
>   p2pmem: Add debugfs "stats" file
> 
>  drivers/memory/Kconfig                          |   5 +
>  drivers/memory/Makefile                         |   2 +
>  drivers/memory/p2pmem.c                         | 697 ++++++++++++++++++++++++
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   3 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  97 +++-
>  drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   5 +
>  drivers/nvme/target/configfs.c                  |  31 ++
>  drivers/nvme/target/core.c                      |  18 +-
>  drivers/nvme/target/fabrics-cmd.c               |  28 +-
>  drivers/nvme/target/nvmet.h                     |   2 +
>  drivers/nvme/target/rdma.c                      | 183 +++++--
>  drivers/scsi/scsi_debug.c                       |   7 +-
>  include/linux/p2pmem.h                          | 120 ++++
>  include/linux/scatterlist.h                     |   7 +-
>  lib/scatterlist.c                               |  64 ++-
>  15 files changed, 1189 insertions(+), 80 deletions(-)
>  create mode 100644 drivers/memory/p2pmem.c
>  create mode 100644 include/linux/p2pmem.h
> 
> --
> 2.1.4

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
@ 2017-03-30 22:12 Logan Gunthorpe
  2017-04-12  5:22 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 105+ messages in thread
From: Logan Gunthorpe @ 2017-03-30 22:12 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, James E.J. Bottomley,
	Martin K. Petersen, Jens Axboe, Steve Wise, Stephen Bates,
	Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
  Cc: linux-pci, linux-scsi, linux-nvme, linux-rdma, linux-nvdimm,
	linux-kernel, Logan Gunthorpe

Hello,

As discussed at LSF/MM we'd like to present our work to enable
copy offload support in NVMe fabrics RDMA targets. We'd appreciate
some review and feedback from the community on our direction.
This series is not intended to go upstream at this point.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVME target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU utilizing memory and lower
PCI bandwidth required to the CPU (such that systems could be designed
with fewer lanes connected to the CPU). However, presently, the trade-off
is currently a reduction in overall throughput. (Largely due to hardware
issues that would certainly improve in the future).

Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This will mean many setups that could likely
work well will not be supported so that we can be more confident it
will work and not place any responsibility on the user to understand
their topology. (We've chosen to go this route based on feedback we
received at LSF).

In order to enable this functionality we introduce a new p2pmem device
which can be instantiated by PCI drivers. The device will register some
PCI memory as ZONE_DEVICE and provide an genalloc based allocator for
users of these devices to get buffers. We give an example of enabling
p2p memory with the cxgb4 driver, however currently these devices have
some hardware issues that prevent their use so we will likely be
dropping this patch in the future. Ideally, we'd want to enable this
functionality with NVME CMB buffers, however we don't have any hardware
with this feature at this time.

In nvmet-rdma, we attempt to get an appropriate p2pmem device at
queue creation time and if a suitable one is found we will use it for
all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs
attribute is also created which is required to be set before any p2pmem
is attempted.

This patchset also includes a more controversial patch which provides an
interface for userspace to obtain p2pmem buffers through an mmap call on
a cdev. This enables userspace to fairly easily use p2pmem with RDMA and
O_DIRECT interfaces. However, the user would be entirely responsible for
knowing what their doing and inspecting sysfs to understand the pci
topology and only using it in sane situations.

Thanks,

Logan


Logan Gunthorpe (6):
  Introduce Peer-to-Peer memory (p2pmem) device
  nvmet: Use p2pmem in nvme target
  scatterlist: Modify SG copy functions to support io memory.
  nvmet: Be careful about using iomem accesses when dealing with p2pmem
  p2pmem: Support device removal
  p2pmem: Added char device user interface

Steve Wise (2):
  cxgb4: setup pcie memory window 4 and create p2pmem region
  p2pmem: Add debugfs "stats" file

 drivers/memory/Kconfig                          |   5 +
 drivers/memory/Makefile                         |   2 +
 drivers/memory/p2pmem.c                         | 697 ++++++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  97 +++-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   5 +
 drivers/nvme/target/configfs.c                  |  31 ++
 drivers/nvme/target/core.c                      |  18 +-
 drivers/nvme/target/fabrics-cmd.c               |  28 +-
 drivers/nvme/target/nvmet.h                     |   2 +
 drivers/nvme/target/rdma.c                      | 183 +++++--
 drivers/scsi/scsi_debug.c                       |   7 +-
 include/linux/p2pmem.h                          | 120 ++++
 include/linux/scatterlist.h                     |   7 +-
 lib/scatterlist.c                               |  64 ++-
 15 files changed, 1189 insertions(+), 80 deletions(-)
 create mode 100644 drivers/memory/p2pmem.c
 create mode 100644 include/linux/p2pmem.h

--
2.1.4

^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread, other threads:[~2017-04-25 21:23 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-16 15:44 [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory Dan Williams
2017-04-16 16:47 ` Logan Gunthorpe
2017-04-16 22:32   ` Benjamin Herrenschmidt
2017-04-17  5:13     ` Logan Gunthorpe
2017-04-17  7:20       ` Benjamin Herrenschmidt
2017-04-17 16:52         ` Logan Gunthorpe
2017-04-17 17:04           ` Dan Williams
2017-04-18  5:22             ` Logan Gunthorpe
2017-04-17 18:04           ` Jerome Glisse
2017-04-18  6:14             ` Logan Gunthorpe
2017-04-17 21:11           ` Benjamin Herrenschmidt
2017-04-18  5:43             ` Logan Gunthorpe
2017-04-18  6:29               ` Benjamin Herrenschmidt
2017-04-16 22:23 ` Benjamin Herrenschmidt
2017-04-18 16:45   ` Jason Gunthorpe
2017-04-18 17:27     ` Dan Williams
2017-04-18 18:00       ` Jason Gunthorpe
2017-04-18 18:34         ` Dan Williams
2017-04-19  1:13         ` Benjamin Herrenschmidt
2017-04-18 22:46       ` Benjamin Herrenschmidt
2017-04-18 22:52         ` Dan Williams
2017-04-18 18:30     ` Logan Gunthorpe
2017-04-18 19:01       ` Jason Gunthorpe
2017-04-18 19:35         ` Logan Gunthorpe
2017-04-18 19:48           ` Dan Williams
2017-04-18 20:29             ` Jerome Glisse
2017-04-18 20:31               ` Dan Williams
2017-04-18 20:48                 ` Logan Gunthorpe
2017-04-19  1:17                   ` Benjamin Herrenschmidt
2017-04-18 21:03             ` Jason Gunthorpe
2017-04-18 21:11               ` Dan Williams
2017-04-18 21:22                 ` Jason Gunthorpe
2017-04-18 21:36                   ` Dan Williams
2017-04-18 22:15                     ` Logan Gunthorpe
2017-04-18 22:28                       ` Dan Williams
2017-04-18 22:42                         ` Jason Gunthorpe
2017-04-18 22:51                           ` Dan Williams
2017-04-18 23:21                             ` Jason Gunthorpe
2017-04-19  1:25                               ` Benjamin Herrenschmidt
2017-04-18 22:48                         ` Logan Gunthorpe
2017-04-18 22:50                           ` Dan Williams
2017-04-18 22:56                             ` Logan Gunthorpe
2017-04-18 23:02                               ` Dan Williams
2017-04-19  1:21                   ` Benjamin Herrenschmidt
2017-04-18 21:31               ` Logan Gunthorpe
2017-04-18 22:24                 ` Jason Gunthorpe
2017-04-18 23:03                   ` Logan Gunthorpe
2017-04-19  1:23                   ` Benjamin Herrenschmidt
2017-04-19  1:20               ` Benjamin Herrenschmidt
2017-04-19 15:55                 ` Jason Gunthorpe
2017-04-19 16:48                   ` Logan Gunthorpe
2017-04-19 17:01                     ` Dan Williams
2017-04-19 17:32                       ` Jerome Glisse
2017-04-19 17:41                         ` Dan Williams
2017-04-19 18:11                           ` Logan Gunthorpe
2017-04-19 18:19                             ` Logan Gunthorpe
2017-04-19 18:30                               ` Dan Williams
2017-04-19 18:41                                 ` Logan Gunthorpe
2017-04-19 18:44                                   ` Dan Williams
2017-04-20 20:43                           ` Stephen  Bates
2017-04-20 20:47                             ` Dan Williams
2017-04-20 23:07                               ` Stephen  Bates
2017-04-21  4:59                                 ` Dan Williams
2017-04-19 17:14                     ` Jason Gunthorpe
2017-04-19 18:01                       ` Logan Gunthorpe
2017-04-19 18:32                         ` Jason Gunthorpe
2017-04-19 19:02                           ` Logan Gunthorpe
2017-04-19 19:31                             ` Jason Gunthorpe
2017-04-19 19:41                               ` Logan Gunthorpe
2017-04-19 20:48                                 ` Jason Gunthorpe
2017-04-19 22:55                                   ` Logan Gunthorpe
2017-04-20  0:07                                     ` Dan Williams
2017-04-18 19:48           ` Jason Gunthorpe
2017-04-18 20:06             ` Logan Gunthorpe
  -- strict thread matches above, loose matches on Subject: below --
2017-03-30 22:12 Logan Gunthorpe
2017-04-12  5:22 ` Benjamin Herrenschmidt
2017-04-12 17:09   ` Logan Gunthorpe
2017-04-12 21:55     ` Benjamin Herrenschmidt
2017-04-13 21:22       ` Logan Gunthorpe
2017-04-13 22:37         ` Benjamin Herrenschmidt
2017-04-13 23:26         ` Bjorn Helgaas
2017-04-14  4:16           ` Jason Gunthorpe
2017-04-14  4:40             ` Logan Gunthorpe
2017-04-14 11:37               ` Benjamin Herrenschmidt
2017-04-14 11:39                 ` Benjamin Herrenschmidt
2017-04-14 11:37             ` Benjamin Herrenschmidt
2017-04-14 17:30               ` Logan Gunthorpe
2017-04-14 19:04                 ` Bjorn Helgaas
2017-04-14 22:07                   ` Benjamin Herrenschmidt
2017-04-15 17:41                     ` Logan Gunthorpe
2017-04-15 22:09                       ` Dan Williams
2017-04-16  3:01                         ` Benjamin Herrenschmidt
2017-04-16  4:46                           ` Logan Gunthorpe
2017-04-16 15:53                           ` Dan Williams
2017-04-16 16:34                             ` Logan Gunthorpe
2017-04-16 22:31                               ` Benjamin Herrenschmidt
2017-04-24  7:36                                 ` Knut Omang
2017-04-24 16:14                                   ` Logan Gunthorpe
2017-04-25  6:30                                     ` Knut Omang
2017-04-25 17:03                                       ` Logan Gunthorpe
2017-04-25 21:23                                         ` Stephen  Bates
2017-04-25 21:23                                   ` Stephen  Bates
2017-04-16 22:26                             ` Benjamin Herrenschmidt
2017-04-15 22:17                       ` Benjamin Herrenschmidt
2017-04-16  5:36                         ` Logan Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).