* Enabling peer to peer device transactions for PCIe devices
@ 2016-11-21 20:36 Deucher, Alexander
  2016-11-22 18:11 ` Dan Williams
                   ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Deucher, Alexander @ 2016-11-21 20:36 UTC (permalink / raw)
  To: 'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org'
  Cc: Koenig, Christian, Sagalovitch, Serguei, Blinzer, Paul, Kuehling,
	Felix, Sander, Ben, Suthikulpanit, Suravee, Bridgman, John

This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward.  Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory.  Also, in cases where both devices are behind a switch, it avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based.  Ideally we'd be able to take a CPU virtual address and resolve it to a physical address, taking into account IOMMUs, etc.  Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that want to use it.
 
Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
 
Here is a relatively simple example of how this could work for testing.  This is obviously not a complete solution.
- Device memory will be registered with the Linux memory subsystem by creating corresponding struct page structures for device memory
- get_user_pages_fast() will return the corresponding struct pages when a CPU address points to device memory
- put_page() will deal with struct pages for device memory
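
For illustration only, here is a rough, untested sketch of the consumer side
of that flow (the function name is invented, and it assumes the device memory
already has struct pages behind it; whether dma_map_sg() would then produce a
usable peer-to-peer bus address is exactly the open question):

    #include <linux/mm.h>
    #include <linux/scatterlist.h>
    #include <linux/dma-mapping.h>

    /* Pin a user range (which may cover device memory that has struct
     * pages) and build a DMA mapping for it.  Error handling trimmed. */
    static int example_map_user_range(struct device *dma_dev,
                                      unsigned long uaddr, int nr_pages,
                                      struct page **pages,
                                      struct sg_table *sgt)
    {
            int got, i;

            got = get_user_pages_fast(uaddr, nr_pages, 1, pages);
            if (got < nr_pages)
                    goto put;

            if (sg_alloc_table_from_pages(sgt, pages, got, 0,
                                          (unsigned long)got << PAGE_SHIFT,
                                          GFP_KERNEL))
                    goto put;

            /* Open question: is this a usable peer-to-peer address? */
            if (!dma_map_sg(dma_dev, sgt->sgl, sgt->orig_nents,
                            DMA_BIDIRECTIONAL))
                    goto free;
            return 0;

    free:
            sg_free_table(sgt);
    put:
            for (i = 0; i < got; i++)
                    put_page(pages[i]);
            return -EFAULT;
    }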
 
Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.
 
2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pros: Doesn't waste system memory for the zone metadata.
Cons: CPU access to the zone metadata is slow, and it may be lost or corrupted on a device reset.
 
3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.

4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
 
5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)

6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
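
For what such an interface could look like, here is a purely hypothetical
sketch (every name below is invented for illustration; nothing like this
exists today):

    #include <linux/ioctl.h>
    #include <linux/types.h>

    /* Hypothetical: resolve a CPU virtual address range into a dma-buf. */
    struct example_userptr_to_dmabuf {
            __u64 userptr;    /* in:  CPU virtual address                */
            __u64 length;     /* in:  length of the range in bytes       */
            __s32 dmabuf_fd;  /* out: dma-buf covering the whole range   */
            __u32 pad;
            __u64 offset;     /* out: offset of userptr within the buf   */
    };

    #define EXAMPLE_IOC_USERPTR_TO_DMABUF \
            _IOWR('x', 0x01, struct example_userptr_to_dmabuf)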
 
Alex


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-21 20:36 Enabling peer to peer device transactions for PCIe devices Deucher, Alexander
@ 2016-11-22 18:11 ` Dan Williams
       [not found]   ` <75a1f44f-c495-7d1e-7e1c-17e89555edba@amd.com>
  2017-01-05 18:39 ` Jerome Glisse
  2017-10-20 12:36 ` Ludwig Petrosyan
  2 siblings, 1 reply; 126+ messages in thread
From: Dan Williams @ 2016-11-22 18:11 UTC (permalink / raw)
  To: Deucher, Alexander
  Cc: linux-kernel, linux-rdma, linux-nvdimm@lists.01.org, Linux-media,
	dri-devel, linux-pci, Bridgman, John, Kuehling, Felix,
	Sagalovitch, Serguei, Blinzer, Paul, Koenig, Christian,
	Suthikulpanit, Suravee, Sander, Ben

On Mon, Nov 21, 2016 at 12:36 PM, Deucher, Alexander
<Alexander.Deucher@amd.com> wrote:
> This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward.  Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory.  Also in cases where both devices are behind a switch, it avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based.  Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc.  Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
>
> Some use cases:
> 1. Storage devices streaming directly to GPU device memory
> 2. GPU device memory to GPU device memory streaming
> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
> 4. DVB/V4L/SDI devices streaming directly to storage devices
>
> Here is a relatively simple example of how this could work for testing.  This is obviously not a complete solution.
> - Device memory will be registered with Linux memory sub-system by created corresponding struct page structures for device memory
> - get_user_pages_fast() will  return corresponding struct pages when CPU address points to the device memory
> - put_page() will deal with struct pages for device memory
>
[..]
> 4. iopmem
> iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)

The change I suggest for this particular approach is to switch to
"device-DAX" [1]. I.e. a character device for establishing DAX
mappings rather than a block device plus a DAX filesystem. The pro of
this approach is standard user pointers and struct pages rather than a
new construct. The con is that this is done via an interface separate
from the existing gpu and storage devices. For example, it would require
a /dev/dax instance alongside a /dev/nvme interface, but I don't see
that as a significant blocking concern.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-October/007496.html
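
To illustrate how this would look from userspace, here is a rough sketch of
mapping a device-DAX instance and feeding the resulting pointer to an
ordinary pointer-based path (paths, sizes and alignment are placeholders,
and whether the read below becomes a true peer-to-peer transfer depends on
the kernel being able to DMA-map the ZONE_DEVICE pages):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static int example_dax_read(void)
    {
            int dax = open("/dev/dax0.0", O_RDWR);
            int nvme = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
            size_t len = 2UL << 20;   /* must match the device alignment */
            void *p;

            if (dax < 0 || nvme < 0)
                    return -1;

            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dax, 0);
            if (p == MAP_FAILED)
                    return -1;

            /* Standard user pointer: existing O_DIRECT path, no new API. */
            return pread(nvme, p, len, 0) == (ssize_t)len ? 0 : -1;
    }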


* Re: Enabling peer to peer device transactions for PCIe devices
       [not found]   ` <75a1f44f-c495-7d1e-7e1c-17e89555edba@amd.com>
@ 2016-11-22 20:01     ` Dan Williams
  2016-11-22 20:10       ` Daniel Vetter
  2016-11-23 17:13     ` Logan Gunthorpe
  1 sibling, 1 reply; 126+ messages in thread
From: Dan Williams @ 2016-11-22 20:01 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: Deucher, Alexander, linux-kernel, linux-rdma,
	linux-nvdimm@lists.01.org, Linux-media, dri-devel, linux-pci,
	Bridgman, John, Kuehling, Felix, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, Dave Hansen

On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
<serguei.sagalovitch@amd.com> wrote:
> Dan,
>
> I personally like "device-DAX" idea but my concerns are:
>
> -  How well it will co-exists with the  DRM infrastructure / implementations
>    in part dealing with CPU pointers?

Inside the kernel a device-DAX range is "just memory" in the sense
that you can perform pfn_to_page() on it and issue I/O, but the vma is
not migratable. To be honest I do not know how well that co-exists
with drm infrastructure.
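
A tiny, untested illustration of that "just memory" point: a pfn inside a
device-DAX range has a struct page, so it can be dropped straight into a
scatterlist and handed to a driver's normal DMA path (function name
invented):

    #include <linux/mm.h>
    #include <linux/scatterlist.h>

    static void example_fill_sg(struct scatterlist *sg, unsigned long pfn,
                                unsigned int len, unsigned int offset)
    {
            /* Valid because device-DAX memory is struct-page backed. */
            sg_set_page(sg, pfn_to_page(pfn), len, offset);
    }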

> -  How well we will be able to handle case when we need to "move"/"evict"
>    memory/data to the new location so CPU pointer should point to the new
> physical location/address
>     (and may be not in PCI device memory at all)?

So, device-DAX deliberately avoids support for in-kernel migration or
overcommit. Those cases are left to the core mm or drm. The device-dax
interface is for cases where all that is needed is a direct-mapping to
a statically-allocated physical-address range be it persistent memory
or some other special reserved memory range.


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-22 20:01     ` Dan Williams
@ 2016-11-22 20:10       ` Daniel Vetter
  2016-11-22 20:24         ` Dan Williams
  2016-11-22 20:35         ` Serguei Sagalovitch
  0 siblings, 2 replies; 126+ messages in thread
From: Daniel Vetter @ 2016-11-22 20:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Serguei Sagalovitch, Dave Hansen, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, linux-kernel, dri-devel,
	Koenig, Christian, Sander, Ben, Suthikulpanit, Suravee, Deucher,
	Alexander, Blinzer, Paul, Linux-media

On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
> <serguei.sagalovitch@amd.com> wrote:
>> I personally like "device-DAX" idea but my concerns are:
>>
>> -  How well it will co-exists with the  DRM infrastructure / implementations
>>    in part dealing with CPU pointers?
>
> Inside the kernel a device-DAX range is "just memory" in the sense
> that you can perform pfn_to_page() on it and issue I/O, but the vma is
> not migratable. To be honest I do not know how well that co-exists
> with drm infrastructure.
>
>> -  How well we will be able to handle case when we need to "move"/"evict"
>>    memory/data to the new location so CPU pointer should point to the new
>> physical location/address
>>     (and may be not in PCI device memory at all)?
>
> So, device-DAX deliberately avoids support for in-kernel migration or
> overcommit. Those cases are left to the core mm or drm. The device-dax
> interface is for cases where all that is needed is a direct-mapping to
> a statically-allocated physical-address range be it persistent memory
> or some other special reserved memory range.

For some of the fancy use-cases (e.g. to be comparable to what HMM can
pull off) I think we want all the magic in core mm, i.e. migration and
overcommit. At least that seems to be the very strong drive in all
general-purpose gpu abstractions and implementations, where memory is
allocated with malloc, and then mapped/moved into vram/gpu address
space through some magic, but still visible on both the cpu and gpu
side in some form. Special device to allocate memory, and not being
able to migrate stuff around sound like misfeatures from that pov.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-22 20:10       ` Daniel Vetter
@ 2016-11-22 20:24         ` Dan Williams
  2016-11-22 20:35         ` Serguei Sagalovitch
  1 sibling, 0 replies; 126+ messages in thread
From: Dan Williams @ 2016-11-22 20:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Serguei Sagalovitch, Dave Hansen, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, linux-kernel, dri-devel,
	Koenig, Christian, Sander, Ben, Suthikulpanit, Suravee, Deucher,
	Alexander, Blinzer, Paul, Linux-media

On Tue, Nov 22, 2016 at 12:10 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>> <serguei.sagalovitch@amd.com> wrote:
>>> I personally like "device-DAX" idea but my concerns are:
>>>
>>> -  How well it will co-exists with the  DRM infrastructure / implementations
>>>    in part dealing with CPU pointers?
>>
>> Inside the kernel a device-DAX range is "just memory" in the sense
>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>> not migratable. To be honest I do not know how well that co-exists
>> with drm infrastructure.
>>
>>> -  How well we will be able to handle case when we need to "move"/"evict"
>>>    memory/data to the new location so CPU pointer should point to the new
>>> physical location/address
>>>     (and may be not in PCI device memory at all)?
>>
>> So, device-DAX deliberately avoids support for in-kernel migration or
>> overcommit. Those cases are left to the core mm or drm. The device-dax
>> interface is for cases where all that is needed is a direct-mapping to
>> a statically-allocated physical-address range be it persistent memory
>> or some other special reserved memory range.
>
> For some of the fancy use-cases (e.g. to be comparable to what HMM can
> pull off) I think we want all the magic in core mm, i.e. migration and
> overcommit. At least that seems to be the very strong drive in all
> general-purpose gpu abstractions and implementations, where memory is
> allocated with malloc, and then mapped/moved into vram/gpu address
> space through some magic, but still visible on both the cpu and gpu
> side in some form. Special device to allocate memory, and not being
> able to migrate stuff around sound like misfeatures from that pov.

Agreed. For general purpose P2P use cases where all you want is
direct I/O to a memory range that happens to be on a PCIe device,
I think a special device fits the bill. For gpu P2P use cases that
already have migration/overcommit expectations, it is not a good
fit.


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-22 20:10       ` Daniel Vetter
  2016-11-22 20:24         ` Dan Williams
@ 2016-11-22 20:35         ` Serguei Sagalovitch
  2016-11-22 21:03           ` Daniel Vetter
  1 sibling, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-22 20:35 UTC (permalink / raw)
  To: Daniel Vetter, Dan Williams
  Cc: Dave Hansen, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, linux-kernel, dri-devel, Koenig, Christian,
	Sander, Ben, Suthikulpanit, Suravee, Deucher, Alexander, Blinzer,
	Paul, Linux-media



On 2016-11-22 03:10 PM, Daniel Vetter wrote:
> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>> <serguei.sagalovitch@amd.com> wrote:
>>> I personally like "device-DAX" idea but my concerns are:
>>>
>>> -  How well it will co-exists with the  DRM infrastructure / implementations
>>>     in part dealing with CPU pointers?
>> Inside the kernel a device-DAX range is "just memory" in the sense
>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>> not migratable. To be honest I do not know how well that co-exists
>> with drm infrastructure.
>>
>>> -  How well we will be able to handle case when we need to "move"/"evict"
>>>     memory/data to the new location so CPU pointer should point to the new
>>> physical location/address
>>>      (and may be not in PCI device memory at all)?
>> So, device-DAX deliberately avoids support for in-kernel migration or
>> overcommit. Those cases are left to the core mm or drm. The device-dax
>> interface is for cases where all that is needed is a direct-mapping to
>> a statically-allocated physical-address range be it persistent memory
>> or some other special reserved memory range.
> For some of the fancy use-cases (e.g. to be comparable to what HMM can
> pull off) I think we want all the magic in core mm, i.e. migration and
> overcommit. At least that seems to be the very strong drive in all
> general-purpose gpu abstractions and implementations, where memory is
> allocated with malloc, and then mapped/moved into vram/gpu address
> space through some magic,
It is possible that it goes the other way around: memory is requested to be
allocated and should be kept in vram for performance reasons, but due to a
possible overcommit case we need to at least temporarily "move" such an
allocation to system memory.
>   but still visible on both the cpu and gpu
> side in some form. Special device to allocate memory, and not being
> able to migrate stuff around sound like misfeatures from that pov.
> -Daniel


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-22 20:35         ` Serguei Sagalovitch
@ 2016-11-22 21:03           ` Daniel Vetter
  2016-11-22 21:21             ` Dan Williams
  0 siblings, 1 reply; 126+ messages in thread
From: Daniel Vetter @ 2016-11-22 21:03 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: Dan Williams, Dave Hansen, linux-nvdimm@lists.01.org, linux-rdma,
	linux-pci, Kuehling, Felix, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Deucher,
	Alexander, Blinzer, Paul, Linux-media

On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
<serguei.sagalovitch@amd.com> wrote:
>
> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>
>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.williams@intel.com>
>> wrote:
>>>
>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>> <serguei.sagalovitch@amd.com> wrote:
>>>>
>>>> I personally like "device-DAX" idea but my concerns are:
>>>>
>>>> -  How well it will co-exists with the  DRM infrastructure /
>>>> implementations
>>>>     in part dealing with CPU pointers?
>>>
>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>> not migratable. To be honest I do not know how well that co-exists
>>> with drm infrastructure.
>>>
>>>> -  How well we will be able to handle case when we need to
>>>> "move"/"evict"
>>>>     memory/data to the new location so CPU pointer should point to the
>>>> new
>>>> physical location/address
>>>>      (and may be not in PCI device memory at all)?
>>>
>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>> interface is for cases where all that is needed is a direct-mapping to
>>> a statically-allocated physical-address range be it persistent memory
>>> or some other special reserved memory range.
>>
>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>> pull off) I think we want all the magic in core mm, i.e. migration and
>> overcommit. At least that seems to be the very strong drive in all
>> general-purpose gpu abstractions and implementations, where memory is
>> allocated with malloc, and then mapped/moved into vram/gpu address
>> space through some magic,
>
> It is possible that there is other way around: memory is requested to be
> allocated and should be kept in vram for  performance reason but due
> to possible overcommit case we need at least temporally to "move" such
> allocation to system memory.

With migration I meant migrating both ways of course. And with stuff
like numactl we can also influence where exactly the malloc'ed memory
is allocated originally, at least if we'd expose the vram range as a
very special numa node that happens to be far away and not hold any
cpu cores.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-22 21:03           ` Daniel Vetter
@ 2016-11-22 21:21             ` Dan Williams
  2016-11-22 22:21               ` Sagalovitch, Serguei
  2016-11-23  7:49               ` Daniel Vetter
  0 siblings, 2 replies; 126+ messages in thread
From: Dan Williams @ 2016-11-22 21:21 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Serguei Sagalovitch, Dave Hansen, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, linux-kernel, dri-devel,
	Koenig, Christian, Sander, Ben, Suthikulpanit, Suravee, Deucher,
	Alexander, Blinzer, Paul, Linux-media

On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
> <serguei.sagalovitch@amd.com> wrote:
>>
>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>
>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.williams@intel.com>
>>> wrote:
>>>>
>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>> <serguei.sagalovitch@amd.com> wrote:
>>>>>
>>>>> I personally like "device-DAX" idea but my concerns are:
>>>>>
>>>>> -  How well it will co-exists with the  DRM infrastructure /
>>>>> implementations
>>>>>     in part dealing with CPU pointers?
>>>>
>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>>> not migratable. To be honest I do not know how well that co-exists
>>>> with drm infrastructure.
>>>>
>>>>> -  How well we will be able to handle case when we need to
>>>>> "move"/"evict"
>>>>>     memory/data to the new location so CPU pointer should point to the
>>>>> new
>>>>> physical location/address
>>>>>      (and may be not in PCI device memory at all)?
>>>>
>>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>>> interface is for cases where all that is needed is a direct-mapping to
>>>> a statically-allocated physical-address range be it persistent memory
>>>> or some other special reserved memory range.
>>>
>>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>>> pull off) I think we want all the magic in core mm, i.e. migration and
>>> overcommit. At least that seems to be the very strong drive in all
>>> general-purpose gpu abstractions and implementations, where memory is
>>> allocated with malloc, and then mapped/moved into vram/gpu address
>>> space through some magic,
>>
>> It is possible that there is other way around: memory is requested to be
>> allocated and should be kept in vram for  performance reason but due
>> to possible overcommit case we need at least temporally to "move" such
>> allocation to system memory.
>
> With migration I meant migrating both ways of course. And with stuff
> like numactl we can also influence where exactly the malloc'ed memory
> is allocated originally, at least if we'd expose the vram range as a
> very special numa node that happens to be far away and not hold any
> cpu cores.

I don't think we should be using numa distance to reverse engineer a
certain allocation behavior.  The latency data should be truthful, but
you're right we'll need a mechanism to keep general purpose
allocations out of that range by default. Btw, strict isolation is
another design point of device-dax, but I think in this case we're
describing something between the two extremes of full isolation and
full compatibility with existing numactl apis.


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-22 21:21             ` Dan Williams
@ 2016-11-22 22:21               ` Sagalovitch, Serguei
  2016-11-23  7:49               ` Daniel Vetter
  1 sibling, 0 replies; 126+ messages in thread
From: Sagalovitch, Serguei @ 2016-11-22 22:21 UTC (permalink / raw)
  To: Dan Williams, Daniel Vetter
  Cc: Dave Hansen, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, linux-kernel, dri-devel, Koenig, Christian,
	Sander, Ben, Suthikulpanit, Suravee, Deucher, Alexander, Blinzer,
	Paul, Linux-media

> I don't think we should be using numa distance to reverse engineer a
> certain allocation behavior.  The latency data should be truthful, but
> you're right we'll need a mechanism to keep general purpose
> allocations out of that range by default. 

Just to clarify: are you proposing/thinking of utilizing the NUMA API for
such (VRAM) allocations?


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-22 21:21             ` Dan Williams
  2016-11-22 22:21               ` Sagalovitch, Serguei
@ 2016-11-23  7:49               ` Daniel Vetter
  2016-11-23  8:51                 ` Christian König
  2016-11-23 17:03                 ` Dave Hansen
  1 sibling, 2 replies; 126+ messages in thread
From: Daniel Vetter @ 2016-11-23  7:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: Daniel Vetter, Serguei Sagalovitch, Dave Hansen,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, linux-kernel, dri-devel, Koenig, Christian, Sander, Ben,
	Suthikulpanit, Suravee, Deucher, Alexander, Blinzer, Paul,
	Linux-media

On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
> > <serguei.sagalovitch@amd.com> wrote:
> >>
> >> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
> >>>
> >>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.williams@intel.com>
> >>> wrote:
> >>>>
> >>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
> >>>> <serguei.sagalovitch@amd.com> wrote:
> >>>>>
> >>>>> I personally like "device-DAX" idea but my concerns are:
> >>>>>
> >>>>> -  How well it will co-exists with the  DRM infrastructure /
> >>>>> implementations
> >>>>>     in part dealing with CPU pointers?
> >>>>
> >>>> Inside the kernel a device-DAX range is "just memory" in the sense
> >>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
> >>>> not migratable. To be honest I do not know how well that co-exists
> >>>> with drm infrastructure.
> >>>>
> >>>>> -  How well we will be able to handle case when we need to
> >>>>> "move"/"evict"
> >>>>>     memory/data to the new location so CPU pointer should point to the
> >>>>> new
> >>>>> physical location/address
> >>>>>      (and may be not in PCI device memory at all)?
> >>>>
> >>>> So, device-DAX deliberately avoids support for in-kernel migration or
> >>>> overcommit. Those cases are left to the core mm or drm. The device-dax
> >>>> interface is for cases where all that is needed is a direct-mapping to
> >>>> a statically-allocated physical-address range be it persistent memory
> >>>> or some other special reserved memory range.
> >>>
> >>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
> >>> pull off) I think we want all the magic in core mm, i.e. migration and
> >>> overcommit. At least that seems to be the very strong drive in all
> >>> general-purpose gpu abstractions and implementations, where memory is
> >>> allocated with malloc, and then mapped/moved into vram/gpu address
> >>> space through some magic,
> >>
> >> It is possible that there is other way around: memory is requested to be
> >> allocated and should be kept in vram for  performance reason but due
> >> to possible overcommit case we need at least temporally to "move" such
> >> allocation to system memory.
> >
> > With migration I meant migrating both ways of course. And with stuff
> > like numactl we can also influence where exactly the malloc'ed memory
> > is allocated originally, at least if we'd expose the vram range as a
> > very special numa node that happens to be far away and not hold any
> > cpu cores.
> 
> I don't think we should be using numa distance to reverse engineer a
> certain allocation behavior.  The latency data should be truthful, but
> you're right we'll need a mechanism to keep general purpose
> allocations out of that range by default. Btw, strict isolation is
> another design point of device-dax, but I think in this case we're
> describing something between the two extremes of full isolation and
> full compatibility with existing numactl apis.

Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
to reuse all the existing allocation policies directly, those won't work.
So at boot-up your default numa policy would exclude any vram nodes.

But I think (as an -mm layman) that numa gives us a lot of the tools and
policy interface that we need to implement what we want for gpus.
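
As a sketch of what that could mean in practice (assuming, purely for
illustration, that a device exposed its vram as NUMA node 2 and that the
existing syscalls worked unchanged on it), userspace could steer a malloc'ed
buffer onto the vram node with the current policy API:

    #include <numaif.h>     /* mbind(); link with -lnuma */
    #include <stdlib.h>

    static void *alloc_on_vram_node(size_t len, int vram_node)
    {
            void *p = aligned_alloc(4096, len);
            unsigned long nodemask = 1UL << vram_node;

            if (!p)
                    return NULL;
            /* MPOL_BIND: pages for this range may only come from vram_node. */
            if (mbind(p, len, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, MPOL_MF_MOVE)) {
                    free(p);
                    return NULL;
            }
            return p;
    }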

Wrt isolation: There's a sliding scale of what different users expect,
from full-auto everything (including migrating pages around if needed) to
full isolation; all of it seems to be on the table. As long as we keep vram
nodes out of any default allocation numasets, full isolation should be possible.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23  7:49               ` Daniel Vetter
@ 2016-11-23  8:51                 ` Christian König
  2016-11-23 19:27                   ` Serguei Sagalovitch
  2016-11-23 17:03                 ` Dave Hansen
  1 sibling, 1 reply; 126+ messages in thread
From: Christian König @ 2016-11-23  8:51 UTC (permalink / raw)
  To: Dan Williams, Serguei Sagalovitch, Dave Hansen,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, linux-kernel, dri-devel, Sander, Ben, Suthikulpanit,
	Suravee, Deucher, Alexander, Blinzer, Paul, Linux-media

Am 23.11.2016 um 08:49 schrieb Daniel Vetter:
> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
>>> <serguei.sagalovitch@amd.com> wrote:
>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.williams@intel.com>
>>>>> wrote:
>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>>>> <serguei.sagalovitch@amd.com> wrote:
>>>>>>> I personally like "device-DAX" idea but my concerns are:
>>>>>>>
>>>>>>> -  How well it will co-exists with the  DRM infrastructure /
>>>>>>> implementations
>>>>>>>      in part dealing with CPU pointers?
>>>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>>>>> not migratable. To be honest I do not know how well that co-exists
>>>>>> with drm infrastructure.
>>>>>>
>>>>>>> -  How well we will be able to handle case when we need to
>>>>>>> "move"/"evict"
>>>>>>>      memory/data to the new location so CPU pointer should point to the
>>>>>>> new
>>>>>>> physical location/address
>>>>>>>       (and may be not in PCI device memory at all)?
>>>>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>>>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>>>>> interface is for cases where all that is needed is a direct-mapping to
>>>>>> a statically-allocated physical-address range be it persistent memory
>>>>>> or some other special reserved memory range.
>>>>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>>>>> pull off) I think we want all the magic in core mm, i.e. migration and
>>>>> overcommit. At least that seems to be the very strong drive in all
>>>>> general-purpose gpu abstractions and implementations, where memory is
>>>>> allocated with malloc, and then mapped/moved into vram/gpu address
>>>>> space through some magic,
>>>> It is possible that there is other way around: memory is requested to be
>>>> allocated and should be kept in vram for  performance reason but due
>>>> to possible overcommit case we need at least temporally to "move" such
>>>> allocation to system memory.
>>> With migration I meant migrating both ways of course. And with stuff
>>> like numactl we can also influence where exactly the malloc'ed memory
>>> is allocated originally, at least if we'd expose the vram range as a
>>> very special numa node that happens to be far away and not hold any
>>> cpu cores.
>> I don't think we should be using numa distance to reverse engineer a
>> certain allocation behavior.  The latency data should be truthful, but
>> you're right we'll need a mechanism to keep general purpose
>> allocations out of that range by default. Btw, strict isolation is
>> another design point of device-dax, but I think in this case we're
>> describing something between the two extremes of full isolation and
>> full compatibility with existing numactl apis.
> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
> to reuse all the existing allocation policies directly, those won't work.
> So at boot-up your default numa policy would exclude any vram nodes.
>
> But I think (as an -mm layman) that numa gives us a lot of the tools and
> policy interface that we need to implement what we want for gpus.

Agree completely. From a ten-mile-high view our GPUs are just command
processors with local memory as well.

Basically this is also the whole idea of what AMD is pushing with HSA 
for a while.

It's just that a lot of problems start to pop up when you look at all 
the nasty details. For example only part of the GPU memory is usually 
accessible by the CPU.

So even if numa nodes provide a good foundation for this, I think there
is still a lot of code to write.

BTW: I should probably start to read into the numa code of the kernel. 
Any good pointers for that?

Regards,
Christian.

> Wrt isolation: There's a sliding scale of what different users expect,
> from full auto everything, including migrating pages around if needed to
> full isolation all seems to be on the table. As long as we keep vram nodes
> out of any default allocation numasets, full isolation should be possible.
> -Daniel


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23  7:49               ` Daniel Vetter
  2016-11-23  8:51                 ` Christian König
@ 2016-11-23 17:03                 ` Dave Hansen
  1 sibling, 0 replies; 126+ messages in thread
From: Dave Hansen @ 2016-11-23 17:03 UTC (permalink / raw)
  To: Dan Williams, Serguei Sagalovitch, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, linux-kernel, dri-devel,
	Koenig, Christian, Sander, Ben, Suthikulpanit, Suravee, Deucher,
	Alexander, Blinzer, Paul, Linux-media

On 11/22/2016 11:49 PM, Daniel Vetter wrote:
> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
> to reuse all the existing allocation policies directly, those won't work.
> So at boot-up your default numa policy would exclude any vram nodes.
> 
> But I think (as an -mm layman) that numa gives us a lot of the tools and
> policy interface that we need to implement what we want for gpus.

Are you suggesting creating NUMA nodes for video RAM (I assume that's
what you mean by vram) where that RAM is not at all CPU-accessible?


* Re: Enabling peer to peer device transactions for PCIe devices
       [not found]   ` <75a1f44f-c495-7d1e-7e1c-17e89555edba@amd.com>
  2016-11-22 20:01     ` Dan Williams
@ 2016-11-23 17:13     ` Logan Gunthorpe
  2016-11-23 17:27       ` Bart Van Assche
  2016-11-23 19:05       ` Jason Gunthorpe
  1 sibling, 2 replies; 126+ messages in thread
From: Logan Gunthorpe @ 2016-11-23 17:13 UTC (permalink / raw)
  To: Serguei Sagalovitch, Dan Williams, Deucher, Alexander
  Cc: linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media

Hey,

On 22/11/16 11:59 AM, Serguei Sagalovitch wrote:
> -  How well we will be able to handle case when we need to "move"/"evict"
>    memory/data to the new location so CPU pointer should point to the
> new physical location/address
>     (and may be not in PCI device memory at all)?

IMO any memory that has been registered for a P2P transaction should be
locked from being evicted. So if there's a get_user_pages call it needs
to be pinned until the put_page. The main issue being with the RDMA
case: handling an eviction when a chunk of memory has been registered as
an MR would be very tricky. The MR may be relied upon by another host
and the kernel would have to inform user-space the MR was invalid then
user-space would have to tell the remote application. This seems like a
lot of burden to place on applications and may be subject to timing
issues. Either that or all RDMA applications need to be written with the
assumption that their target memory could go away at any time.

More generally, if you tell one PCI device to do a DMA transfer to
another PCI device's BAR space and the target memory gets evicted, then
the DMA transaction needs to be aborted, which means every driver doing
the transfer would need special support for this. If the memory can be
relied on not to be evicted, then existing drivers should work unmodified
(i.e. O_DIRECT to/from an NVMe card would just work).

I feel the better approach is to pin memory subject to P2P transactions
as is typically done with DMA transfers to main memory.
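
An untested sketch of the lifetime being described, with invented names: the
pages pinned (and DMA-mapped) at registration time are only released when the
registration itself is torn down, so nothing can evict them in between:

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>
    #include <linux/slab.h>
    #include <linux/mm.h>

    struct example_p2p_registration {
            struct device   *dma_dev;
            struct page    **pages;
            int              npages;
            struct sg_table  sgt;
    };

    static void example_p2p_registration_release(struct example_p2p_registration *reg)
    {
            int i;

            dma_unmap_sg(reg->dma_dev, reg->sgt.sgl, reg->sgt.orig_nents,
                         DMA_BIDIRECTIONAL);
            sg_free_table(&reg->sgt);
            for (i = 0; i < reg->npages; i++)
                    put_page(reg->pages[i]);  /* eviction possible only now */
            kfree(reg->pages);
    }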

Logan


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 17:13     ` Logan Gunthorpe
@ 2016-11-23 17:27       ` Bart Van Assche
  2016-11-23 18:40         ` Dan Williams
  2016-11-23 19:06         ` Serguei Sagalovitch
  2016-11-23 19:05       ` Jason Gunthorpe
  1 sibling, 2 replies; 126+ messages in thread
From: Bart Van Assche @ 2016-11-23 17:27 UTC (permalink / raw)
  To: Logan Gunthorpe, Serguei Sagalovitch, Dan Williams, Deucher, Alexander
  Cc: linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media

On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
> IMO any memory that has been registered for a P2P transaction should be
> locked from being evicted. So if there's a get_user_pages call it needs
> to be pinned until the put_page. The main issue being with the RDMA
> case: handling an eviction when a chunk of memory has been registered as
> an MR would be very tricky. The MR may be relied upon by another host
> and the kernel would have to inform user-space the MR was invalid then
> user-space would have to tell the remote application.

Hello Logan,

Are you aware that the Linux kernel already supports ODP (On Demand 
Paging)? See also the output of git grep -nHi on.demand.paging. See also 
https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.

Bart.


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 17:27       ` Bart Van Assche
@ 2016-11-23 18:40         ` Dan Williams
  2016-11-23 19:12           ` Jason Gunthorpe
  2016-11-23 19:06         ` Serguei Sagalovitch
  1 sibling, 1 reply; 126+ messages in thread
From: Dan Williams @ 2016-11-23 18:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Logan Gunthorpe, Serguei Sagalovitch, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media

On Wed, Nov 23, 2016 at 9:27 AM, Bart Van Assche
<bart.vanassche@sandisk.com> wrote:
> On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
>>
>> IMO any memory that has been registered for a P2P transaction should be
>> locked from being evicted. So if there's a get_user_pages call it needs
>> to be pinned until the put_page. The main issue being with the RDMA
>> case: handling an eviction when a chunk of memory has been registered as
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid then
>> user-space would have to tell the remote application.
>
>
> Hello Logan,
>
> Are you aware that the Linux kernel already supports ODP (On Demand Paging)?
> See also the output of git grep -nHi on.demand.paging. See also
> https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.
>

I don't think that was designed for the case where the backing memory
is a special/static physical address range rather than anonymous
"System RAM", right?

I think we should handle the graphics P2P concerns separately from the
general P2P-DMA case since the latter does not require the higher
order memory management facilities. Using ZONE_DEVICE/DAX mappings to
avoid changes to every driver that wants to support P2P-DMA separately
from typical DMA still seems the path of least resistance.


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 17:13     ` Logan Gunthorpe
  2016-11-23 17:27       ` Bart Van Assche
@ 2016-11-23 19:05       ` Jason Gunthorpe
  2016-11-23 19:14         ` Serguei Sagalovitch
  1 sibling, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-23 19:05 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Serguei Sagalovitch, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media

On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:

> an MR would be very tricky. The MR may be relied upon by another host
> and the kernel would have to inform user-space the MR was invalid then
> user-space would have to tell the remote application.

As Bart says, it would be best to be combined with something like
Mellanox's ODP MRs, which allows a page to be evicted and then trigger
a CPU interrupt if a DMA is attempted so it can be brought back. This
includes the usual fencing mechanism so the CPU can block, flush, and
then evict a page coherently.

This is the general direction the industry is going in: linking PCI DMA
directly to dynamic user page tables, including support for demand
faulting and synchronicity.

Mellanox ODP is a rough implementation of mirroring a process's page
table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is
probably a good example of where this is ultimately headed.

CAPI allows a PCI DMA to directly target an ASID associated with a
user process and then use the usual CPU machinery to do the page
translation for the DMA. This includes page faults for evicted pages,
and obviously allows eviction and migration..

So, of all the solutions in the original list, I would discard
anything that isn't VMA focused. Emulating what CAPI does in hardware
with software is probably the best choice, or we have to do it all
again when CAPI style hardware broadly rolls out :(

DAX and GPU allocators should create VMAs and manipulate them in the
usual way to achieve migration, windowing, cache&mirror, movement or
swap of the potentially peer-peer memory pages. They would have to
respect the usual rules for a VMA, including pinning.

DMA drivers would use the usual approaches for dealing with DMA from
a VMA: short term pin or long term coherent translation mirror.

So, in my view (looking from RDMA), the main problem with peer-peer is:
how do you DMA translate VMAs that point at non-struct-page memory?

Does HMM solve the peer-peer problem? Does it do it generically or
only for drivers that are mirroring translation tables?

From an RDMA perspective we could use something other than
get_user_pages() to pin and DMA translate a VMA if the core community
could decide on an API. eg get_user_dma_sg() would probably be quite
usable.
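
Purely as a strawman, the hypothetical API mentioned above might look
something like this (nothing of the sort exists today; the names and
signatures are invented):

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>

    /* Pin (or otherwise stabilize) the VMA range behind uaddr and return a
     * DMA-mapped scatterlist for 'dev', whether the range is backed by
     * system RAM or by peer device memory. */
    int get_user_dma_sg(struct device *dev, unsigned long uaddr,
                        size_t length, enum dma_data_direction dir,
                        struct sg_table *sgt);

    /* Undo the above: unmap and release the range. */
    void put_user_dma_sg(struct device *dev, struct sg_table *sgt,
                         enum dma_data_direction dir);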

Jason


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 17:27       ` Bart Van Assche
  2016-11-23 18:40         ` Dan Williams
@ 2016-11-23 19:06         ` Serguei Sagalovitch
  1 sibling, 0 replies; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-23 19:06 UTC (permalink / raw)
  To: Bart Van Assche, Logan Gunthorpe, Dan Williams, Deucher, Alexander
  Cc: linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, haggaie

On 2016-11-23 12:27 PM, Bart Van Assche wrote:
> On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
>> IMO any memory that has been registered for a P2P transaction should be
>> locked from being evicted. So if there's a get_user_pages call it needs
>> to be pinned until the put_page. The main issue being with the RDMA
>> case: handling an eviction when a chunk of memory has been registered as
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid then
>> user-space would have to tell the remote application.
>
> Hello Logan,
>
> Are you aware that the Linux kernel already supports ODP (On Demand 
> Paging)? See also the output of git grep -nHi on.demand.paging. See 
> also 
> https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.
>
> Bart.
My understanding is that the main problems are (a) h/w support and (b)
compatibility with IB Verbs semantics.


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 18:40         ` Dan Williams
@ 2016-11-23 19:12           ` Jason Gunthorpe
  2016-11-23 19:24             ` Serguei Sagalovitch
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-23 19:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Bart Van Assche, Logan Gunthorpe, Serguei Sagalovitch, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media

On Wed, Nov 23, 2016 at 10:40:47AM -0800, Dan Williams wrote:

> I don't think that was designed for the case where the backing memory
> is a special/static physical address range rather than anonymous
> "System RAM", right?

The hardware doesn't care where the memory is. ODP is just a generic
mechanism to provide demand-fault behavior for a mirrored page table.

ODP has the same issue as everything else, it needs to translate a
page table entry into a DMA address, and we have no API to do that
when the page table points to peer-peer memory.
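
To make that gap concrete, here is a minimal (untested) sketch: for a PTE
backed by an ordinary struct page the translation step is a single existing
call, while for a PTE pointing at another device's BAR there is no
equivalent API today (function name invented):

    #include <linux/dma-mapping.h>
    #include <linux/mm.h>

    static dma_addr_t example_translate(struct device *dev, struct page *page)
    {
            /*
             * Fine when 'page' is ordinary system RAM.  When the PTE points
             * at peer-to-peer memory (another device's BAR) there is no
             * generic call to produce a bus address usable by 'dev'.
             */
            return dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
    }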

Jason


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 19:05       ` Jason Gunthorpe
@ 2016-11-23 19:14         ` Serguei Sagalovitch
  2016-11-23 19:32           ` Jason Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-23 19:14 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: Dan Williams, Deucher, Alexander, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, Bridgman, John,
	linux-kernel, dri-devel, Koenig, Christian, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran


On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:
>
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid then
>> user-space would have to tell the remote application.
> As Bart says, it would be best to be combined with something like
> Mellanox's ODP MRs, which allows a page to be evicted and then trigger
> a CPU interrupt if a DMA is attempted so it can be brought back.
Please note that in the general case (including the MR one) we could have a
"page fault" from a different PCIe device, so all PCIe devices must be
synchronized.
> includes the usual fencing mechanism so the CPU can block, flush, and
> then evict a page coherently.
>
> This is the general direction the industry is going in: Link PCI DMA
> directly to dynamic user page tabels, including support for demand
> faulting and synchronicity.
>
> Mellanox ODP is a rough implementation of mirroring a process's page
> table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is
> probably a good example of where this is ultimately headed.
>
> CAPI allows a PCI DMA to directly target an ASID associated with a
> user process and then use the usual CPU machinery to do the page
> translation for the DMA. This includes page faults for evicted pages,
> and obviously allows eviction and migration..
>
> So, of all the solutions in the original list, I would discard
> anything that isn't VMA focused. Emulating what CAPI does in hardware
> with software is probably the best choice, or we have to do it all
> again when CAPI style hardware broadly rolls out :(
>
> DAX and GPU allocators should create VMAs and manipulate them in the
> usual way to achieve migration, windowing, cache&mirror, movement or
> swap of the potentially peer-peer memory pages. They would have to
> respect the usual rules for a VMA, including pinning.
>
> DMA drivers would use the usual approaches for dealing with DMA from
> a VMA: short term pin or long term coherent translation mirror.
>
> So, to my view (looking from RDMA), the main problem with peer-peer is
> how do you DMA translate VMA's that point at non struct page memory?
>
> Does HMM solve the peer-peer problem? Does it do it generically or
> only for drivers that are mirroring translation tables?
In its current form HMM doesn't solve the peer-peer problem. Currently it
allows "mirroring" of "malloc" memory on the GPU, which is not always what is
needed. Additionally, there is a need to be able to share VRAM allocations
between different processes.
>  From a RDMA perspective we could use something other than
> get_user_pages() to pin and DMA translate a VMA if the core community
> could decide on an API. eg get_user_dma_sg() would probably be quite
> usable.
>
> Jason


* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 19:12           ` Jason Gunthorpe
@ 2016-11-23 19:24             ` Serguei Sagalovitch
  0 siblings, 0 replies; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-23 19:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Bart Van Assche, Logan Gunthorpe, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media



On 2016-11-23 02:12 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 10:40:47AM -0800, Dan Williams wrote:
>
>> I don't think that was designed for the case where the backing memory
>> is a special/static physical address range rather than anonymous
>> "System RAM", right?
> The hardware doesn't care where the memory is. ODP is just a generic
> mechanism to provide demand-fault behavior for a mirrored page table.
>
> ODP has the same issue as everything else, it needs to translate a
> page table entry into a DMA address, and we have no API to do that
> when the page table points to peer-peer memory.
>
> Jason
I would like to note that for graphics applications (especially for VR
support) we should avoid the ODP case at any cost during graphics command
execution, due to the requirement for smooth and predictable playback. We
want to load / "pin" all required resources before the graphics processor
begins to touch them. This is not so critical for compute applications.
Because only the graphics / compute stack knows which resources will be in
use, as well as all the relevant statistics, only the graphics stack is
capable of making the correct decision about when and _where_ to evict, as
well as when and _where_ to put memory back.

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23  8:51                 ` Christian König
@ 2016-11-23 19:27                   ` Serguei Sagalovitch
  0 siblings, 0 replies; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-23 19:27 UTC (permalink / raw)
  To: Christian König, Dan Williams, Dave Hansen,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, linux-kernel, dri-devel, Sander, Ben, Suthikulpanit,
	Suravee, Deucher, Alexander, Blinzer, Paul, Linux-media


On 2016-11-23 03:51 AM, Christian König wrote:
> Am 23.11.2016 um 08:49 schrieb Daniel Vetter:
>> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
>>>> <serguei.sagalovitch@amd.com> wrote:
>>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams 
>>>>>> <dan.j.williams@intel.com>
>>>>>> wrote:
>>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>>>>> <serguei.sagalovitch@amd.com> wrote:
>>>>>>>> I personally like "device-DAX" idea but my concerns are:
>>>>>>>>
>>>>>>>> -  How well it will co-exists with the  DRM infrastructure /
>>>>>>>> implementations
>>>>>>>>      in part dealing with CPU pointers?
>>>>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>>>>> that you can perform pfn_to_page() on it and issue I/O, but the 
>>>>>>> vma is
>>>>>>> not migratable. To be honest I do not know how well that co-exists
>>>>>>> with drm infrastructure.
>>>>>>>
>>>>>>>> -  How well we will be able to handle case when we need to
>>>>>>>> "move"/"evict"
>>>>>>>>      memory/data to the new location so CPU pointer should 
>>>>>>>> point to the
>>>>>>>> new
>>>>>>>> physical location/address
>>>>>>>>       (and may be not in PCI device memory at all)?
>>>>>>> So, device-DAX deliberately avoids support for in-kernel 
>>>>>>> migration or
>>>>>>> overcommit. Those cases are left to the core mm or drm. The 
>>>>>>> device-dax
>>>>>>> interface is for cases where all that is needed is a 
>>>>>>> direct-mapping to
>>>>>>> a statically-allocated physical-address range be it persistent 
>>>>>>> memory
>>>>>>> or some other special reserved memory range.
>>>>>> For some of the fancy use-cases (e.g. to be comparable to what 
>>>>>> HMM can
>>>>>> pull off) I think we want all the magic in core mm, i.e. 
>>>>>> migration and
>>>>>> overcommit. At least that seems to be the very strong drive in all
>>>>>> general-purpose gpu abstractions and implementations, where 
>>>>>> memory is
>>>>>> allocated with malloc, and then mapped/moved into vram/gpu address
>>>>>> space through some magic,
>>>>> It is possible that there is other way around: memory is requested 
>>>>> to be
>>>>> allocated and should be kept in vram for  performance reason but due
>>>>> to possible overcommit case we need at least temporally to "move" 
>>>>> such
>>>>> allocation to system memory.
>>>> With migration I meant migrating both ways of course. And with stuff
>>>> like numactl we can also influence where exactly the malloc'ed memory
>>>> is allocated originally, at least if we'd expose the vram range as a
>>>> very special numa node that happens to be far away and not hold any
>>>> cpu cores.
>>> I don't think we should be using numa distance to reverse engineer a
>>> certain allocation behavior.  The latency data should be truthful, but
>>> you're right we'll need a mechanism to keep general purpose
>>> allocations out of that range by default. Btw, strict isolation is
>>> another design point of device-dax, but I think in this case we're
>>> describing something between the two extremes of full isolation and
>>> full compatibility with existing numactl apis.
>> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
>> to reuse all the existing allocation policies directly, those won't 
>> work.
>> So at boot-up your default numa policy would exclude any vram nodes.
>>
>> But I think (as an -mm layman) that numa gives us a lot of the tools and
>> policy interface that we need to implement what we want for gpus.
>
> Agree completely. From a ten mile high view our GPUs are just command 
> processors with local memory as well .
>
> Basically this is also the whole idea of what AMD is pushing with HSA 
> for a while.
>
> It's just that a lot of problems start to pop up when you look at all 
> the nasty details. For example only part of the GPU memory is usually 
> accessible by the CPU.
>
> So even when numa nodes expose a good foundation for this I think 
> there is still a lot of code to write.
>
> BTW: I should probably start to read into the numa code of the kernel. 
> Any good pointers for that?
I would assume that the "page" allocation logic itself should be inside the
graphics driver, due to possibly different requirements, especially from
graphics: alignment, etc.

>
> Regards,
> Christian.
>
>> Wrt isolation: There's a sliding scale of what different users expect,
>> from full auto everything, including migrating pages around if needed to
>> full isolation all seems to be on the table. As long as we keep vram 
>> nodes
>> out of any default allocation numasets, full isolation should be 
>> possible.
>> -Daniel
>
>

Sincerely yours,
Serguei Sagalovitch

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 19:14         ` Serguei Sagalovitch
@ 2016-11-23 19:32           ` Jason Gunthorpe
       [not found]             ` <c2c88376-5ba7-37d1-4d3e-592383ebb00a@amd.com>
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-23 19:32 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: Logan Gunthorpe, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran

On Wed, Nov 23, 2016 at 02:14:40PM -0500, Serguei Sagalovitch wrote:
> 
> On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:

> >As Bart says, it would be best to be combined with something like
> >Mellanox's ODP MRs, which allows a page to be evicted and then trigger
> >a CPU interrupt if a DMA is attempted so it can be brought back.

> Please note that in the general case (including the MR one) we could have
> a "page fault" from a different PCIe device. So all PCIe devices must
> be synchronized.

Standard RDMA MRs require pinned pages, the DMA address cannot change
while the MR exists (there is no hardware support for this at all), so
page faulting from any other device is out of the question while they
exist. This is the same requirement as typical simple driver DMA which
requires pages pinned until the simple device completes DMA.

ODP RDMA MRs do not require that, they just page fault like the CPU or
really anything else, and the kernel has to make sense of concurrent page
faults from multiple sources.

The upshot is that GPU scenarios that rely on highly dynamic
virtual->physical translation cannot sanely be combined with standard
long-life RDMA MRs.

Certainly, any solution for GPUs must follow the typical page pinning
semantics, changing the DMA address of a page must be blocked while
any DMA is in progress.
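
For illustration, the user-space side of this difference is just the MR
access flags - a minimal libibverbs sketch (assuming an ODP-capable HCA;
IBV_ACCESS_ON_DEMAND fails on hardware without ODP):

#include <infiniband/verbs.h>

/* With ODP the HCA faults pages in and follows the CPU page tables, so
 * the pages are not pinned for the lifetime of the MR.  Without the
 * flag this is a standard MR with pinned pages and fixed DMA addresses. */
static struct ibv_mr *reg_odp_mr(struct ibv_pd *pd, void *buf, size_t len)
{
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_ON_DEMAND);
}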

> >Does HMM solve the peer-peer problem? Does it do it generically or
> >only for drivers that are mirroring translation tables?

> In its current form HMM doesn't solve the peer-to-peer problem. Currently it
> allows "mirroring" of "malloc" memory on the GPU, which is not always what is
> needed. Additionally there is a need to be able to share VRAM allocations
> between different processes.

Humm, so it can be removed from Alexander's list then :\

As Dan suggested, maybe we need to do both. Some kind of fix for
get_user_pages() for smaller mappings (eg ZONE_DEVICE) and a mandatory
API conversion to get_user_dma_sg() for other cases?

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
       [not found]             ` <c2c88376-5ba7-37d1-4d3e-592383ebb00a@amd.com>
@ 2016-11-23 20:33               ` Jason Gunthorpe
  2016-11-23 21:11                 ` Logan Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-23 20:33 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: Logan Gunthorpe, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran

On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote:

>    We do not want to have "highly" dynamic translation due to
>    performance cost.  We need to support "overcommit" but would
>    like to minimize impact.  To support RDMA MRs for GPU/VRAM/PCIe
>    device memory (which is must) we need either globally force
>    pinning for the scope of "get_user_pages() / "put_pages" or have
>    special handling for RDMA MRs and similar cases.

As I said, there is no possible special handling. Standard IB hardware
does not support changing the DMA address once a MR is created. Forget
about doing that.

Only ODP hardware allows changing the DMA address on the fly, and it
works at the page table level. We do not need special handling for
RDMA.

>    Generally it could be difficult to correctly handle "DMA in
>    progress" due to the fact that (a) DMA could originate from
>    numerous PCIe devices simultaneously, including requests to
>    receive network data.

We handle all of this today in the kernel via the page pinning mechanism.
This needs to be copied into peer-peer memory and GPU memory schemes
as well. A pinned page means the DMA address cannot be changed and that
there is active non-CPU access to it.

Any hardware that does not support page table mirroring must go this
route.
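
For reference, the usual pin-and-map path looks roughly like this (a
sketch with error handling trimmed; the 2016-era get_user_pages_fast()
signature is assumed):

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Pin user pages and map them for DMA.  Until dma_unmap_sg()/put_page(),
 * the DMA addresses must not change - this is the "pinned" guarantee. */
static int pin_and_map(struct device *dev, unsigned long uaddr, int npages,
                       struct page **pages, struct scatterlist *sgl)
{
        int i, got, mapped;

        got = get_user_pages_fast(uaddr, npages, 1 /* write */, pages);
        if (got <= 0)
                return -EFAULT;

        sg_init_table(sgl, got);
        for (i = 0; i < got; i++)
                sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

        mapped = dma_map_sg(dev, sgl, got, DMA_BIDIRECTIONAL);
        return mapped ? mapped : -ENOMEM;
}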

> (b) in the HSA case DMA could originate from user space without kernel
>    driver knowledge.  So without corresponding h/w support
>    everywhere I do not see how it could be solved effectively.

All truly user-triggered DMA must go through some kind of coherent page
table mirroring scheme (eg this is what CAPI does, presumably AMD's HSA
is similar). A page table mirroring scheme is basically the same as
what ODP does.

Like I said, this is the direction the industry seems to be moving in,
so any solution here should focus on VMAs/page tables as the way to link
the peer-peer devices.

To me this means at least items #1 and #3 should be removed from
Alexander's list.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 20:33               ` Jason Gunthorpe
@ 2016-11-23 21:11                 ` Logan Gunthorpe
  2016-11-23 21:55                   ` Jason Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-11-23 21:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Serguei Sagalovitch
  Cc: Dan Williams, Deucher, Alexander, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, Bridgman, John,
	linux-kernel, dri-devel, Koenig, Christian, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran



On 23/11/16 01:33 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote:
> 
>>    We do not want to have "highly" dynamic translation due to
>>    performance cost.  We need to support "overcommit" but would
>>    like to minimize impact.  To support RDMA MRs for GPU/VRAM/PCIe
>>    device memory (which is must) we need either globally force
>>    pinning for the scope of "get_user_pages() / "put_pages" or have
>>    special handling for RDMA MRs and similar cases.
> 
> As I said, there is no possible special handling. Standard IB hardware
> does not support changing the DMA address once a MR is created. Forget
> about doing that.

Yeah, that's essentially the point I was trying to make. Not to mention
all the other unrelated hardware that can't DMA to an address that might
disappear mid-transfer.

> Only ODP hardware allows changing the DMA address on the fly, and it
> works at the page table level. We do not need special handling for
> RDMA.

I am aware of ODP but, as noted by others, it doesn't provide a general
solution to the points above.

> Like I said, this is the direction the industry seems to be moving in,
> so any solution here should focus on VMAs/page tables as the way to link
> the peer-peer devices.

Yes, this was the appeal to us of using ZONE_DEVICE.

> To me this means at least items #1 and #3 should be removed from
> Alexander's list.

It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
really the same option. iopmem is really just one way to get BAR
addresses to user-space while inside the kernel it's ZONE_DEVICE.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 21:11                 ` Logan Gunthorpe
@ 2016-11-23 21:55                   ` Jason Gunthorpe
  2016-11-23 22:42                     ` Dan Williams
                                       ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-23 21:55 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Serguei Sagalovitch, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran

On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
> > As I said, there is no possible special handling. Standard IB hardware
> > does not support changing the DMA address once a MR is created. Forget
> > about doing that.
> 
> Yeah, that's essentially the point I was trying to make. Not to mention
> all the other unrelated hardware that can't DMA to an address that might
> disappear mid-transfer.

Right, it is impossible to ask for generic page migration with ongoing
DMA. That is simply not supported by any of the hardware at all.

> > Only ODP hardware allows changing the DMA address on the fly, and it
> > works at the page table level. We do not need special handling for
> > RDMA.
> 
> I am aware of ODP but, noted by others, it doesn't provide a general
> solution to the points above.

How do you mean?

Perhaps I am not following what Serguei is asking for, but I
understood the desire was for a complex GPU allocator that could
migrate pages between GPU and CPU memory under control of the GPU
driver, among other things. The desire is for DMA to continue to work
even after these migrations happen.

Page table mirroring *is* the general solution for this problem. The
GPU driver controls the VMA and the DMA driver mirrors that VMA.

Do you know of another option that doesn't just degenerate to page
table mirroring??

Remember, there are two facets to the RDMA ODP implementation, I feel
there is some confusion here..

The crucial part for this discussion is the ability to fence and block
DMA for a specific range. This is the hardware capability that lets
page migration happen: fence&block DMA, migrate page, update page
table in HCA, unblock DMA.

Without that hardware support the DMA address must be unchanging, and
there is nothing we can do about it. This is why standard IB hardware
must have fixed MRs - it lacks the fence&block capability.

The other part is the page faulting implementation, but that is not
required, and to Serguei's point, is not desired for GPU anyhow.

> > To me this means at least items #1 and #3 should be removed from
> > Alexander's list.
> 
> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
> really the same option. iopmem is really just one way to get BAR
> addresses to user-space while inside the kernel it's ZONE_DEVICE.

Seems fine for RDMA?

Didn't we just strike off everything on the list except #2? :\

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 21:55                   ` Jason Gunthorpe
@ 2016-11-23 22:42                     ` Dan Williams
  2016-11-23 23:25                       ` Jason Gunthorpe
  2016-11-24  0:40                     ` Sagalovitch, Serguei
  2016-11-24  1:25                     ` Logan Gunthorpe
  2 siblings, 1 reply; 126+ messages in thread
From: Dan Williams @ 2016-11-23 22:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Serguei Sagalovitch, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran

On Wed, Nov 23, 2016 at 1:55 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
>> > As I said, there is no possible special handling. Standard IB hardware
>> > does not support changing the DMA address once a MR is created. Forget
>> > about doing that.
>>
>> Yeah, that's essentially the point I was trying to make. Not to mention
>> all the other unrelated hardware that can't DMA to an address that might
>> disappear mid-transfer.
>
> Right, it is impossible to ask for generic page migration with ongoing
> DMA. That is simply not supported by any of the hardware at all.
>
>> > Only ODP hardware allows changing the DMA address on the fly, and it
>> > works at the page table level. We do not need special handling for
>> > RDMA.
>>
>> I am aware of ODP but, noted by others, it doesn't provide a general
>> solution to the points above.
>
> How do you mean?
>
> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.
>
> Page table mirroring *is* the general solution for this problem. The
> GPU driver controls the VMA and the DMA driver mirrors that VMA.
>
> Do you know of another option that doesn't just degenerate to page
> table mirroring??
>
> Remember, there are two facets to the RDMA ODP implementation, I feel
> there is some confusion here..
>
> The crucial part for this discussion is the ability to fence and block
> DMA for a specific range. This is the hardware capability that lets
> page migration happen: fence&block DMA, migrate page, update page
> table in HCA, unblock DMA.

Wait, ODP requires migratable pages, and ZONE_DEVICE pages are not
migratable. You can't replace a PCIe mapping with just any other
system RAM physical address, right? At least not without a filesystem
recording where things went, but at that point we're no longer talking
about the base P2P-DMA mapping mechanism and are instead talking about
something like pnfs-rdma to a DAX filesystem.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 22:42                     ` Dan Williams
@ 2016-11-23 23:25                       ` Jason Gunthorpe
  2016-11-24  9:45                         ` Christian König
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-23 23:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Serguei Sagalovitch, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran

On Wed, Nov 23, 2016 at 02:42:12PM -0800, Dan Williams wrote:
> > The crucial part for this discussion is the ability to fence and block
> > DMA for a specific range. This is the hardware capability that lets
> > page migration happen: fence&block DMA, migrate page, update page
> > table in HCA, unblock DMA.
> 
> Wait, ODP requires migratable pages, ZONE_DEVICE pages are not
> migratable.

Does it? I didn't think so.. Does ZONE_DEVICE break MMU notifiers/etc
or something? There is certainly nothing about the hardware that cares
about ZONE_DEVICE vs System memory.

I used 'migration' in the broader sense of doing any transformation to
the page such that the DMA address changes - not the specific kernel
MM process...

> You can't replace a PCIe mapping with just any other System RAM
> physical address, right?

I thought that was exactly what HMM was trying to do? Migrate pages
between CPU and GPU memory as needed. As Serguei has said this process
needs to be driven by the GPU driver.

The peer-peer issue is how do you do that while RDMA is possible on
those pages, because when the page migrates to GPU memory you want the
RDMA to follow it seamlessly.

This is why page table mirroring is the best solution - use the
existing mm machinery to link the DMA driver and whatever is
controlling the VMA.

> At least not without a filesystem recording where things went, but
> at that point we're no longer talking about the base P2P-DMA mapping

In the filesystem/DAX case, it would be the filesystem that initiates
any change in the page physical address.

ODP *follows* changes in the VMA it does not cause any change in
address mapping. That has to be done by whoever is in charge of the
VMA.

> something like pnfs-rdma to a DAX filesystem.

Something in the kernel (ie nfs-rdma) would be entirely different. We
generally don't do long lived mappings in the kernel for RDMA
(certainly not for NFS), so it is much more like your basic every day
DMA operation: map, execute, unmap. We probably don't need to use page
table mirroring for this.

ODP comes in when userspace mmaps a DAX file and then tries to use it
for RDMA. Page table mirroring lets the DAX filesystem decide to move
the backing pages at any time. When it wants to do that it interacts
with the MM in the usual way, which links to ODP and makes sure the
migration is seamless.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 21:55                   ` Jason Gunthorpe
  2016-11-23 22:42                     ` Dan Williams
@ 2016-11-24  0:40                     ` Sagalovitch, Serguei
  2016-11-24 16:24                       ` Jason Gunthorpe
  2016-11-24  1:25                     ` Logan Gunthorpe
  2 siblings, 1 reply; 126+ messages in thread
From: Sagalovitch, Serguei @ 2016-11-24  0:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: Dan Williams, Deucher, Alexander, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, Bridgman, John,
	linux-kernel, dri-devel, Koenig, Christian, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran

On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:

> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.

The main issue is how to solve use cases where p2p is
requested/initiated via CPU pointers and such pointers could
point to a non-system memory location, e.g. VRAM.

Solving it would provide a consistent programming model where the user
deals only with pointers (HSA, CUDA, OpenCL 2.0 SVM), as well as a
performance optimization by avoiding double-buffering and extra
special-case code when dealing with PCIe device memory.

Examples are:

- RDMA network operations: RDMA MRs where the registered memory could be
e.g. VRAM.  Currently this is solved using the so-called PeerDirect
interface, which is out-of-tree and provided as part of OFED.
- File operations (fread/fwrite) where the user wants to transfer file data
directly to/from e.g. VRAM.


Challenges are:
- Because the graphics sub-system must support overcommit (at least each
application/process should independently see all resources), ideally such
memory should be movable without changing the CPU pointer value, as well
as "paged out", supporting a "page fault" at least on access from the CPU.
- We must co-exist with the existing DRM infrastructure, as well as
support sharing VRAM memory between different processes.
- We should be able to deal with large allocations: tens or hundreds of
MBs, or maybe GBs.
- We may have PCIe devices where p2p does not work.
- Potentially any GPU memory should be supported, including memory carved
out from system RAM (e.g. allocated via get_free_pages()).


Note:
- In the case of RDMA MRs the life-span of the "pinning"
(get_user_pages/put_page) may be defined/controlled by the
application, not the kernel, which maybe should be
treated differently, as a special case.
  

The original proposal was to create "struct pages" for VRAM memory
to allow "get_user_pages" to work transparently, similar to
how it is/was done for the "DAX device" case. Unfortunately,
based on my understanding, the "DAX device" implementation
deals only with permanently "locked" memory (fixed location),
unrelated to the "get_user_pages"/"put_page" scope,
which doesn't satisfy the requirements for "eviction"/"moving" of
memory while keeping the CPU address intact.

> The desire is for DMA to continue to work
> even after these migrations happen
At least some kind of mm notifier callback to inform about a change
in location (pre- and post-), similar to how it is done for system pages.
My understanding is that it will not solve the RDMA MR issue, where the
"lock" could last the whole application lifetime, but (a) it will not make
the RDMA MR case worse and (b) it should be enough for all other cases of
"get_user_pages"/"put_page" controlled by the kernel.
 
 

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 21:55                   ` Jason Gunthorpe
  2016-11-23 22:42                     ` Dan Williams
  2016-11-24  0:40                     ` Sagalovitch, Serguei
@ 2016-11-24  1:25                     ` Logan Gunthorpe
  2016-11-24 16:42                       ` Jason Gunthorpe
  2 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-11-24  1:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serguei Sagalovitch, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran



On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
>>> Only ODP hardware allows changing the DMA address on the fly, and it
>>> works at the page table level. We do not need special handling for
>>> RDMA.
>>
>> I am aware of ODP but, noted by others, it doesn't provide a general
>> solution to the points above.
> 
> How do you mean?

I was only saying it wasn't general in that it wouldn't work for IB
hardware that doesn't support ODP or other hardware  that doesn't do
similar things (like an NVMe drive).

It makes sense for hardware that supports ODP to allow MRs to not pin
the underlying memory and provide for migrations that the hardware can
follow. But most DMA engines will require the memory to be pinned and
any complex allocators (GPU or otherwise) should respect that. And that
seems like it should be the default way most of this works -- and I
think it wouldn't actually take too much effort to make it all work now
as is. (Our iopmem work is actually quite small and simple.)

>> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
>> really the same option. iopmem is really just one way to get BAR
>> addresses to user-space while inside the kernel it's ZONE_DEVICE.
> 
> Seems fine for RDMA?

Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
memory working for some time. I'd say it's a good fit. The main question
we've had is how to expose PCIe bars to userspace to be used as MRs and
such.


Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-23 23:25                       ` Jason Gunthorpe
@ 2016-11-24  9:45                         ` Christian König
  2016-11-24 16:26                           ` Jason Gunthorpe
  2016-11-24 17:55                           ` Logan Gunthorpe
  0 siblings, 2 replies; 126+ messages in thread
From: Christian König @ 2016-11-24  9:45 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Logan Gunthorpe, Serguei Sagalovitch, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran

Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:
> There is certainly nothing about the hardware that cares
> about ZONE_DEVICE vs System memory.
Well that is clearly not so simple. When your ZONE_DEVICE pages describe
a PCI BAR and another PCI device initiates a DMA to this address, the DMA
subsystem must be able to check whether the interconnection really works.

E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE.
Now PCI device B (a SATA device) can directly read/write to it because
it is on the same bus segment, but PCI device C (a network card for
example) can't because it is on a different bus segment and the bridge
can't handle P2P transactions.

We need to be able to handle such cases and fall back to bounce
buffers, but I don't see that in the DMA subsystem right now.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24  0:40                     ` Sagalovitch, Serguei
@ 2016-11-24 16:24                       ` Jason Gunthorpe
  0 siblings, 0 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-24 16:24 UTC (permalink / raw)
  To: Sagalovitch, Serguei
  Cc: Logan Gunthorpe, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran

On Thu, Nov 24, 2016 at 12:40:37AM +0000, Sagalovitch, Serguei wrote:
> On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
> 
> > Perhaps I am not following what Serguei is asking for, but I
> > understood the desire was for a complex GPU allocator that could
> > migrate pages between GPU and CPU memory under control of the GPU
> > driver, among other things. The desire is for DMA to continue to work
> > even after these migrations happen.
> 
> The main issue is how to solve use cases where p2p is
> requested/initiated via CPU pointers and such pointers could
> point to a non-system memory location, e.g. VRAM.

Okay, but your list is conflating a whole bunch of problems..

 1) How to go from a __user pointer to a p2p DMA address
  a) How to validate, setup iommu and maybe worst case bounce buffer
     these p2p DMAs
 2) How to allow drivers (ie GPU allocator) dynamically
    remap pages in a VMA to/from p2p DMA addresses
 3) How to expose uncachable p2p DMA address to user space via mmap

> to allow "get_user_pages"  to work transparently similar 
> how it is/was done for "DAX Device" case. Unfortunately 
> based on my understanding "DAX Device" implementation 
> deal only with permanently  "locked" memory  (fixed location) 
> unrelated to "get_user_pages"/"put_page" scope  
> which doesn't satisfy requirements  for "eviction" / "moving" of 
> memory keeping CPU address intact.  

Hurm, isn't that issue with DAX only to do with being coherent with
the page cache?

A GPU allocator would not use the page cache, it would have to
construct VMAs some other way.
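
For example, a driver typically builds such a VMA in its mmap handler
(minimal sketch; note that a raw pfn mapping like this has no struct
pages behind it, so by itself it does not make get_user_pages() work on
the range - "struct mydev" and its fields are hypothetical):

#include <linux/fs.h>
#include <linux/mm.h>

struct mydev {                          /* hypothetical driver state */
        phys_addr_t bar_phys;
        resource_size_t bar_size;
};

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct mydev *dev = file->private_data;
        unsigned long size = vma->vm_end - vma->vm_start;

        if (size > dev->bar_size)
                return -EINVAL;

        /* device memory: map it write-combined, bypassing the page cache */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
        return remap_pfn_range(vma, vma->vm_start,
                               dev->bar_phys >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
}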

> My understanding is that It will not solve RDMA MR issue where "lock" 
> could be during the whole  application life but  (a) it will not make 
> RDMA MR case worse  (b) should be enough for all other cases for 
> "get_user_pages"/"put_page" controlled by  kernel.

Right. There is no solution to the RDMA MR issue on old hardware. Apps
that are using GPU+RDMA+Old hardware will have to use short lived MRs
and pay that performance cost, or give up on migration.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24  9:45                         ` Christian König
@ 2016-11-24 16:26                           ` Jason Gunthorpe
  2016-11-24 17:00                             ` Serguei Sagalovitch
  2016-11-24 17:55                           ` Logan Gunthorpe
  1 sibling, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-24 16:26 UTC (permalink / raw)
  To: Christian König
  Cc: Dan Williams, Logan Gunthorpe, Serguei Sagalovitch, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Sander,
	Ben, Suthikulpanit, Suravee, Blinzer, Paul, Linux-media,
	Haggai Eran

On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:
> Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:
> >There is certainly nothing about the hardware that cares
> >about ZONE_DEVICE vs System memory.
> Well that is clearly not so simple. When your ZONE_DEVICE pages describe a
> PCI BAR and another PCI device initiates a DMA to this address the DMA
> subsystem must be able to check if the interconnection really works.

I said the hardware doesn't care.. You are right, we still have an
outstanding problem in Linux of how to generically DMA map a P2P
address - which is a different issue from getting the P2P address from
a __user pointer...

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24  1:25                     ` Logan Gunthorpe
@ 2016-11-24 16:42                       ` Jason Gunthorpe
  2016-11-24 18:11                         ` Logan Gunthorpe
  2016-11-25 13:22                         ` Christian König
  0 siblings, 2 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-24 16:42 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Serguei Sagalovitch, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran

On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
> >>> Only ODP hardware allows changing the DMA address on the fly, and it
> >>> works at the page table level. We do not need special handling for
> >>> RDMA.
> >>
> >> I am aware of ODP but, noted by others, it doesn't provide a general
> >> solution to the points above.
> > 
> > How do you mean?
> 
> I was only saying it wasn't general in that it wouldn't work for IB
> hardware that doesn't support ODP or other hardware  that doesn't do
> similar things (like an NVMe drive).

There are three cases to worry about:
 - Coherent long lived page table mirroring (RDMA ODP MR)
 - Non-coherent long lived page table mirroring (RDMA MR)
 - Short lived DMA mapping (everything else)

Like you say below we have to handle short lived in the usual way, and
that covers basically every device except IB MRs, including the
command queue on a NVMe drive.

> any complex allocators (GPU or otherwise) should respect that. And that
> seems like it should be the default way most of this works -- and I
> think it wouldn't actually take too much effort to make it all work now
> as is. (Our iopmem work is actually quite small and simple.)

Yes, absolutely, some kind of page pinning like locking is a hard
requirement.

> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
> memory working for some time. I'd say it's a good fit. The main question
> we've had is how to expose PCIe bars to userspace to be used as MRs and
> such.

Is there any progress on that?

I still don't quite get what iopmem was about.. I thought the
objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
over iopmem and still ending up with uncacheable mmaps still seems
like a non-starter to me...

Serguei, what is your plan in GPU land for migration? Ie if I have a
CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
- do you still allow the CPU to access it? Or do you swap it back to
cachable memory if the CPU touches it?

One approach might be to mmap the uncachable ZONE_DEVICE memory and
mark it inaccessible to the CPU - DMA could still translate. If the
CPU needs it then the kernel migrates it to system memory so it
becomes cachable. ??

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24 16:26                           ` Jason Gunthorpe
@ 2016-11-24 17:00                             ` Serguei Sagalovitch
  0 siblings, 0 replies; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-24 17:00 UTC (permalink / raw)
  To: Jason Gunthorpe, Christian König
  Cc: Dan Williams, Logan Gunthorpe, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran


On 2016-11-24 11:26 AM, Jason Gunthorpe wrote:
> On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:
>> Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:
>>> There is certainly nothing about the hardware that cares
>>> about ZONE_DEVICE vs System memory.
>> Well that is clearly not so simple. When your ZONE_DEVICE pages describe a
>> PCI BAR and another PCI device initiates a DMA to this address the DMA
>> subsystem must be able to check if the interconnection really works.
> I said the hardware doesn't care.. You are right, we still have an
> outstanding problem in Linux of how to generically DMA map a P2P
> address - which is a different issue from getting the P2P address from
> a __user pointer...
>
> Jason
I agree, but the problem is that one issue immediately introduces another
one to solve, and so on (if we do not want to cut corners). I would think
that a lot of them are interconnected, because the way one problem is
solved may impact the solution for another.

btw: about "DMA map a p2p address": right now, to enable p2p between
devices it is required/recommended to disable IOMMU support (e.g. the
Intel IOMMU driver has special logic for graphics and the comment
"Reserve all PCI MMIO to avoid peer-to-peer access").

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24  9:45                         ` Christian König
  2016-11-24 16:26                           ` Jason Gunthorpe
@ 2016-11-24 17:55                           ` Logan Gunthorpe
  2016-11-25 13:06                             ` Christian König
  1 sibling, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-11-24 17:55 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe, Dan Williams
  Cc: Serguei Sagalovitch, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran

Hey,

On 24/11/16 02:45 AM, Christian König wrote:
> E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE.
> Now PCI device B (a SATA device) can directly read/write to it because
> it is on the same bus segment, but PCI device C (a network card for
> example) can't because it is on a different bus segment and the bridge
> can't handle P2P transactions.

Yeah, that could be an issue but in our experience we have yet to see
it. We've tested with two separate PCI buses on different CPUs connected
through QPI links and it works fine. (It is rather slow, but I understand
Intel has improved the bottleneck in CPUs newer than the ones we tested.)

It may just be older hardware that has this issue. I expect that, as long
as a failed transfer can be handled gracefully by the initiator, I don't
see a need to predetermine whether a device can see another device's memory.


Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24 16:42                       ` Jason Gunthorpe
@ 2016-11-24 18:11                         ` Logan Gunthorpe
  2016-11-25  7:58                           ` Christoph Hellwig
  2016-11-25 17:59                           ` Serguei Sagalovitch
  2016-11-25 13:22                         ` Christian König
  1 sibling, 2 replies; 126+ messages in thread
From: Logan Gunthorpe @ 2016-11-24 18:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serguei Sagalovitch, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran



On 24/11/16 09:42 AM, Jason Gunthorpe wrote:
> There are three cases to worry about:
>  - Coherent long lived page table mirroring (RDMA ODP MR)
>  - Non-coherent long lived page table mirroring (RDMA MR)
>  - Short lived DMA mapping (everything else)
> 
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.

Yes, this makes sense to me. Though I thought regular IB MRs with
regular memory currently pin the pages (despite being long lived);
that's why we can run up against the "max locked memory" limit. It
doesn't seem so terrible if GPU memory had a similar restriction until
ODP-like solutions get implemented.

>> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
>> memory working for some time. I'd say it's a good fit. The main question
>> we've had is how to expose PCIe bars to userspace to be used as MRs and
>> such.

> Is there any progress on that?

Well, I guess there's some consensus building to do. The existing
options are:

* Device DAX: which could work, but the problem I see with it is that it
only allows one application to do these transfers. Or there would have
to be some user-space coordination to figure out which application gets
what memory.

* Regular DAX in the FS doesn't work at this time because the FS can
move the file you think you're transferring to out from under you. Though I
understand there's been some work with XFS to solve that issue.

Though, we've been considering that the backing memory would be
non-volatile, which adds some of this complexity. If the memory were
volatile the kernel would just need to do some relatively straightforward
allocation to user-space when asked. For example, with NVMe, the
kernel could give chunks of the CMB buffer to userspace via an mmap call
to /dev/nvmeX. Though I think there's been some pushback against things
like that as well.
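
Purely hypothetically, following the /dev/nvmeX idea above, user space
would end up doing something like this (the device node and offset
semantics are made up for illustration):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/nvme0", O_RDWR);      /* hypothetical node */
        size_t len = 2 * 1024 * 1024;             /* 2 MB of CMB */
        void *cmb;

        if (fd < 0)
                return 1;
        cmb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (cmb == MAP_FAILED)
                return 1;
        /* ... register cmb/len with an RDMA or storage API for p2p ... */
        munmap(cmb, len);
        close(fd);
        return 0;
}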

> I still don't quite get what iopmem was about.. I thought the
> objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
> over iopmem and still ending up with uncacheable mmaps still seems
> like a non-starter to me...

The latest incarnation of iopmem simply created a block device backed by
ZONE_DEVICE memory on a PCIe BAR. We then put a DAX FS on it and
user-space could mmap the files and send them to other devices to do P2P
transfers.
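
Roughly, getting ZONE_DEVICE struct pages for a BAR boils down to a
devm_memremap_pages() call over the BAR resource - a sketch assuming the
current 4.x-era interface (the percpu_ref lifetime handling is the
driver's problem and is omitted here):

/* Create ZONE_DEVICE struct pages covering a PCIe BAR so the rest of
 * the kernel (block layer, RDMA, O_DIRECT) can treat it like memory
 * that has struct pages behind it. */
static void *map_bar_as_zone_device(struct pci_dev *pdev, int bar,
                                    struct percpu_ref *ref)
{
        struct resource *res = &pdev->resource[bar];

        return devm_memremap_pages(&pdev->dev, res, ref, NULL);
}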

I don't think there was a hard objection to uncachable ZONE_DEVICE and
DAX. We did try our experimental hardware with cached ZONE_DEVICE and it
did work but the performance was beyond unusable (which may be a
hardware issue). In the end I feel the driver would have to decide the
most appropriate caching for the hardware and I don't understand why WC
or UC wouldn't work with ZONE_DEVICE.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24 18:11                         ` Logan Gunthorpe
@ 2016-11-25  7:58                           ` Christoph Hellwig
  2016-11-25 19:41                             ` Jason Gunthorpe
  2016-11-25 17:59                           ` Serguei Sagalovitch
  1 sibling, 1 reply; 126+ messages in thread
From: Christoph Hellwig @ 2016-11-25  7:58 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Serguei Sagalovitch, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran

On Thu, Nov 24, 2016 at 11:11:34AM -0700, Logan Gunthorpe wrote:
> * Regular DAX in the FS doesn't work at this time because the FS can
> move the file you think you're transferring to out from under you. Though I
> understand there's been some work with XFS to solve that issue.

The file system will never move anything under locked-down pages;
locking down pages is used exactly to protect against that.  So as long
as we have page structures available, RDMA to/from device memory _from
kernel space_ is trivial, although for file systems to work properly you
really want a notification to the consumer if the file system wants
to remove the mapping.  We have implemented that using FL_LAYOUTS locks
for NFSD, but only XFS supports it so far.  Without that, a long-term
locked-down region of memory (e.g. a kernel MR) would block various
file operations, which would simply hang.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24 17:55                           ` Logan Gunthorpe
@ 2016-11-25 13:06                             ` Christian König
  2016-11-25 16:45                               ` Logan Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Christian König @ 2016-11-25 13:06 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe, Dan Williams
  Cc: Serguei Sagalovitch, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran

Am 24.11.2016 um 18:55 schrieb Logan Gunthorpe:
> Hey,
>
> On 24/11/16 02:45 AM, Christian König wrote:
>> E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE.
>> Now PCI device B (a SATA device) can directly read/write to it because
>> it is on the same bus segment, but PCI device C (a network card for
>> example) can't because it is on a different bus segment and the bridge
>> can't handle P2P transactions.
> Yeah, that could be an issue but in our experience we have yet to see
> it. We've tested with two separate PCI buses on different CPUs connected
> through QPI links and it works fine. (It is rather slow but I understand
> Intel has improved the bottleneck in newer CPUs than the ones we tested.)

Well, Serguei sent me a couple of documents about QPI when we started to
discuss this internally as well, and that's exactly one of the cases I
had in mind when writing this.

If I understood it correctly, for such systems P2P is technically possible,
but not necessarily a good idea. Usually it is faster to just use a
bounce buffer when the peers are a bit "farther" apart.

That this problem is solved on newer hardware is good, but it doesn't help
us at all if we want to support at least systems from the last five
years or so.

> It may just be older hardware that has this issue. I expect that as long
> as a failed transfer can be handled gracefully by the initiator I don't
> see a need to predetermine whether a device can see another devices memory.

I don't want to predetermine whether a device can see another device's
memory at get_user_pages() time.

My thinking was more going in the direction of a whitelist to figure
out at dma_map_single()/dma_map_sg() time whether we should use a
bounce buffer or not.
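
Purely to illustrate where such a check could sit (both helpers here are
hypothetical):

/* At map time, decide whether the initiator can reach the target's
 * memory directly, or whether we must bounce through system RAM.
 * p2p_whitelisted() and bounce_map_page() do not exist today. */
static dma_addr_t p2p_dma_map_page(struct device *initiator,
                                   struct device *target,
                                   struct page *p2p_page, size_t size,
                                   enum dma_data_direction dir)
{
        if (p2p_whitelisted(initiator, target))
                return dma_map_page(initiator, p2p_page, 0, size, dir);

        return bounce_map_page(initiator, p2p_page, size, dir);
}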

Christian.

>
>
> Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24 16:42                       ` Jason Gunthorpe
  2016-11-24 18:11                         ` Logan Gunthorpe
@ 2016-11-25 13:22                         ` Christian König
  2016-11-25 17:16                           ` Serguei Sagalovitch
  2016-11-25 19:32                           ` Jason Gunthorpe
  1 sibling, 2 replies; 126+ messages in thread
From: Christian König @ 2016-11-25 13:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: Serguei Sagalovitch, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran

Am 24.11.2016 um 17:42 schrieb Jason Gunthorpe:
> On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:
>>
>> On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
>>>>> Only ODP hardware allows changing the DMA address on the fly, and it
>>>>> works at the page table level. We do not need special handling for
>>>>> RDMA.
>>>> I am aware of ODP but, noted by others, it doesn't provide a general
>>>> solution to the points above.
>>> How do you mean?
>> I was only saying it wasn't general in that it wouldn't work for IB
>> hardware that doesn't support ODP or other hardware  that doesn't do
>> similar things (like an NVMe drive).
> There are three cases to worry about:
>   - Coherent long lived page table mirroring (RDMA ODP MR)
>   - Non-coherent long lived page table mirroring (RDMA MR)
>   - Short lived DMA mapping (everything else)
>
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.

Well, a problem which wasn't mentioned so far is that while GPUs do have
a page table to mirror the CPU page table, they usually can't recover
from page faults.

So what we do is make sure that all memory accessed by the GPU jobs
stays in place while those jobs run (pretty much the same pinning you do
for the DMA).

But since this can lock down huge amounts of memory, the whole command
submission to GPUs is bound to the memory management. So when too much
memory would get blocked by the GPU, we block further command submissions
until the situation resolves.

>> any complex allocators (GPU or otherwise) should respect that. And that
>> seems like it should be the default way most of this works -- and I
>> think it wouldn't actually take too much effort to make it all work now
>> as is. (Our iopmem work is actually quite small and simple.)
> Yes, absolutely, some kind of page pinning like locking is a hard
> requirement.
>
>> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
>> memory working for some time. I'd say it's a good fit. The main question
>> we've had is how to expose PCIe bars to userspace to be used as MRs and
>> such.
> Is there any progress on that?
>
> I still don't quite get what iopmem was about.. I thought the
> objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
> over iopmem and still ending up with uncacheable mmaps still seems
> like a non-starter to me...
>
> Serguei, what is your plan in GPU land for migration? Ie if I have a
> CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
> - do you still allow the CPU to access it? Or do you swap it back to
> cachable memory if the CPU touches it?

Depends on the policy in command, but currently it's the other way 
around most of the time.

E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids 
reading because that is slow, the GPU in turn can access it with full speed.

When we run out of VRAM we move those allocations to system memory and 
update both the CPU as well as the GPU page tables.

So that move is transparent for both userspace as well as shaders 
running on the GPU.

> One approach might be to mmap the uncachable ZONE_DEVICE memory and
> mark it inaccessible to the CPU - DMA could still translate. If the
> CPU needs it then the kernel migrates it to system memory so it
> becomes cachable. ??

The whole purpose of this effort is that we can do I/O on VRAM directly
without migrating everything back to system memory.

Allowing this, but then doing the migration on the first CPU touch, is
clearly not a good idea.

Regards,
Christian.

>
> Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 13:06                             ` Christian König
@ 2016-11-25 16:45                               ` Logan Gunthorpe
  2016-11-25 17:20                                 ` Serguei Sagalovitch
  0 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-11-25 16:45 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe, Dan Williams
  Cc: Serguei Sagalovitch, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran



On 25/11/16 06:06 AM, Christian König wrote:
> Well, Serguei sent me a couple of documents about QPI when we started to
> discuss this internally as well, and that's exactly one of the cases I
> had in mind when writing this.
> 
> If I understood it correctly, for such systems P2P is technically possible,
> but not necessarily a good idea. Usually it is faster to just use a
> bounce buffer when the peers are a bit "farther" apart.
> 
> That this problem is solved on newer hardware is good, but it doesn't help
> us at all if we want to support at least systems from the last five
> years or so.

Well, we have been testing with Sandy Bridge. I think the problem was
supposed to be fixed in Ivy Bridge, but we never tested it so I can't say
what the performance turned out to be. Ivy Bridge is nearly 5 years old.
I expect this is something that will be improved more and more with
subsequent generations.

A whitelist may end up being rather complicated if it has to cover
different CPU generations and system architectures. I feel this is a
decision user space could easily make.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 13:22                         ` Christian König
@ 2016-11-25 17:16                           ` Serguei Sagalovitch
  2016-11-25 19:34                             ` Jason Gunthorpe
  2016-11-25 19:32                           ` Jason Gunthorpe
  1 sibling, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-25 17:16 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe, Logan Gunthorpe
  Cc: Dan Williams, Deucher, Alexander, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, Bridgman, John,
	linux-kernel, dri-devel, Sander, Ben, Suthikulpanit, Suravee,
	Blinzer, Paul, Linux-media, Haggai Eran

On 2016-11-25 08:22 AM, Christian König wrote:
>
>> Serguei, what is your plan in GPU land for migration? Ie if I have a
>> CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
>> - do you still allow the CPU to access it? Or do you swap it back to
>> cachable memory if the CPU touches it?
>
> Depends on the policy in command, but currently it's the other way 
> around most of the time.
>
> E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids 
> reading because that is slow, the GPU in turn can access it with full 
> speed.
>
> When we run out of VRAM we move those allocations to system memory and 
> update both the CPU as well as the GPU page tables.
>
> So that move is transparent for both userspace as well as shaders 
> running on the GPU.
I would like to add more in relation to CPU access:

a) We could have a CPU-accessible part of VRAM ("inside" the PCIe BAR
aperture) and a non-CPU-accessible part.  As a result, if the user needs
CPU access then the memory should be located in the CPU-accessible part
of VRAM or in system memory.

The application/user-mode driver could specify preferences/hints for
locations based on its assumptions/knowledge about access-pattern
requirements, game resolution, knowledge about the size of VRAM, etc.
So if CPU access performance is critical, then such memory should be
allocated in system memory as the first (and maybe only) choice.

b) An allocation may not have a CPU address at all - only a GPU one.
Also, we may not be able to have CPU addresses/accesses for all VRAM,
but memory may still be migrated in any case, regardless of whether
we have a CPU address or not.

c) "VRAM, it becomes non-cachable":
Strictly speaking, VRAM is configured as WC (write-combined memory) to
provide fast CPU write access. Also, it was found that sometimes, if CPU
access is not critical from a performance perspective, it may be useful
to allocate/program system memory as WC too, to avoid the need for
extra "snooping" to synchronize with CPU caches during GPU access.
So potentially system memory could be WC too (a minimal mapping sketch
follows below).
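
For reference, the WC setup in (c) is just the usual write-combine
mapping; a minimal kernel-side sketch, assuming BAR 0 holds the
CPU-visible part of VRAM:

/* Map the CPU-visible VRAM aperture write-combined.  User mappings of
 * the same range would typically use pgprot_writecombine() in the
 * driver's mmap handler. */
static void __iomem *map_vram_wc(struct pci_dev *pdev)
{
        return ioremap_wc(pci_resource_start(pdev, 0),
                          pci_resource_len(pdev, 0));
}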

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 16:45                               ` Logan Gunthorpe
@ 2016-11-25 17:20                                 ` Serguei Sagalovitch
  2016-11-25 20:26                                   ` Felix Kuehling
  0 siblings, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-25 17:20 UTC (permalink / raw)
  To: Logan Gunthorpe, Christian König, Jason Gunthorpe, Dan Williams
  Cc: Deucher, Alexander, linux-nvdimm@lists.01.org, linux-rdma,
	linux-pci, Kuehling, Felix, Bridgman, John, linux-kernel,
	dri-devel, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran


> A white list may end up being rather complicated if it has to cover
> different CPU generations and system architectures. I feel this is a
> decision user space could easily make.
>
> Logan
I agree that it is better to leave it up to user space to check what is
working and what is not. I found that writes practically always work, but
reads very often do not. Also, sometimes a system BIOS update could fix
the issue.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-24 18:11                         ` Logan Gunthorpe
  2016-11-25  7:58                           ` Christoph Hellwig
@ 2016-11-25 17:59                           ` Serguei Sagalovitch
  1 sibling, 0 replies; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-25 17:59 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe
  Cc: Dan Williams, Deucher, Alexander, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, Bridgman, John,
	linux-kernel, dri-devel, Koenig, Christian, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran


> Well, I guess there's some consensus building to do. The existing
> options are:
>
> * Device DAX: which could work, but the problem I see with it is that it
> only allows one application to do these transfers. Or there would have
> to be some user-space coordination to figure out which application gets
> what memory.
About the one-application restriction: so it is per memory mapping? I
assume that it should not be a problem for one application to do transfers
to several devices simultaneously? Am I right?

Maybe we should follow the RDMA MR design and register memory for p2p
transfer from user space?

What about the following (a rough sketch of the interface is below):

a) A device DAX is created.
b) A "normal" (movable, etc.) allocation is done for PCIe memory and a
CPU pointer/access is requested.
c) p2p_mr_register() is called and a CPU pointer (mmap() on the DAX
device) is returned. Accordingly, such memory is marked as "unmovable"
by e.g. the graphics driver.
d) When p2p is no longer needed, p2p_mr_unregister() is called.
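
A rough sketch of the interface (all names follow the proposal above and
are purely illustrative - nothing like this exists today):

struct p2p_mr;

/* Mark the device memory backing [cpu_addr, cpu_addr + len) as
 * unmovable and return a handle the peer driver can DMA-map. */
struct p2p_mr *p2p_mr_register(void __user *cpu_addr, size_t len);

/* Allow the memory to be moved/evicted again. */
void p2p_mr_unregister(struct p2p_mr *mr);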

What do you think? Will it work?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 13:22                         ` Christian König
  2016-11-25 17:16                           ` Serguei Sagalovitch
@ 2016-11-25 19:32                           ` Jason Gunthorpe
  2016-11-25 20:40                             ` Christian König
                                               ` (2 more replies)
  1 sibling, 3 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-25 19:32 UTC (permalink / raw)
  To: Christian König
  Cc: Logan Gunthorpe, Serguei Sagalovitch, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Sander,
	Ben, Suthikulpanit, Suravee, Blinzer, Paul, Linux-media,
	Haggai Eran

On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:

> >Like you say below we have to handle short lived in the usual way, and
> >that covers basically every device except IB MRs, including the
> >command queue on a NVMe drive.
> 
> Well a problem which wasn't mentioned so far is that while GPUs do have a
> page table to mirror the CPU page table, they usually can't recover from
> page faults.

> So what we do is making sure that all memory accessed by the GPU Jobs stays
> in place while those jobs run (pretty much the same pinning you do for the
> DMA).

Yes, it is DMA, so this is a valid approach.

But, you don't need page faults from the GPU to do proper coherent
page table mirroring. Basically when the driver submits the work to
the GPU it 'faults' the pages into the CPU and mirror translation
table (instead of pinning).

Like in ODP, MMU notifiers/HMM are used to monitor for translation
changes. If a change comes in the GPU driver checks if an executing
command is touching those pages and blocks the MMU notifier until the
command flushes, then unfaults the page (blocking future commands) and
unblocks the mmu notifier.

The code moving the page will move it and the next GPU command that
needs it will refault it in the usual way, just like the CPU would.

This might be much more efficient since it optimizes for the common
case of unchanging translation tables.

This assumes the commands are fairly short lived of course; the
expectation of the mmu notifiers is that a flush is reasonably prompt.
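
In driver terms, a stripped-down sketch of that flow might look as follows
(the gpu_* helpers are hypothetical stand-ins for whatever the driver
provides; the mmu_notifier prototypes shown are the ~4.8-era ones and keep
changing):

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

struct gpu_mirror {
	struct mmu_notifier mn;
	/* driver-private mirror / scheduler state ... */
};

/* hypothetical driver helpers */
void gpu_block_new_submissions(struct gpu_mirror *m, unsigned long s, unsigned long e);
void gpu_wait_for_range_idle(struct gpu_mirror *m, unsigned long s, unsigned long e);
void gpu_unmap_mirror_range(struct gpu_mirror *m, unsigned long s, unsigned long e);

static void gpu_mn_invalidate_range_start(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end)
{
	struct gpu_mirror *m = container_of(mn, struct gpu_mirror, mn);

	/* Block new submissions that could touch [start, end) ... */
	gpu_block_new_submissions(m, start, end);
	/* ... wait for in-flight commands on the range to flush ... */
	gpu_wait_for_range_idle(m, start, end);
	/* ... then drop the GPU-side translations so the next command
	 * refaults the pages, just like the CPU would. */
	gpu_unmap_mirror_range(m, start, end);
}

static const struct mmu_notifier_ops gpu_mn_ops = {
	.invalidate_range_start	= gpu_mn_invalidate_range_start,
};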

> >Serguei, what is your plan in GPU land for migration? Ie if I have a
> >CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
> >- do you still allow the CPU to access it? Or do you swap it back to
> >cachable memory if the CPU touches it?
> 
> Depends on the policy in command, but currently it's the other way around
> most of the time.
> 
> E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids reading
> because that is slow, the GPU in turn can access it with full speed.
> 
> When we run out of VRAM we move those allocations to system memory and
> update both the CPU as well as the GPU page tables.
> 
> So that move is transparent for both userspace as well as shaders running on
> the GPU.

That makes sense to me, but the objection that came back for
non-cacheable CPU mappings is that it basically breaks too much stuff
subtly: e.g. atomics, unaligned accesses and the CPU threading memory
model all change on various architectures and break when caching is
disabled.

IMHO that is OK for specialty things like the GPU where the mmap comes
in via drm or something and apps know to handle that buffer specially.

But it is certainly not OK for DAX, where an application coded for a
normal file open()/mmap() is not prepared for a mapping where (e.g.)
unaligned read accesses or atomics don't work depending on how the
filesystem is set up.

Which is why I think iopmem is still problematic..

At the very least I think an mmap flag or open flag should be needed to
opt into this behavior, and by default non-cacheable DAX mmaps should
be paged into system RAM when the CPU accesses them.
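
On the application side that could be as simple as something like this
(MAP_P2P_IO is a made-up flag purely to illustrate the opt-in; no such
flag exists):

#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define MAP_P2P_IO	0x80000		/* hypothetical, illustration only */

static void *map_device_pages(const char *path, size_t len)
{
	void *p;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return MAP_FAILED;
	/* Without the opt-in flag the default would be to fault the pages
	 * into cacheable system RAM on CPU access instead of exposing the
	 * non-cacheable device memory directly. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_P2P_IO, fd, 0);
	close(fd);
	return p;
}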

I'm hearing most people say ZONE_DEVICE is the way to handle this,
which means the missing remaining piece for RDMA is some kind of DMA
core support for p2p address translation..

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 17:16                           ` Serguei Sagalovitch
@ 2016-11-25 19:34                             ` Jason Gunthorpe
  2016-11-25 19:49                               ` Serguei Sagalovitch
  2016-11-25 23:41                               ` Alex Deucher
  0 siblings, 2 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-25 19:34 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: Christian König, Logan Gunthorpe, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Sander,
	Ben, Suthikulpanit, Suravee, Blinzer, Paul, Linux-media,
	Haggai Eran

On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:

> b) Allocation may not  have CPU address  at all - only GPU one.

But you don't expect RDMA to work in that case, right?

GPU people need to stop doing this windowed memory stuff :)

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25  7:58                           ` Christoph Hellwig
@ 2016-11-25 19:41                             ` Jason Gunthorpe
  0 siblings, 0 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-25 19:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Logan Gunthorpe, Serguei Sagalovitch, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Koenig,
	Christian, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Haggai Eran

On Thu, Nov 24, 2016 at 11:58:17PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 24, 2016 at 11:11:34AM -0700, Logan Gunthorpe wrote:
> > * Regular DAX in the FS doesn't work at this time because the FS can
> > move the file you think your transfer to out from under you. Though I
> > understand there's been some work with XFS to solve that issue.
> 
> The file system will never move anything under locked down pages,
> locking down pages is used exactly to protect against that.

.. And ODP style mmu notifiers work correctly as well, I'd assume.

So this should work with ZONE_DEVICE; if it doesn't, it is a filesystem
bug?

> really want a notification to the consumer if the file systems wants
> to remove the mapping.  We have implemented that using FL_LAYOUTS locks
> for NFSD, but only XFS supports it so far.  Without that a long term
> locked down region of memory (e.g. a kernel MR) would prevent various
> file operations that would simply hang.

So you imagine a signal back to user space asking user space to drop
any RDMA MRs so the FS can relocate things?

Do we need that, or should we encourage people to use either short
lived MRs or ODP MRs when working with scenarios that need FS
relocation?

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 19:34                             ` Jason Gunthorpe
@ 2016-11-25 19:49                               ` Serguei Sagalovitch
  2016-11-25 20:19                                 ` Jason Gunthorpe
  2016-11-25 23:41                               ` Alex Deucher
  1 sibling, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-25 19:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian König, Logan Gunthorpe, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Sander,
	Ben, Suthikulpanit, Suravee, Blinzer, Paul, Linux-media,
	Haggai Eran

On 2016-11-25 02:34 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:
>
>> b) Allocation may not  have CPU address  at all - only GPU one.
> But you don't expect RDMA to work in the case, right?
>
> GPU people need to stop doing this windowed memory stuff :)
The GPU can perfectly well access all of VRAM.  It is only an issue for p2p
without a special interconnect, and for CPU access. Strictly speaking, as
long as we have a "bus address" we could do RDMA, but I agree that for
RDMA we could/should(?) always "request" a CPU address (I hope that we
can forget about 32-bit applications :-)).

BTW/FYI about CPU access: some user-level APIs are mainly handle based,
so there is no need for CPU access by default.

About "visible" / non-visible VRAM parts: I assume that going forward we
will be able to get rid of the split completely as soon as support for
resizable PCI BARs is implemented and/or old/current hardware becomes
obsolete.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 19:49                               ` Serguei Sagalovitch
@ 2016-11-25 20:19                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-25 20:19 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: Christian König, Logan Gunthorpe, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Sander,
	Ben, Suthikulpanit, Suravee, Blinzer, Paul, Linux-media,
	Haggai Eran

On Fri, Nov 25, 2016 at 02:49:50PM -0500, Serguei Sagalovitch wrote:

> GPU could perfectly access all VRAM.  It is only issue for p2p without
> special interconnect and CPU access. Strictly speaking as long as we
> have "bus address"  we could have RDMA but  I agreed that for
> RDMA we could/should(?) always "request"  CPU address (I hope that we
> could forget about 32-bit application :-)).

At least on x86 if you have a bus address you have a CPU address. All
RDMAable VRAM has to be visible in the BAR.

> BTW/FYI: About CPU access: Some user-level API is mainly handle based
> so there is no need for CPU access by default.

You mean no need for the memory to be virtually mapped into the
process?

Do you expect to RDMA from this kind of API? How will that work?

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 17:20                                 ` Serguei Sagalovitch
@ 2016-11-25 20:26                                   ` Felix Kuehling
  2016-11-25 20:48                                     ` Serguei Sagalovitch
  0 siblings, 1 reply; 126+ messages in thread
From: Felix Kuehling @ 2016-11-25 20:26 UTC (permalink / raw)
  To: Serguei Sagalovitch, Logan Gunthorpe, Christian König,
	Jason Gunthorpe, Dan Williams
  Cc: Deucher, Alexander, linux-nvdimm@lists.01.org, linux-rdma,
	linux-pci, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran

On 16-11-25 12:20 PM, Serguei Sagalovitch wrote:
>
>> A white list may end up being rather complicated if it has to cover
>> different CPU generations and system architectures. I feel this is a
>> decision user space could easily make.
>>
>> Logan
> I agreed that it is better to leave up to user space to check what is
> working
> and what is not. I found that write is practically always working but
> read very
> often not. Also sometimes system BIOS update could fix the issue.
>
But is user mode always aware that P2P is going on or even possible? For
example you may have a library reading a buffer from a file, but it
doesn't necessarily know where that buffer is located (system memory,
VRAM, ...) and it may not know what kind of device the file is on
(SATA drive, NVMe SSD, ...). The library will never know if all it gets
is a pointer and a file descriptor.

The library ends up calling a read system call. Then it would be up to
the kernel to figure out the most efficient way to read the buffer from
the file. If supported, it could use P2P between a GPU and NVMe where
the NVMe device performs a DMA write to VRAM.
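
To make that concrete, the library-level code could be nothing more than
this (illustrative sketch, not from any real library):

#include <sys/types.h>
#include <unistd.h>

/* This helper only sees a pointer and an fd: it cannot tell whether 'buf'
 * is ordinary malloc'd memory or an mmap of VRAM, nor whether 'fd' refers
 * to a SATA disk or an NVMe SSD, so only the kernel is in a position to
 * choose a P2P path for the copy. */
static ssize_t load_chunk(int fd, void *buf, size_t len, off_t off)
{
	return pread(fd, buf, len, off);
}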

If you put the burden of figuring out the P2P details on user mode code,
I think it will severely limit the use cases that actually take
advantage of it. You also risk a bunch of different implementations that
get it wrong half the time on half the systems out there.

Regards,
  Felix

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 19:32                           ` Jason Gunthorpe
@ 2016-11-25 20:40                             ` Christian König
  2016-11-25 20:51                               ` Felix Kuehling
  2016-11-25 21:18                               ` Jason Gunthorpe
  2016-11-27  8:16                             ` Haggai Eran
  2016-11-27 14:02                             ` Haggai Eran
  2 siblings, 2 replies; 126+ messages in thread
From: Christian König @ 2016-11-25 20:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Christian König
  Cc: Haggai Eran, linux-rdma, linux-nvdimm@lists.01.org, Kuehling,
	Felix, Serguei Sagalovitch, linux-kernel, dri-devel, Blinzer,
	Paul, Suthikulpanit, Suravee, linux-pci, Deucher, Alexander,
	Dan Williams, Logan Gunthorpe, Sander, Ben, Linux-media

Am 25.11.2016 um 20:32 schrieb Jason Gunthorpe:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
> Yes, it is DMA, so this is a valid approach.
>
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
>
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.

Yeah, we have a function to "import" anonymous pages from a CPU pointer
which works exactly that way as well.

We call this "userptr" and it's just a combination of get_user_pages()
on command submission and making sure the returned list of pages stays
valid using an MMU notifier.

The "big" problem with this approach is that it is horribly slow. I mean
so seriously, horribly slow that we actually can't use it for some of the
purposes we wanted to use it for.
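
For reference, a stripped-down sketch of the idea (hypothetical structure
and names, not the actual amdgpu code; get_user_pages_fast() shown with
its ~4.8-era signature):

#include <linux/mm.h>

struct userptr_bo {
	unsigned long start;		/* page-aligned user VA */
	int npages;
	struct page **pages;
};

/* Called at command submission time.  A matching MMU notifier (not shown)
 * then has to wait for the job to finish before any of these pages may be
 * invalidated.  Error handling / put_page() of partial results omitted. */
static int userptr_populate(struct userptr_bo *bo)
{
	int n = get_user_pages_fast(bo->start, bo->npages, 1 /* write */,
				    bo->pages);

	return n == bo->npages ? 0 : -EFAULT;
}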

> The code moving the page will move it and the next GPU command that
> needs it will refault it in the usual way, just like the CPU would.

And here comes the problem. CPUs do this on a page-by-page basis, so they
fault in only what is needed and everything else gets filled in on demand.
As a result, faulting a page is a relatively lightweight operation.

But for GPU command submission we don't know beforehand which pages might
be accessed, so what we do is walk all possible pages and make sure all of
them are present.

Now, as far as I understand it, the I/O subsystem for example assumes that
it can easily change the CPU page tables without much overhead. So for
example when a page can't be modified it is temporarily marked as read-only
AFAIK (you are probably way deeper into this than me, so please confirm).

That absolutely kills any performance for GPU command submissions. We
have use cases where we practically ended up playing ping-pong between
the GPU driver trying to grab the page with get_user_pages() and somebody
else in the kernel marking it read-only.

> This might be much more efficient since it optimizes for the common
> case of unchanging translation tables.

Yeah, completely agree. It works perfectly fine as long as you don't 
have two drivers trying to mess with the same page.

> This assumes the commands are fairly short lived of course, the
> expectation of the mmu notifiers is that a flush is reasonably prompt

Correct, this is another problem. GFX command submissions usually don't 
take longer than a few milliseconds, but compute command submission can 
easily take multiple hours.

I can easily imagine what would happen when kswapd is blocked by a GPU 
command submission for an hour or so while the system is under memory 
pressure :)

I've been thinking about this problem for about a year now and going in
circles for quite a while. So if you have ideas on this, even if they
sound totally crazy, feel free to bring them up.

Cheers,
Christian.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 20:26                                   ` Felix Kuehling
@ 2016-11-25 20:48                                     ` Serguei Sagalovitch
  0 siblings, 0 replies; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-25 20:48 UTC (permalink / raw)
  To: Felix Kuehling, Logan Gunthorpe, Christian König,
	Jason Gunthorpe, Dan Williams
  Cc: Deucher, Alexander, linux-nvdimm@lists.01.org, linux-rdma,
	linux-pci, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Haggai Eran


On 2016-11-25 03:26 PM, Felix Kuehling wrote:
> On 16-11-25 12:20 PM, Serguei Sagalovitch wrote:
>>> A white list may end up being rather complicated if it has to cover
>>> different CPU generations and system architectures. I feel this is a
>>> decision user space could easily make.
>>>
>>> Logan
>> I agreed that it is better to leave up to user space to check what is
>> working
>> and what is not. I found that write is practically always working but
>> read very
>> often not. Also sometimes system BIOS update could fix the issue.
>>
> But is user mode always aware that P2P is going on or even possible? For
> example you may have a library reading a buffer from a file, but it
> doesn't necessarily know where that buffer is located (system memory,
> VRAM, ...) and it may not know what kind of the device the file is on
> (SATA drive, NVMe SSD, ...). The library will never know if all it gets
> is a pointer and a file descriptor.
>
> The library ends up calling a read system call. Then it would be up to
> the kernel to figure out the most efficient way to read the buffer from
> the file. If supported, it could use P2P between a GPU and NVMe where
> the NVMe device performs a DMA write to VRAM.
>
> If you put the burden of figuring out the P2P details on user mode code,
> I think it will severely limit the use cases that actually take
> advantage of it. You also risk a bunch of different implementations that
> get it wrong half the time on half the systems out there.
>
> Regards,
>    Felix
>
>
I agree with you in theory, but I must admit that I do not know how the
kernel could effectively collect all this information without running
pretty complicated tests each time on boot-up (whenever any configuration
changes, including BIOS settings) and on PnP events. Also, to do this
efficiently the kernel needs to know performance results (which can also
depend on clock / power mode) for reads/writes between each pair of
devices; for double-buffering it needs to know / detect on which NUMA node
to allocate, etc.  Also, a device may only be fully configured on the
first request for access, so initialization sequences might need to
change.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 20:40                             ` Christian König
@ 2016-11-25 20:51                               ` Felix Kuehling
  2016-11-25 21:18                               ` Jason Gunthorpe
  1 sibling, 0 replies; 126+ messages in thread
From: Felix Kuehling @ 2016-11-25 20:51 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe, Christian König
  Cc: Haggai Eran, linux-rdma, linux-nvdimm@lists.01.org,
	Serguei Sagalovitch, linux-kernel, dri-devel, Blinzer, Paul,
	Suthikulpanit, Suravee, linux-pci, Deucher, Alexander,
	Dan Williams, Logan Gunthorpe, Sander, Ben, Linux-media


On 16-11-25 03:40 PM, Christian König wrote:
> Am 25.11.2016 um 20:32 schrieb Jason Gunthorpe:
>> This assumes the commands are fairly short lived of course, the
>> expectation of the mmu notifiers is that a flush is reasonably prompt
>
> Correct, this is another problem. GFX command submissions usually
> don't take longer than a few milliseconds, but compute command
> submission can easily take multiple hours.
>
> I can easily imagine what would happen when kswapd is blocked by a GPU
> command submission for an hour or so while the system is under memory
> pressure :)
>
> I'm thinking on this problem for about a year now and going in circles
> for quite a while. So if you have ideas on this even if they sound
> totally crazy, feel free to come up.

Our GPUs (at least starting with VI) support compute-wave-save-restore
and can swap out compute queues with fairly low latency. Yes, there is
some overhead (both memory usage and time), but it's a fairly regular
thing with our hardware scheduler (firmware, actually) when we need to
preempt running compute queues to update runlists or when we overcommit
the hardware queue resources.

Regards,
  Felix

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 20:40                             ` Christian König
  2016-11-25 20:51                               ` Felix Kuehling
@ 2016-11-25 21:18                               ` Jason Gunthorpe
  1 sibling, 0 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-25 21:18 UTC (permalink / raw)
  To: Christian König
  Cc: Christian König, Haggai Eran, linux-rdma,
	linux-nvdimm@lists.01.org, Kuehling, Felix, Serguei Sagalovitch,
	linux-kernel, dri-devel, Blinzer, Paul, Suthikulpanit, Suravee,
	linux-pci, Deucher, Alexander, Dan Williams, Logan Gunthorpe,
	Sander, Ben, Linux-media

On Fri, Nov 25, 2016 at 09:40:10PM +0100, Christian König wrote:

> We call this "userptr" and it's just a combination of get_user_pages() on
> command submission and making sure the returned list of pages stays valid
> using a MMU notifier.

Doesn't that still pin the page?

> The "big" problem with this approach is that it is horrible slow. I mean
> seriously horrible slow so that we actually can't use it for some of the
> purposes we wanted to use it.
> 
> >The code moving the page will move it and the next GPU command that
> >needs it will refault it in the usual way, just like the CPU would.
> 
> And here comes the problem. CPU do this on a page by page basis, so they
> fault only what needed and everything else gets filled in on demand. This
> results that faulting a page is relatively light weight operation.
>
> But for GPU command submission we don't know which pages might be accessed
> beforehand, so what we do is walking all possible pages and make sure all of
> them are present.

I'm a little confused about why this is slow. So you fault the entire user
MM into your page tables at start of day and keep track of it with mmu
notifiers?

> >This might be much more efficient since it optimizes for the common
> >case of unchanging translation tables.
> 
> Yeah, completely agree. It works perfectly fine as long as you don't have
> two drivers trying to mess with the same page.

Well, the idea would be to not have the GPU block the other driver
beyond hinting that the page shouldn't be swapped out.

> >This assumes the commands are fairly short lived of course, the
> >expectation of the mmu notifiers is that a flush is reasonably prompt
> 
> Correct, this is another problem. GFX command submissions usually don't take
> longer than a few milliseconds, but compute command submission can easily
> take multiple hours.

So, that won't work - you have the same issue as RDMA with workloads
like that.

If you can't somehow fence the hardware then pinning is the only
solution. Felix has the right kind of suggestion for what is needed -
globally stop the GPU, fence the DMA, fix the page tables, and start
it up again. :\
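
In pseudo-C, the sequence would be roughly the following (every helper
here is hypothetical; only the ordering matters):

struct gpu_ctx;

void gpu_preempt_queues(struct gpu_ctx *ctx);	/* save wave state, stop new work */
void gpu_wait_for_fences(struct gpu_ctx *ctx);	/* no DMA left in flight */
void gpu_update_page_tables(struct gpu_ctx *ctx,
			    unsigned long start, unsigned long end);
void gpu_resume_queues(struct gpu_ctx *ctx);	/* restore waves, continue */

/* Invoked e.g. from an MMU notifier when the kernel wants to move a page
 * that a long-running compute job may be using. */
static void gpu_fixup_translation(struct gpu_ctx *ctx,
				  unsigned long start, unsigned long end)
{
	gpu_preempt_queues(ctx);
	gpu_wait_for_fences(ctx);
	gpu_update_page_tables(ctx, start, end);
	gpu_resume_queues(ctx);
}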

> I can easily imagine what would happen when kswapd is blocked by a GPU
> command submission for an hour or so while the system is under memory
> pressure :)

Right. The advantage of pinning is that it tells the other stuff not to
touch the page without blocking it; MMU notifiers have to be able to
block & fence quickly.

> I'm thinking on this problem for about a year now and going in circles for
> quite a while. So if you have ideas on this even if they sound totally
> crazy, feel free to come up.

Well, it isn't a software problem. From what I've seen in this thread
the GPU application requires coherent page table mirroring, so the
only full & complete solution is going to be to actually implement
that somehow in GPU hardware.

Everything else is going to be deeply flawed somehow. Linux just
doesn't have the support for this kind of stuff - and I'm honestly not
sure something better is even possible considering the hardware
constraints....

This doesn't have to be faulting, but really anything that lets you
pause the GPU DMA and reload the page tables.

You might look at trying to use the IOMMU and/or PCI ATS in very new
hardware. IIRC the physical IOMMU hardware can do the fault and fence
and block stuff, but I'm not sure about software support for using the
IOMMU to create coherent user page table mirrors - that is something
Linux doesn't do today. But there is demand for this kind of capability..

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 19:34                             ` Jason Gunthorpe
  2016-11-25 19:49                               ` Serguei Sagalovitch
@ 2016-11-25 23:41                               ` Alex Deucher
  1 sibling, 0 replies; 126+ messages in thread
From: Alex Deucher @ 2016-11-25 23:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Serguei Sagalovitch, Christian König, Logan Gunthorpe,
	Dan Williams, Deucher, Alexander, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, Bridgman, John,
	linux-kernel, dri-devel, Sander, Ben, Suthikulpanit, Suravee,
	Blinzer, Paul, Linux-media, Haggai Eran

On Fri, Nov 25, 2016 at 2:34 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:
>
>> b) Allocation may not  have CPU address  at all - only GPU one.
>
> But you don't expect RDMA to work in the case, right?
>
> GPU people need to stop doing this windowed memory stuff :)
>

Blame 32-bit systems and GPUs with tons of VRAM :)

I think resizable BARs are finally coming in a useful way, so this
should go away soon.

Alex

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 19:32                           ` Jason Gunthorpe
  2016-11-25 20:40                             ` Christian König
@ 2016-11-27  8:16                             ` Haggai Eran
  2016-11-27 14:02                             ` Haggai Eran
  2 siblings, 0 replies; 126+ messages in thread
From: Haggai Eran @ 2016-11-27  8:16 UTC (permalink / raw)
  To: Jason Gunthorpe, Christian König
  Cc: Logan Gunthorpe, Serguei Sagalovitch, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Sander,
	Ben, Suthikulpanit, Suravee, Blinzer, Paul, Linux-media

On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
> 
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>>
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
> 
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
> 
> Yes, it is DMA, so this is a valid approach.
> 
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
> 
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.
I think blocking mmu notifiers against something that is basically
controlled by user-space can be problematic. This can block things like
memory reclaim. If you have user-space access to the device's queues,
user-space can block the mmu notifier forever.

On PeerDirect, we have some kind of a middle-ground solution for pinning
GPU memory. We create a non-ODP MR pointing to VRAM but rely on
user-space and the GPU not to migrate it. If they do, the MR gets
destroyed immediately. This should work on legacy devices without ODP
support, and allows the system to safely terminate a process that
misbehaves. The downside of course is that it cannot transparently
migrate memory but I think for user-space RDMA doing that transparently
requires hardware support for paging, via something like HMM.

...

> I'm hearing most people say ZONE_DEVICE is the way to handle this,
> which means the missing remaing piece for RDMA is some kind of DMA
> core support for p2p address translation..

Yes, this is definitely something we need. I think Will Davis's patches
are a good start.

Another thing I think is that while HMM is good for user-space
applications, for kernel p2p use there is no need for that. Using
ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
pages for the short duration as you wrote above could work fine for
kernel uses in which we can guarantee they are short.

Haggai

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-25 19:32                           ` Jason Gunthorpe
  2016-11-25 20:40                             ` Christian König
  2016-11-27  8:16                             ` Haggai Eran
@ 2016-11-27 14:02                             ` Haggai Eran
  2016-11-27 14:07                               ` Christian König
                                                 ` (2 more replies)
  2 siblings, 3 replies; 126+ messages in thread
From: Haggai Eran @ 2016-11-27 14:02 UTC (permalink / raw)
  To: Jason Gunthorpe, Christian König
  Cc: Logan Gunthorpe, Serguei Sagalovitch, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Sander,
	Ben, Suthikulpanit, Suravee, Blinzer, Paul, Linux-media,
	Max Gurtovoy

On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>>
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
>
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
>
> Yes, it is DMA, so this is a valid approach.
>
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
>
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.
I think blocking mmu notifiers against something that is basically
controlled by user-space can be problematic. This can block things like
memory reclaim. If you have user-space access to the device's queues,
user-space can block the mmu notifier forever.

On PeerDirect, we have some kind of a middle-ground solution for pinning
GPU memory. We create a non-ODP MR pointing to VRAM but rely on
user-space and the GPU not to migrate it. If they do, the MR gets
destroyed immediately. This should work on legacy devices without ODP
support, and allows the system to safely terminate a process that
misbehaves. The downside of course is that it cannot transparently
migrate memory but I think for user-space RDMA doing that transparently
requires hardware support for paging, via something like HMM.

...

> I'm hearing most people say ZONE_DEVICE is the way to handle this,
> which means the missing remaing piece for RDMA is some kind of DMA
> core support for p2p address translation..

Yes, this is definitely something we need. I think Will Davis's patches
are a good start.

Another thing I think is that while HMM is good for user-space
applications, for kernel p2p use there is no need for that. Using
ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
pages for the short duration as you wrote above could work fine for
kernel uses in which we can guarantee they are short.

Haggai

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-27 14:02                             ` Haggai Eran
@ 2016-11-27 14:07                               ` Christian König
  2016-11-28  5:31                                 ` zhoucm1
  2016-11-28 14:48                               ` Serguei Sagalovitch
  2016-11-28 16:57                               ` Jason Gunthorpe
  2 siblings, 1 reply; 126+ messages in thread
From: Christian König @ 2016-11-27 14:07 UTC (permalink / raw)
  To: Haggai Eran, Jason Gunthorpe
  Cc: Logan Gunthorpe, Serguei Sagalovitch, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Sander,
	Ben, Suthikulpanit, Suravee, Blinzer, Paul, Linux-media,
	Max Gurtovoy

Am 27.11.2016 um 15:02 schrieb Haggai Eran:
> On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
>> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>>
>>>> Like you say below we have to handle short lived in the usual way, and
>>>> that covers basically every device except IB MRs, including the
>>>> command queue on a NVMe drive.
>>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>>> page table to mirror the CPU page table, they usually can't recover from
>>> page faults.
>>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>>> in place while those jobs run (pretty much the same pinning you do for the
>>> DMA).
>> Yes, it is DMA, so this is a valid approach.
>>
>> But, you don't need page faults from the GPU to do proper coherent
>> page table mirroring. Basically when the driver submits the work to
>> the GPU it 'faults' the pages into the CPU and mirror translation
>> table (instead of pinning).
>>
>> Like in ODP, MMU notifiers/HMM are used to monitor for translation
>> changes. If a change comes in the GPU driver checks if an executing
>> command is touching those pages and blocks the MMU notifier until the
>> command flushes, then unfaults the page (blocking future commands) and
>> unblocks the mmu notifier.
> I think blocking mmu notifiers against something that is basically
> controlled by user-space can be problematic. This can block things like
> memory reclaim. If you have user-space access to the device's queues,
> user-space can block the mmu notifier forever.
Really good point.

I think this means that the bare minimum, if we don't have recoverable
page faults, is to have preemption support like Felix described in his
answer as well.

Going to keep that in mind,
Christian.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-27 14:07                               ` Christian König
@ 2016-11-28  5:31                                 ` zhoucm1
  0 siblings, 0 replies; 126+ messages in thread
From: zhoucm1 @ 2016-11-28  5:31 UTC (permalink / raw)
  To: Christian König, Haggai Eran, Jason Gunthorpe, Yu, Qiang
  Cc: linux-rdma, linux-nvdimm@lists.01.org, Kuehling, Felix,
	Serguei Sagalovitch, linux-kernel, dri-devel, Blinzer, Paul,
	Suthikulpanit, Suravee, linux-pci, Deucher, Alexander,
	Max Gurtovoy, Dan Williams, Logan Gunthorpe, Sander, Ben,
	Linux-media

+Qiang, who is working on it.

On 2016-11-27 22:07, Christian König wrote:
> Am 27.11.2016 um 15:02 schrieb Haggai Eran:
>> On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
>>> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>>>
>>>>> Like you say below we have to handle short lived in the usual way, 
>>>>> and
>>>>> that covers basically every device except IB MRs, including the
>>>>> command queue on a NVMe drive.
>>>> Well a problem which wasn't mentioned so far is that while GPUs do 
>>>> have a
>>>> page table to mirror the CPU page table, they usually can't recover 
>>>> from
>>>> page faults.
>>>> So what we do is making sure that all memory accessed by the GPU 
>>>> Jobs stays
>>>> in place while those jobs run (pretty much the same pinning you do 
>>>> for the
>>>> DMA).
>>> Yes, it is DMA, so this is a valid approach.
>>>
>>> But, you don't need page faults from the GPU to do proper coherent
>>> page table mirroring. Basically when the driver submits the work to
>>> the GPU it 'faults' the pages into the CPU and mirror translation
>>> table (instead of pinning).
>>>
>>> Like in ODP, MMU notifiers/HMM are used to monitor for translation
>>> changes. If a change comes in the GPU driver checks if an executing
>>> command is touching those pages and blocks the MMU notifier until the
>>> command flushes, then unfaults the page (blocking future commands) and
>>> unblocks the mmu notifier.
>> I think blocking mmu notifiers against something that is basically
>> controlled by user-space can be problematic. This can block things like
>> memory reclaim. If you have user-space access to the device's queues,
>> user-space can block the mmu notifier forever.
> Really good point.
>
> I think this means the bare minimum if we don't have recoverable page 
> faults is to have preemption support like Felix described in his 
> answer as well.
>
> Going to keep that in mind,
> Christian.
>

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-27 14:02                             ` Haggai Eran
  2016-11-27 14:07                               ` Christian König
@ 2016-11-28 14:48                               ` Serguei Sagalovitch
  2016-11-28 18:36                                 ` Haggai Eran
  2016-11-28 16:57                               ` Jason Gunthorpe
  2 siblings, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-28 14:48 UTC (permalink / raw)
  To: Haggai Eran, Jason Gunthorpe, Christian König
  Cc: Logan Gunthorpe, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Max Gurtovoy

On 2016-11-27 09:02 AM, Haggai Eran wrote
> On PeerDirect, we have some kind of a middle-ground solution for pinning
> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> user-space and the GPU not to migrate it. If they do, the MR gets
> destroyed immediately. This should work on legacy devices without ODP
> support, and allows the system to safely terminate a process that
> misbehaves. The downside of course is that it cannot transparently
> migrate memory but I think for user-space RDMA doing that transparently
> requires hardware support for paging, via something like HMM.
>
> ...
Maybe I am wrong, but my understanding is that the PeerDirect logic
basically follows the "RDMA register MR" logic, so basically nothing
prevents "terminating" the process in the "MMU notifier" case when we are
very low on memory, making it similar to (not worse than) the PeerDirect
case.
>> I'm hearing most people say ZONE_DEVICE is the way to handle this,
>> which means the missing remaing piece for RDMA is some kind of DMA
>> core support for p2p address translation..
> Yes, this is definitely something we need. I think Will Davis's patches
> are a good start.
>
> Another thing I think is that while HMM is good for user-space
> applications, for kernel p2p use there is no need for that.
About HMM: I do not think that in its current form HMM would fit the
requirements of the generic P2P transfer case. My understanding is that at
the current stage HMM is good for "caching" system memory in device memory
for fast GPU access, but in the RDMA non-ODP MR case it will not work
because the location of the memory must not change, so the memory should
be allocated directly in PCIe memory.
> Using ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
> pages for the short duration as you wrote above could work fine for
> kernel uses in which we can guarantee they are short.
Potentially there is another issue related to pin/unpin. If the memory is
used many times, there is no point in rebuilding and programming the s/g
tables each time if the location of the memory has not changed.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-27 14:02                             ` Haggai Eran
  2016-11-27 14:07                               ` Christian König
  2016-11-28 14:48                               ` Serguei Sagalovitch
@ 2016-11-28 16:57                               ` Jason Gunthorpe
  2016-11-28 18:19                                 ` Haggai Eran
  2016-11-28 18:20                                 ` Logan Gunthorpe
  2 siblings, 2 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-28 16:57 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Christian König, Logan Gunthorpe, Serguei Sagalovitch,
	Dan Williams, Deucher, Alexander, linux-nvdimm@lists.01.org,
	linux-rdma, linux-pci, Kuehling, Felix, Bridgman, John,
	linux-kernel, dri-devel, Sander, Ben, Suthikulpanit, Suravee,
	Blinzer, Paul, Linux-media, Max Gurtovoy

On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:

> > Like in ODP, MMU notifiers/HMM are used to monitor for translation
> > changes. If a change comes in the GPU driver checks if an executing
> > command is touching those pages and blocks the MMU notifier until the
> > command flushes, then unfaults the page (blocking future commands) and
> > unblocks the mmu notifier.

> I think blocking mmu notifiers against something that is basically
> controlled by user-space can be problematic. This can block things like
> memory reclaim. If you have user-space access to the device's queues,
> user-space can block the mmu notifier forever.

Right, I mentioned that..

> On PeerDirect, we have some kind of a middle-ground solution for pinning
> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> user-space and the GPU not to migrate it. If they do, the MR gets
> destroyed immediately.

That sounds horrible. How can that possibly work? What if the MR is
being used when the GPU decides to migrate? I would not support that
upstream without a lot more explanation..

I know people don't like requiring new hardware, but in this case we
really do need ODP hardware to get all the semantics people want..

> Another thing I think is that while HMM is good for user-space
> applications, for kernel p2p use there is no need for that. Using

From what I understand we are not really talking about kernel p2p;
everything proposed so far is being mediated by a userspace VMA, so
I'd focus on making that work.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-28 16:57                               ` Jason Gunthorpe
@ 2016-11-28 18:19                                 ` Haggai Eran
  2016-11-28 19:02                                   ` Jason Gunthorpe
  2016-11-28 18:20                                 ` Logan Gunthorpe
  1 sibling, 1 reply; 126+ messages in thread
From: Haggai Eran @ 2016-11-28 18:19 UTC (permalink / raw)
  To: jgunthorpe
  Cc: linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit, John.Bridgman, Alexander.Deucher,
	Linux-media, dan.j.williams, logang, dri-devel, Max Gurtovoy,
	linux-pci, serguei.sagalovitch, Paul.Blinzer, Felix.Kuehling,
	ben.sander

On Mon, 2016-11-28 at 09:57 -0700, Jason Gunthorpe wrote:
> On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:
> > I think blocking mmu notifiers against something that is basically
> > controlled by user-space can be problematic. This can block things
> > like
> > memory reclaim. If you have user-space access to the device's
> > queues,
> > user-space can block the mmu notifier forever.
> Right, I mentioned that..
Sorry, I must have missed it.

> > On PeerDirect, we have some kind of a middle-ground solution for
> > pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately.
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate? 
Naturally this doesn't support migration. The GPU is expected to pin
these pages as long as the MR lives. The MR invalidation is done only as
a last resort to preserve system correctness.

I think it is similar to how non-ODP MRs rely on user-space today to
keep them correct. If you do something like madvise(MADV_DONTNEED) on a
non-ODP MR's pages, you can still get yourself into a data corruption
situation (HCA sees one page and the process sees another for the same
virtual address). The pinning that we use only guarantees the HCA's page
won't be reused.

> I would not support that
> upstream without a lot more explanation..
> 
> I know people don't like requiring new hardware, but in this case we
> really do need ODP hardware to get all the semantics people want..
> 
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that. Using
> From what I understand we are not really talking about kernel p2p,
> everything proposed so far is being mediated by a userspace VMA, so
> I'd focus on making that work.
Fair enough, although we will need both eventually, and I hope the
infrastructure can be shared to some degree.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-28 16:57                               ` Jason Gunthorpe
  2016-11-28 18:19                                 ` Haggai Eran
@ 2016-11-28 18:20                                 ` Logan Gunthorpe
  2016-11-28 19:35                                   ` Serguei Sagalovitch
  1 sibling, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-11-28 18:20 UTC (permalink / raw)
  To: Jason Gunthorpe, Haggai Eran
  Cc: Christian König, Serguei Sagalovitch, Dan Williams, Deucher,
	Alexander, linux-nvdimm@lists.01.org, linux-rdma, linux-pci,
	Kuehling, Felix, Bridgman, John, linux-kernel, dri-devel, Sander,
	Ben, Suthikulpanit, Suravee, Blinzer, Paul, Linux-media,
	Max Gurtovoy



On 28/11/16 09:57 AM, Jason Gunthorpe wrote:
>> On PeerDirect, we have some kind of a middle-ground solution for pinning
>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>> user-space and the GPU not to migrate it. If they do, the MR gets
>> destroyed immediately.
> 
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate? I would not support that
> upstream without a lot more explanation..

Yup, this was our experience when playing around with PeerDirect. There
was nothing we could do if the GPU decided to invalidate the P2P
mapping. It just meant the application would fail or need complicated
logic to detect this and redo just about everything. And given that it
was a reasonably rare occurrence during development it probably means
not a lot of applications will be developed to handle it and most would
end up being randomly broken in environments with memory pressure.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-28 14:48                               ` Serguei Sagalovitch
@ 2016-11-28 18:36                                 ` Haggai Eran
  0 siblings, 0 replies; 126+ messages in thread
From: Haggai Eran @ 2016-11-28 18:36 UTC (permalink / raw)
  To: jgunthorpe, christian.koenig, serguei.sagalovitch
  Cc: linux-kernel, linux-rdma, linux-nvdimm, Suravee.Suthikulpanit,
	Linux-media, John.Bridgman, Alexander.Deucher, dan.j.williams,
	logang, dri-devel, Max Gurtovoy, linux-pci, Paul.Blinzer,
	Felix.Kuehling, ben.sander

On Mon, 2016-11-28 at 09:48 -0500, Serguei Sagalovitch wrote:
> On 2016-11-27 09:02 AM, Haggai Eran wrote
> > 
> > On PeerDirect, we have some kind of a middle-ground solution for
> > pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately. This should work on legacy devices without
> > ODP
> > support, and allows the system to safely terminate a process that
> > misbehaves. The downside of course is that it cannot transparently
> > migrate memory but I think for user-space RDMA doing that
> > transparently
> > requires hardware support for paging, via something like HMM.
> > 
> > ...
> May be I am wrong but my understanding is that PeerDirect logic
> basically
> follow  "RDMA register MR" logic 
Yes. The only difference from regular MRs is the invalidation process I
mentioned, and the fact that we get the addresses not from
get_user_pages but from a peer driver.

> so basically nothing prevent to "terminate"
> process for "MMU notifier" case when we are very low on memory
> not making it similar (not worse) then PeerDirect case.
I'm not sure I understand. I don't think any solution prevents
terminating an application. The paragraph above is just trying to
explain how a non-ODP device/MR can handle an invalidation.

> > > I'm hearing most people say ZONE_DEVICE is the way to handle this,
> > > which means the missing remaing piece for RDMA is some kind of DMA
> > > core support for p2p address translation..
> > Yes, this is definitely something we need. I think Will Davis's
> > patches
> > are a good start.
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that.
> About HMM: I do not think that in the current form HMM would  fit in
> requirement for generic P2P transfer case. My understanding is that at
> the current stage HMM is good for "caching" system memory
> in device memory for fast GPU access but in RDMA MR non-ODP case
> it will not work because  the location of memory should not be
> changed so memory should be allocated directly in PCIe memory.
The way I see it there are two ways to handle non-ODP MRs. Either you
prevent the GPU from migrating / reusing the MR's VRAM pages for as long
as the MR is alive (if I understand correctly you didn't like this
solution), or you allow the GPU to somehow notify the HCA to invalidate
the MR. If you do that, you can use mmu notifiers or HMM or something
else, but HMM provides a nice framework to facilitate that notification.
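
As a sketch of that second option (all names invented here; this is not
the actual PeerDirect interface):

/* The RDMA core creating the MR hands the exporting (GPU) driver a
 * callback to invoke if the VRAM backing the MR really must be reclaimed;
 * until then the exporter keeps the pages pinned in place. */
typedef void (*p2p_invalidate_cb)(void *mr_context);

/* Hypothetical: called when creating a non-ODP MR over device memory;
 * returns 0 if the exporter agreed to pin the range. */
int p2p_region_pin(void *exporter_handle, unsigned long offset, size_t len,
		   p2p_invalidate_cb invalidate, void *mr_context);
void p2p_region_unpin(void *exporter_handle, unsigned long offset,
		      size_t len);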

> > 
> > Using ZONE_DEVICE with or without something like DMA-BUF to pin and
> > unpin
> > pages for the short duration as you wrote above could work fine for
> > kernel uses in which we can guarantee they are short.
> Potentially there is another issue related to pin/unpin. If memory
> could
> be used a lot of time then there is no sense to rebuild and program
> s/g tables each time if location of memory was not changed.
Is this about the kernel use or user-space? In user-space I think the MR
concept captures a long-lived s/g table so you don't need to rebuild it
(unless the mapping changes).

Haggai

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-28 18:19                                 ` Haggai Eran
@ 2016-11-28 19:02                                   ` Jason Gunthorpe
  2016-11-30 10:45                                     ` Haggai Eran
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-28 19:02 UTC (permalink / raw)
  To: Haggai Eran
  Cc: linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit, John.Bridgman, Alexander.Deucher,
	Linux-media, dan.j.williams, logang, dri-devel, Max Gurtovoy,
	linux-pci, serguei.sagalovitch, Paul.Blinzer, Felix.Kuehling,
	ben.sander

On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
> > > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > > user-space and the GPU not to migrate it. If they do, the MR gets
> > > destroyed immediately.
> > That sounds horrible. How can that possibly work? What if the MR is
> > being used when the GPU decides to migrate? 
> Naturally this doesn't support migration. The GPU is expected to pin
> these pages as long as the MR lives. The MR invalidation is done only as
> a last resort to keep system correctness.

That just forces applications to handle horrible unexpected
failures. If this sort of thing is needed for correctness then OOM
kill the offending process, don't corrupt its operation.

> I think it is similar to how non-ODP MRs rely on user-space today to
> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
> non-ODP MR's pages, you can still get yourself into a data corruption
> situation (HCA sees one page and the process sees another for the same
> virtual address). The pinning that we use only guarentees the HCA's page
> won't be reused.

That is not really data corruption - the data still goes where it was
originally destined. That is an application violating the
requirements of an MR. An application cannot munmap/mremap a VMA
while a non-ODP MR points to it and then keep using the MR.

That is totally different from a GPU driver wanting to mess with the
translation to physical pages.

> > From what I understand we are not really talking about kernel p2p,
> > everything proposed so far is being mediated by a userspace VMA, so
> > I'd focus on making that work.

> Fair enough, although we will need both eventually, and I hope the
> infrastructure can be shared to some degree.

What use case do you see for in-kernel p2p?

Presumably in-kernel could use a vmap or something and the same basic
flow?

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-28 18:20                                 ` Logan Gunthorpe
@ 2016-11-28 19:35                                   ` Serguei Sagalovitch
  2016-11-28 21:36                                     ` Logan Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-28 19:35 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe, Haggai Eran
  Cc: Christian König, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Max Gurtovoy

On 2016-11-28 01:20 PM, Logan Gunthorpe wrote:
>
> On 28/11/16 09:57 AM, Jason Gunthorpe wrote:
>>> On PeerDirect, we have some kind of a middle-ground solution for pinning
>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>>> user-space and the GPU not to migrate it. If they do, the MR gets
>>> destroyed immediately.
>> That sounds horrible. How can that possibly work? What if the MR is
>> being used when the GPU decides to migrate? I would not support that
>> upstream without a lot more explanation..
> Yup, this was our experience when playing around with PeerDirect. There
> was nothing we could do if the GPU decided to invalidate the P2P
> mapping.
As soon as the PeerDirect mapping is made, the GPU must not "move" such
memory.  That is by PeerDirect design. It is similar to how it works
with system memory and RDMA MRs: when get_user_pages() is called, the
memory is pinned.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-28 19:35                                   ` Serguei Sagalovitch
@ 2016-11-28 21:36                                     ` Logan Gunthorpe
  2016-11-28 21:55                                       ` Serguei Sagalovitch
  0 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-11-28 21:36 UTC (permalink / raw)
  To: Serguei Sagalovitch, Jason Gunthorpe, Haggai Eran
  Cc: Christian König, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Max Gurtovoy


On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:
> As soon as a PeerDirect mapping is created, the GPU must not "move" that
> memory.  That is by PeerDirect design. It is similar to how it works with
> system memory and an RDMA MR: when get_user_pages() is called, the memory
> is pinned.

We haven't touched this in a long time and perhaps it changed, but there
definitely was a callback in the PeerDirect API to allow the GPU to
invalidate the mapping. That's what we don't want.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-28 21:36                                     ` Logan Gunthorpe
@ 2016-11-28 21:55                                       ` Serguei Sagalovitch
  2016-11-28 22:24                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-28 21:55 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe, Haggai Eran
  Cc: Christian König, Dan Williams, Deucher, Alexander,
	linux-nvdimm@lists.01.org, linux-rdma, linux-pci, Kuehling,
	Felix, Bridgman, John, linux-kernel, dri-devel, Sander, Ben,
	Suthikulpanit, Suravee, Blinzer, Paul, Linux-media, Max Gurtovoy


On 2016-11-28 04:36 PM, Logan Gunthorpe wrote:
> On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:
>> As soon as a PeerDirect mapping is created, the GPU must not "move" that
>> memory.  That is by PeerDirect design. It is similar to how it works with
>> system memory and an RDMA MR: when get_user_pages() is called, the memory
>> is pinned.
> We haven't touched this in a long time and perhaps it changed, but there
> definitely was a callback in the PeerDirect API to allow the GPU to
> invalidate the mapping. That's what we don't want.
I assume that you are talking about the invalidate_peer_memory() callback?
I was told that it is the "last resort" because the HCA (and driver) is not
able to handle it in a safe manner, so it is basically "abort everything".

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-28 21:55                                       ` Serguei Sagalovitch
@ 2016-11-28 22:24                                         ` Jason Gunthorpe
  0 siblings, 0 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-28 22:24 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: Logan Gunthorpe, Haggai Eran, Christian K??nig, Dan Williams,
	Deucher, Alexander, linux-nvdimm@lists.01.org, linux-rdma,
	linux-pci, Kuehling, Felix, Bridgman, John, linux-kernel,
	dri-devel, Sander, Ben, Suthikulpanit, Suravee, Blinzer, Paul,
	Linux-media, Max Gurtovoy

On Mon, Nov 28, 2016 at 04:55:23PM -0500, Serguei Sagalovitch wrote:

> >We haven't touched this in a long time and perhaps it changed, but there
> >definitely was a callback in the PeerDirect API to allow the GPU to
> >invalidate the mapping. That's what we don't want.

> I assume that you are talking about the invalidate_peer_memory() callback?
> I was told that it is the "last resort" because the HCA (and driver) is not
> able to handle it in a safe manner, so it is basically "abort everything".

If it is a last resort to save system stability then kill the impacted
process, that will release the MRs.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-28 19:02                                   ` Jason Gunthorpe
@ 2016-11-30 10:45                                     ` Haggai Eran
  2016-11-30 16:23                                       ` Jason Gunthorpe
  2016-11-30 17:10                                       ` Deucher, Alexander
  0 siblings, 2 replies; 126+ messages in thread
From: Haggai Eran @ 2016-11-30 10:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit, John.Bridgman, Alexander.Deucher,
	Linux-media, dan.j.williams, logang, dri-devel, Max Gurtovoy,
	linux-pci, serguei.sagalovitch, Paul.Blinzer, Felix.Kuehling,
	ben.sander

On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
>>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>>>> user-space and the GPU not to migrate it. If they do, the MR gets
>>>> destroyed immediately.
>>> That sounds horrible. How can that possibly work? What if the MR is
>>> being used when the GPU decides to migrate? 
>> Naturally this doesn't support migration. The GPU is expected to pin
>> these pages as long as the MR lives. The MR invalidation is done only as
>> a last resort to keep system correctness.
> 
> That just forces applications to handle horrible unexpected
> failures. If this sort of thing is needed for correctness then OOM
> kill the offending process, don't corrupt its operation.
Yes, that sounds fine. Can we simply kill the process from the GPU driver?
Or do we need to extend the OOM killer to manage GPU pages?

> 
>> I think it is similar to how non-ODP MRs rely on user-space today to
>> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
>> non-ODP MR's pages, you can still get yourself into a data corruption
>> situation (HCA sees one page and the process sees another for the same
>> virtual address). The pinning that we use only guarantees the HCA's page
>> won't be reused.
> 
> That is not really data corruption - the data still goes where it was
> originally destined. That is an application violating the
> requirements of a MR. 
I guess it is a matter of terminology. If you compare it to the ODP case 
or the CPU case then you usually expect a single virtual address to map to
a single physical page. Violating this causes some of your writes to be dropped,
which is data corruption in my book, even if the application caused it.

> An application cannot munmap/mremap a VMA
> while a non ODP MR points to it and then keep using the MR.
Right. And it is perfectly fine to have some similar requirements from the application
when doing peer to peer with a non-ODP MR. 

> That is totally different from a GPU driver wanting to mess with
> translation to physical pages.
> 
>>> From what I understand we are not really talking about kernel p2p,
>>> everything proposed so far is being mediated by a userspace VMA, so
>>> I'd focus on making that work.
> 
>> Fair enough, although we will need both eventually, and I hope the
>> infrastructure can be shared to some degree.
> 
> What use case do you see for in kernel?
Two cases I can think of are RDMA access to an NVMe device's controller 
memory buffer, and O_DIRECT operations that access GPU memory.
Also, HMM's migration between two GPUs could use peer to peer in the kernel,
although that is intended to be handled by the GPU driver if I understand
correctly.

> Presumably in-kernel could use a vmap or something and the same basic
> flow?
I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
of MMIO pfns, and ZONE_DEVICE allows that.
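
A minimal sketch of such a scatterlist, assuming the BAR range already has
struct pages (for example from devm_memremap_pages()); cmb_to_sgl() and the
per-page chunking are illustrative only:

#include <linux/scatterlist.h>
#include <linux/mm.h>
#include <linux/pfn.h>

static void cmb_to_sgl(struct scatterlist *sgl, phys_addr_t bar_base,
		       unsigned int nents)
{
	unsigned int i;

	sg_init_table(sgl, nents);
	for (i = 0; i < nents; i++) {
		/* Only valid because the BAR pfns are backed by ZONE_DEVICE pages. */
		struct page *pg = pfn_to_page(PHYS_PFN(bar_base) + i);

		sg_set_page(&sgl[i], pg, PAGE_SIZE, 0);
	}
}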

Haggai

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-30 10:45                                     ` Haggai Eran
@ 2016-11-30 16:23                                       ` Jason Gunthorpe
  2016-11-30 17:28                                         ` Serguei Sagalovitch
                                                           ` (2 more replies)
  2016-11-30 17:10                                       ` Deucher, Alexander
  1 sibling, 3 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-11-30 16:23 UTC (permalink / raw)
  To: Haggai Eran
  Cc: linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit, John.Bridgman, Alexander.Deucher,
	Linux-media, dan.j.williams, logang, dri-devel, Max Gurtovoy,
	linux-pci, serguei.sagalovitch, Paul.Blinzer, Felix.Kuehling,
	ben.sander

On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:

> > That just forces applications to handle horrible unexpected
> > failures. If this sort of thing is needed for correctness then OOM
> > kill the offending process, don't corrupt its operation.

> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
> Or do we need to extend the OOM killer to manage GPU pages?

I don't know..

> >>> From what I understand we are not really talking about kernel p2p,
> >>> everything proposed so far is being mediated by a userspace VMA, so
> >>> I'd focus on making that work.
> > 
> >> Fair enough, although we will need both eventually, and I hope the
> >> infrastructure can be shared to some degree.
> > 
> > What use case do you see for in kernel?

> Two cases I can think of are RDMA access to an NVMe device's controller
> memory buffer,

I'm not sure on the use model there..

> and O_DIRECT operations that access GPU memory.

This goes through user space so there is still a VMA..

> Also, HMM's migration between two GPUs could use peer to peer in the
> kernel, although that is intended to be handled by the GPU driver if
> I understand correctly.

Hum, presumably these migrations are VMA backed as well...

> > Presumably in-kernel could use a vmap or something and the same basic
> > flow?
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
> of MMIO pfns, and ZONE_DEVICE allows that.

Well, if there is no virtual map then we are back to how do you do
migrations and other things people seem to want to do on these
pages. Maybe the loose 'struct page' flow is not for those users.

But I think if you want kGPU or similar then you probably need vmaps
or something similar to represent the GPU pages in kernel memory.
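
Something like this minimal sketch, assuming the device pages have struct
pages and can be mapped cacheable (MMIO would likely want an uncached pgprot
or a plain ioremap() of the BAR instead); map_device_pages() is a made-up
name:

#include <linux/vmalloc.h>
#include <linux/mm.h>

static void *map_device_pages(struct page **pages, unsigned int npages)
{
	/* Build a contiguous kernel virtual mapping over the device pages. */
	return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
}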

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* RE: Enabling peer to peer device transactions for PCIe devices
  2016-11-30 10:45                                     ` Haggai Eran
  2016-11-30 16:23                                       ` Jason Gunthorpe
@ 2016-11-30 17:10                                       ` Deucher, Alexander
  1 sibling, 0 replies; 126+ messages in thread
From: Deucher, Alexander @ 2016-11-30 17:10 UTC (permalink / raw)
  To: 'Haggai Eran', Jason Gunthorpe
  Cc: linux-kernel, linux-rdma, linux-nvdimm, Koenig, Christian,
	Suthikulpanit, Suravee, Bridgman, John, Linux-media,
	dan.j.williams, logang, dri-devel, Max Gurtovoy, linux-pci,
	Sagalovitch, Serguei, Blinzer, Paul, Kuehling, Felix, Sander,
	Ben

> -----Original Message-----
> From: Haggai Eran [mailto:haggaie@mellanox.com]
> Sent: Wednesday, November 30, 2016 5:46 AM
> To: Jason Gunthorpe
> Cc: linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; linux-
> nvdimm@ml01.01.org; Koenig, Christian; Suthikulpanit, Suravee; Bridgman,
> John; Deucher, Alexander; Linux-media@vger.kernel.org;
> dan.j.williams@intel.com; logang@deltatee.com; dri-
> devel@lists.freedesktop.org; Max Gurtovoy; linux-pci@vger.kernel.org;
> Sagalovitch, Serguei; Blinzer, Paul; Kuehling, Felix; Sander, Ben
> Subject: Re: Enabling peer to peer device transactions for PCIe devices
> 
> On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
> >>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> >>>> user-space and the GPU not to migrate it. If they do, the MR gets
> >>>> destroyed immediately.
> >>> That sounds horrible. How can that possibly work? What if the MR is
> >>> being used when the GPU decides to migrate?
> >> Naturally this doesn't support migration. The GPU is expected to pin
> >> these pages as long as the MR lives. The MR invalidation is done only as
> >> a last resort to keep system correctness.
> >
> > That just forces applications to handle horrible unexpected
> > failures. If this sort of thing is needed for correctness then OOM
> > kill the offending process, don't corrupt its operation.
> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
> Or do we need to extend the OOM killer to manage GPU pages?

Christian sent out an RFC patch a while back that extended the OOM killer to cover memory allocated for the GPU:
https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html

Alex

> 
> >
> >> I think it is similar to how non-ODP MRs rely on user-space today to
> >> keep them correct. If you do something like madvise(MADV_DONTNEED)
> on a
> >> non-ODP MR's pages, you can still get yourself into a data corruption
> >> situation (HCA sees one page and the process sees another for the same
> >> virtual address). The pinning that we use only guarantees the HCA's page
> >> won't be reused.
> >
> > That is not really data corruption - the data still goes where it was
> > originally destined. That is an application violating the
> > requirements of a MR.
> I guess it is a matter of terminology. If you compare it to the ODP case
> or the CPU case then you usually expect a single virtual address to map to
> a single physical page. Violating this causes some of your writes to be dropped,
> which is data corruption in my book, even if the application caused it.
> 
> > An application cannot munmap/mremap a VMA
> > while a non ODP MR points to it and then keep using the MR.
> Right. And it is perfectly fine to have some similar requirements from the
> application
> when doing peer to peer with a non-ODP MR.
> 
> > That is totally different from a GPU driver wanting to mess with
> > translation to physical pages.
> >
> >>> From what I understand we are not really talking about kernel p2p,
> >>> everything proposed so far is being mediated by a userspace VMA, so
> >>> I'd focus on making that work.
> >
> >> Fair enough, although we will need both eventually, and I hope the
> >> infrastructure can be shared to some degree.
> >
> > What use case do you see for in kernel?
> Two cases I can think of are RDMA access to an NVMe device's controller
> memory buffer, and O_DIRECT operations that access GPU memory.
> Also, HMM's migration between two GPUs could use peer to peer in the
> kernel,
> although that is intended to be handled by the GPU driver if I understand
> correctly.
> 
> > Presumably in-kernel could use a vmap or something and the same basic
> > flow?
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API
> support
> for peer to peer. I'm not sure we need vmap. We need a way to have a
> scatterlist
> of MMIO pfns, and ZONE_DEVICE allows that.
> 
> Haggai

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-30 16:23                                       ` Jason Gunthorpe
@ 2016-11-30 17:28                                         ` Serguei Sagalovitch
  2016-12-04  7:33                                           ` Haggai Eran
  2016-11-30 18:01                                         ` Logan Gunthorpe
  2016-12-04  7:53                                         ` Haggai Eran
  2 siblings, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2016-11-30 17:28 UTC (permalink / raw)
  To: Jason Gunthorpe, Haggai Eran
  Cc: linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit, John.Bridgman, Alexander.Deucher,
	Linux-media, dan.j.williams, logang, dri-devel, Max Gurtovoy,
	linux-pci, Paul.Blinzer, Felix.Kuehling, ben.sander

On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:
>> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
>> Or do we need to extend the OOM killer to manage GPU pages?
> I don't know..
We could use send_sig_info() to send a signal from the kernel to user space,
so theoretically the GPU driver could issue a KILL signal to some process.
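
A rough sketch of that idea, assuming the older send_sig_info() signature
that takes struct siginfo (newer kernels use struct kernel_siginfo);
gpu_kill_owner() is a made-up name:

#include <linux/sched.h>
#include <linux/signal.h>
#include <linux/pid.h>

static void gpu_kill_owner(struct pid *owner)
{
	struct task_struct *task;

	rcu_read_lock();
	task = pid_task(owner, PIDTYPE_PID);
	if (task)
		/* Forcibly deliver SIGKILL; the exiting task releases its MRs. */
		send_sig_info(SIGKILL, SEND_SIG_PRIV, task);
	rcu_read_unlock();
}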

> On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:
>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
>> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
>> of MMIO pfns, and ZONE_DEVICE allows that.
I do not think that using the DMA-API as it is is the best solution (at
least in its current form):

- It deals with handles/fds for the whole allocation, but a client
  could/will use sub-allocation, and it is theoretically possible to
  "merge" several allocations into one from the GPU's perspective.
- It requires knowledge to export, but because "sharing" is controlled
  from user space it means that we must "export" all allocations by default.
- It deals with fds/handles, but a user application may work with
  addresses/pointers.

Also, the current DMA-API forces all of the DMA table programming to be
redone each time, regardless of whether the location changed or not. With
a VMA / MMU we are able to install a notifier to intercept changes in
location and update translation tables only as needed (we do not need to
keep the get_user_pages() lock).
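
A rough sketch of that notifier approach, using the invalidate_range_start()
signature of the kernels around this time (it has since moved to struct
mmu_notifier_range); gpu_update_translation() is a hypothetical driver hook:

#include <linux/mmu_notifier.h>

/* Hypothetical driver hook that re-programs the device page tables. */
void gpu_update_translation(struct mmu_notifier *mn,
			    unsigned long start, unsigned long end);

static void gpu_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end)
{
	gpu_update_translation(mn, start, end);
}

static const struct mmu_notifier_ops gpu_mn_ops = {
	.invalidate_range_start = gpu_invalidate_range_start,
};

static int gpu_watch_mm(struct mmu_notifier *mn, struct mm_struct *mm)
{
	mn->ops = &gpu_mn_ops;
	return mmu_notifier_register(mn, mm);
}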

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-30 16:23                                       ` Jason Gunthorpe
  2016-11-30 17:28                                         ` Serguei Sagalovitch
@ 2016-11-30 18:01                                         ` Logan Gunthorpe
  2016-12-04  7:42                                           ` Haggai Eran
  2016-12-04  7:53                                         ` Haggai Eran
  2 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-11-30 18:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Haggai Eran
  Cc: linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit, John.Bridgman, Alexander.Deucher,
	Linux-media, dan.j.williams, dri-devel, Max Gurtovoy, linux-pci,
	serguei.sagalovitch, Paul.Blinzer, Felix.Kuehling, ben.sander



On 30/11/16 09:23 AM, Jason Gunthorpe wrote:
>> Two cases I can think of are RDMA access to an NVMe device's controller
>> memory buffer,
> 
> I'm not sure on the use model there..

The NVMe fabrics stuff could probably make use of this. It's an
in-kernel system to allow remote access to an NVMe device over RDMA. So
they ought to be able to optimize their transfers by DMAing directly to
the NVMe's CMB -- no userspace interface would be required, but there
would need to be some kernel infrastructure.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-30 17:28                                         ` Serguei Sagalovitch
@ 2016-12-04  7:33                                           ` Haggai Eran
  0 siblings, 0 replies; 126+ messages in thread
From: Haggai Eran @ 2016-12-04  7:33 UTC (permalink / raw)
  To: Serguei Sagalovitch, Jason Gunthorpe
  Cc: linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit, John.Bridgman, Alexander.Deucher,
	Linux-media, dan.j.williams, logang, dri-devel, Max Gurtovoy,
	linux-pci, Paul.Blinzer, Felix.Kuehling, ben.sander

On 11/30/2016 7:28 PM, Serguei Sagalovitch wrote:
> On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:
>>> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
>>> Or do we need to extend the OOM killer to manage GPU pages?
>> I don't know..
> We could use send_sig_info() to send a signal from the kernel to user space,
> so theoretically the GPU driver could issue a KILL signal to some process.
> 
>> On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:
>>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
>>> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
>>> of MMIO pfns, and ZONE_DEVICE allows that.
> I do not think that using the DMA-API as it is is the best solution (at least in its current form):
> 
> - It deals with handles/fds for the whole allocation, but a client could/will use sub-allocation,
>   and it is theoretically possible to "merge" several allocations into one from the GPU's perspective.
> - It requires knowledge to export, but because "sharing" is controlled from user space it
>   means that we must "export" all allocations by default.
> - It deals with fds/handles, but a user application may work with addresses/pointers.

Aren't you confusing DMABUF and DMA-API? DMA-API is how you program the IOMMU (dma_map_page/dma_map_sg/etc.).
The comment above is just about the need to extend this API to allow mapping peer device pages to bus addresses.

In the past I sent an RFC for using DMABUF for peer to peer. I think it had some
advantages for legacy devices. I agree that working with addresses and pointers through
something like HMM/ODP is much more flexible and easier to program from user-space.
For legacy, DMABUF would have allowed you a way to pin the pages so the GPU knows not to
move them. However, that can probably also be achieved simply via the reference count
on ZONE_DEVICE pages. The other nice thing about DMABUF is that it migrates the buffer
itself during attachment according to the requirements of the device that is attaching,
so you can automatically decide in the exporter whether to use p2p or a staging buffer.
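
A minimal sketch of that importer-side flow (only the dma-buf calls are
real; the surrounding names are illustrative):

#include <linux/dma-buf.h>
#include <linux/err.h>

static struct sg_table *import_peer_buf(struct dma_buf *buf,
					struct device *importer)
{
	struct dma_buf_attachment *att;
	struct sg_table *sgt;

	att = dma_buf_attach(buf, importer);
	if (IS_ERR(att))
		return ERR_CAST(att);

	/* The exporter can migrate (or pin) the buffer here to suit 'importer'. */
	sgt = dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		dma_buf_detach(buf, att);
	return sgt;
}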

> 
> Also, the current DMA-API forces all of the DMA table programming to be redone each time,
> regardless of whether the location changed or not. With a VMA / MMU we are able to install
> a notifier to intercept changes in location and update translation tables only as needed
> (we do not need to keep the get_user_pages() lock).
I agree.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-30 18:01                                         ` Logan Gunthorpe
@ 2016-12-04  7:42                                           ` Haggai Eran
  2016-12-04 13:06                                             ` Stephen Bates
  2016-12-04 13:23                                             ` Stephen Bates
  0 siblings, 2 replies; 126+ messages in thread
From: Haggai Eran @ 2016-12-04  7:42 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe
  Cc: linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit, John.Bridgman, Alexander.Deucher,
	Linux-media, dan.j.williams, dri-devel, Max Gurtovoy, linux-pci,
	serguei.sagalovitch, Paul.Blinzer, Felix.Kuehling, ben.sander

On 11/30/2016 8:01 PM, Logan Gunthorpe wrote:
> 
> 
> On 30/11/16 09:23 AM, Jason Gunthorpe wrote:
>>> Two cases I can think of are RDMA access to an NVMe device's controller
>>> memory buffer,
>>
>> I'm not sure on the use model there..
> 
> The NVMe fabrics stuff could probably make use of this. It's an
> in-kernel system to allow remote access to an NVMe device over RDMA. So
> they ought to be able to optimize their transfers by DMAing directly to
> the NVMe's CMB -- no userspace interface would be required, but there
> would need to be some kernel infrastructure.

Yes, that's what I was thinking. The NVMe/f driver needs to map the CMB for
RDMA. I guess if it used ZONE_DEVICE like in the iopmem patches it would be
relatively easy to do.
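
A rough sketch of that, using devm_memremap_pages() as it looks around this
time (dev, res, percpu_ref, altmap); the interface may well change, and the
wrapper name is made up:

#include <linux/memremap.h>
#include <linux/percpu-refcount.h>
#include <linux/ioport.h>

static void *cmb_map_with_pages(struct device *dev,
				struct resource *bar_res,
				struct percpu_ref *ref)
{
	/* Creates ZONE_DEVICE struct pages covering the CMB BAR range. */
	return devm_memremap_pages(dev, bar_res, ref, NULL);
}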

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-30 16:23                                       ` Jason Gunthorpe
  2016-11-30 17:28                                         ` Serguei Sagalovitch
  2016-11-30 18:01                                         ` Logan Gunthorpe
@ 2016-12-04  7:53                                         ` Haggai Eran
  2 siblings, 0 replies; 126+ messages in thread
From: Haggai Eran @ 2016-12-04  7:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit, John.Bridgman, Alexander.Deucher,
	Linux-media, dan.j.williams, logang, dri-devel, Max Gurtovoy,
	linux-pci, serguei.sagalovitch, Paul.Blinzer, Felix.Kuehling,
	ben.sander

On 11/30/2016 6:23 PM, Jason Gunthorpe wrote:
>> and O_DIRECT operations that access GPU memory.
> This goes through user space so there is still a VMA..
> 
>> Also, HMM's migration between two GPUs could use peer to peer in the
>> kernel, although that is intended to be handled by the GPU driver if
>> I understand correctly.
> Hum, presumably these migrations are VMA backed as well...
I guess so.

>>> Presumably in-kernel could use a vmap or something and the same basic
>>> flow?
>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
>> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
>> of MMIO pfns, and ZONE_DEVICE allows that.
> Well, if there is no virtual map then we are back to how do you do
> migrations and other things people seem to want to do on these
> pages. Maybe the loose 'struct page' flow is not for those users.
I was thinking that kernel use cases would disallow migration, similar to how 
non-ODP MRs would work. Either they are short-lived (like an O_DIRECT transfer)
or they can be long-lived but non-migratable (like perhaps a CMB staging buffer).

> But I think if you want kGPU or similar then you probably need vmaps
> or something similar to represent the GPU pages in kernel memory.
Right, although sometimes the GPU pages are simply inaccessible to the CPU.
In any case, I haven't thought about kGPU as a use-case.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-04  7:42                                           ` Haggai Eran
@ 2016-12-04 13:06                                             ` Stephen Bates
  2016-12-04 13:23                                             ` Stephen Bates
  1 sibling, 0 replies; 126+ messages in thread
From: Stephen Bates @ 2016-12-04 13:06 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Logan Gunthorpe, Jason Gunthorpe, linux-kernel, linux-rdma,
	linux-nvdimm, christian.koenig, Suravee.Suthikulpanit@amd.com,
	John.Bridgman@amd.com, Alexander.Deucher@amd.com,
	Linux-media@vger.kernel.org, dan.j.williams, dri-devel,
	Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

>>
>> The NVMe fabrics stuff could probably make use of this. It's an
>> in-kernel system to allow remote access to an NVMe device over RDMA. So
>> they ought to be able to optimize their transfers by DMAing directly to
>> the NVMe's CMB -- no userspace interface would be required, but there
>> would need to be some kernel infrastructure.
>
> Yes, that's what I was thinking. The NVMe/f driver needs to map the CMB
> for RDMA. I guess if it used ZONE_DEVICE like in the iopmem patches it
> would be relatively easy to do.
>

Haggai, yes that was one of the use cases we considered when we put
together the patchset.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-04  7:42                                           ` Haggai Eran
  2016-12-04 13:06                                             ` Stephen Bates
@ 2016-12-04 13:23                                             ` Stephen Bates
  2016-12-05 17:18                                               ` Jason Gunthorpe
  1 sibling, 1 reply; 126+ messages in thread
From: Stephen Bates @ 2016-12-04 13:23 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Logan Gunthorpe, Jason Gunthorpe, linux-kernel, linux-rdma,
	linux-nvdimm, christian.koenig, Suravee.Suthikulpanit@amd.com,
	John.Bridgman@amd.com, Alexander.Deucher@amd.com,
	Linux-media@vger.kernel.org, dan.j.williams, dri-devel,
	Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

Hi All

This has been a great thread (thanks to Alex for kicking it off) and I
wanted to jump in and maybe try and put some summary around the
discussion. I also wanted to propose we include this as a topic for LFS/MM
because I think we need more discussion on the best way to add this
functionality to the kernel.

As far as I can tell the people looking for P2P support in the kernel fall
into two main camps:

1. Those who simply want to expose static BARs on PCIe devices that can be
used as the source/destination for DMAs from another PCIe device. This
group has no need for memory invalidation and are happy to use
physical/bus addresses and not virtual addresses.

2. Those who want to support devices that suffer from occasional memory
pressure and need to invalidate memory regions from time to time. This
camp also would like to use virtual addresses rather than physical ones to
allow for things like migration.

I am wondering if people agree with this assessment?

I think something like the iopmem patches Logan and I submitted recently
come close to addressing use case 1. There are some issues around
routability but based on feedback to date that does not seem to be a
show-stopper for an initial inclusion.

For use-case 2 it looks like there are several options and some of them
(like HMM) have been around for quite some time without gaining
acceptance. I think there needs to be more discussion on this usecase and
it could be some time before we get something upstreamable.

I for one, would really like to see use case 1 get addressed soon because
we have consumers for it coming soon in the form of CMBs for NVMe devices.

Long term I think Jason summed it up really well. CPU vendors will put
high-speed, open, switchable, coherent buses on their processors and all
these problems will vanish. But I ain't holding my breath for that to
happen ;-).

Cheers

Stephen

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-04 13:23                                             ` Stephen Bates
@ 2016-12-05 17:18                                               ` Jason Gunthorpe
  2016-12-05 17:40                                                 ` Dan Williams
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-12-05 17:18 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Haggai Eran, Logan Gunthorpe, linux-kernel, linux-rdma,
	linux-nvdimm, christian.koenig, Suravee.Suthikulpanit@amd.com,
	John.Bridgman@amd.com, Alexander.Deucher@amd.com,
	Linux-media@vger.kernel.org, dan.j.williams, dri-devel,
	Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Sun, Dec 04, 2016 at 07:23:00AM -0600, Stephen Bates wrote:
> Hi All
> 
> This has been a great thread (thanks to Alex for kicking it off) and I
> wanted to jump in and maybe try and put some summary around the
> discussion. I also wanted to propose we include this as a topic for LFS/MM
> because I think we need more discussion on the best way to add this
> functionality to the kernel.
> 
> As far as I can tell the people looking for P2P support in the kernel fall
> into two main camps:
> 
> 1. Those who simply want to expose static BARs on PCIe devices that can be
> used as the source/destination for DMAs from another PCIe device. This
> group has no need for memory invalidation and are happy to use
> physical/bus addresses and not virtual addresses.

I didn't think there was much on this topic except for the CMB
thing.. Even that is really a mapped kernel address..

> I think something like the iopmem patches Logan and I submitted recently
> come close to addressing use case 1. There are some issues around
> routability but based on feedback to date that does not seem to be a
> show-stopper for an initial inclusion.

If it is kernel-only with physical addresses we don't need a uAPI for
it, so I'm not sure #1 is at all related to iopmem.

Most people who want #1 probably can just mmap
/sys/../pci/../resourceX to get a user handle to it, or pass around
__iomem pointers in the kernel. This has been asked for before with
RDMA.
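
For the user-space side, a minimal sketch of that mmap (the BDF in the path
is just an example, and the BAR must actually be mappable for this to
succeed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource0";
	int fd = open(path, O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (bar == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	/* 'bar' is now a user handle to the first page of BAR0. */
	munmap(bar, 4096);
	close(fd);
	return 0;
}

Whether another device can then DMA to that mapping is exactly the open
question in this thread.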

I'm still not really clear what iopmem is for, or why DAX should ever
be involved in this..

> For use-case 2 it looks like there are several options and some of them
> (like HMM) have been around for quite some time without gaining
> acceptance. I think there needs to be more discussion on this usecase and
> it could be some time before we get something upstreamable.

AFAIK, hmm makes parts easier, but isn't directly addressing this
need..

I think you need to get ZONE_DEVICE accepted for non-cachable PCI BARs
as the first step.

From there it is pretty clear that the DMA API needs to be updated to
support that use, and work can be done to solve the various problems
there on the basis of using ZONE_DEVICE pages to figure out the path to
the PCI-E end points.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 17:18                                               ` Jason Gunthorpe
@ 2016-12-05 17:40                                                 ` Dan Williams
  2016-12-05 18:02                                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Dan Williams @ 2016-12-05 17:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Stephen Bates, Haggai Eran, Logan Gunthorpe, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Mon, Dec 5, 2016 at 9:18 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Sun, Dec 04, 2016 at 07:23:00AM -0600, Stephen Bates wrote:
>> Hi All
>>
>> This has been a great thread (thanks to Alex for kicking it off) and I
>> wanted to jump in and maybe try and put some summary around the
>> discussion. I also wanted to propose we include this as a topic for LFS/MM
>> because I think we need more discussion on the best way to add this
>> functionality to the kernel.
>>
>> As far as I can tell the people looking for P2P support in the kernel fall
>> into two main camps:
>>
>> 1. Those who simply want to expose static BARs on PCIe devices that can be
>> used as the source/destination for DMAs from another PCIe device. This
>> group has no need for memory invalidation and are happy to use
>> physical/bus addresses and not virtual addresses.
>
> I didn't think there was much on this topic except for the CMB
> thing.. Even that is really a mapped kernel address..
>
>> I think something like the iopmem patches Logan and I submitted recently
>> come close to addressing use case 1. There are some issues around
>> routability but based on feedback to date that does not seem to be a
>> show-stopper for an initial inclusion.
>
> If it is kernel-only with physical addresses we don't need a uAPI for
> it, so I'm not sure #1 is at all related to iopmem.
>
> Most people who want #1 probably can just mmap
> /sys/../pci/../resourceX to get a user handle to it, or pass around
> __iomem pointers in the kernel. This has been asked for before with
> RDMA.
>
> I'm still not really clear what iopmem is for, or why DAX should ever
> be involved in this..

Right, by default remap_pfn_range() does not establish DMA capable
mappings. You can think of iopmem as remap_pfn_range() converted to
use devm_memremap_pages(). Given the extra constraints of
devm_memremap_pages() it seems reasonable to have those DMA capable
mappings be optionally established via a separate driver.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 17:40                                                 ` Dan Williams
@ 2016-12-05 18:02                                                   ` Jason Gunthorpe
  2016-12-05 18:08                                                     ` Dan Williams
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-12-05 18:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Stephen Bates, Haggai Eran, Logan Gunthorpe, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Mon, Dec 05, 2016 at 09:40:38AM -0800, Dan Williams wrote:

> > If it is kernel-only with physical addresses we don't need a uAPI for
> > it, so I'm not sure #1 is at all related to iopmem.
> >
> > Most people who want #1 probably can just mmap
> > /sys/../pci/../resourceX to get a user handle to it, or pass around
> > __iomem pointers in the kernel. This has been asked for before with
> > RDMA.
> >
> > I'm still not really clear what iopmem is for, or why DAX should ever
> > be involved in this..
> 
> Right, by default remap_pfn_range() does not establish DMA capable
> mappings. You can think of iopmem as remap_pfn_range() converted to
> use devm_memremap_pages(). Given the extra constraints of
> devm_memremap_pages() it seems reasonable to have those DMA capable
> mappings be optionally established via a separate driver.

Except the iopmem driver claims the PCI ID, and presents a block
interface which is really *NOT* what people who have asked for this in
the past have wanted. IIRC it was embedded stuff eg RDMA video
directly out of a capture card or a similar kind of thinking.

It is a good point about devm_memremap_pages limitations, but maybe
that just says to create a /sys/.../resource_dmableX ?

Or is there some reason why people want a filesystem on top of BAR
memory? That does not seem to have been covered yet..

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 18:02                                                   ` Jason Gunthorpe
@ 2016-12-05 18:08                                                     ` Dan Williams
  2016-12-05 18:39                                                       ` Logan Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Dan Williams @ 2016-12-05 18:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Stephen Bates, Haggai Eran, Logan Gunthorpe, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Mon, Dec 5, 2016 at 10:02 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Dec 05, 2016 at 09:40:38AM -0800, Dan Williams wrote:
>
>> > If it is kernel-only with physical addresses we don't need a uAPI for
>> > it, so I'm not sure #1 is at all related to iopmem.
>> >
>> > Most people who want #1 probably can just mmap
>> > /sys/../pci/../resourceX to get a user handle to it, or pass around
>> > __iomem pointers in the kernel. This has been asked for before with
>> > RDMA.
>> >
>> > I'm still not really clear what iopmem is for, or why DAX should ever
>> > be involved in this..
>>
>> Right, by default remap_pfn_range() does not establish DMA capable
>> mappings. You can think of iopmem as remap_pfn_range() converted to
>> use devm_memremap_pages(). Given the extra constraints of
>> devm_memremap_pages() it seems reasonable to have those DMA capable
>> mappings be optionally established via a separate driver.
>
> Except the iopmem driver claims the PCI ID, and presents a block
> interface which is really *NOT* what people who have asked for this in
> the past have wanted. IIRC it was embedded stuff eg RDMA video
> directly out of a capture card or a similar kind of thinking.
>
> It is a good point about devm_memremap_pages limitations, but maybe
> that just says to create a /sys/.../resource_dmableX ?
>
> Or is there some reason why people want a filesystem on top of BAR
> memory? That does not seem to have been covered yet..
>

I've already recommended that iopmem not be a block device and instead
be a device-dax instance. I also don't think it should claim the PCI
ID, rather the driver that wants to map one of its bars this way can
register the memory region with the device-dax core.

I'm not sure there are enough device drivers that want to do this to
have it be a generic /sys/.../resource_dmableX capability. It still
seems to be an exotic one-off type of configuration.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 18:08                                                     ` Dan Williams
@ 2016-12-05 18:39                                                       ` Logan Gunthorpe
  2016-12-05 18:48                                                         ` Dan Williams
  0 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-12-05 18:39 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe
  Cc: Stephen Bates, Haggai Eran, linux-kernel, linux-rdma,
	linux-nvdimm, christian.koenig, Suravee.Suthikulpanit@amd.com,
	John.Bridgman@amd.com, Alexander.Deucher@amd.com,
	Linux-media@vger.kernel.org, dri-devel, Max Gurtovoy, linux-pci,
	serguei.sagalovitch, Paul.Blinzer@amd.com,
	Felix.Kuehling@amd.com, ben.sander

On 05/12/16 11:08 AM, Dan Williams wrote:
> I've already recommended that iopmem not be a block device and instead
> be a device-dax instance. I also don't think it should claim the PCI
> ID, rather the driver that wants to map one of its bars this way can
> register the memory region with the device-dax core.
>
> I'm not sure there are enough device drivers that want to do this to
> have it be a generic /sys/.../resource_dmableX capability. It still
> seems to be an exotic one-off type of configuration.

Yes, this is essentially my thinking. Except I think the userspace 
interface should really depend on the device itself. Device dax is a 
good  choice for many and I agree the block device approach wouldn't be 
ideal.

Specifically for NVME CMB: I think it would make a lot of sense to just 
hand out these mappings with an mmap call on /dev/nvmeX. I expect CMB 
buffers would be volatile and thus you wouldn't need to keep track of 
where in the BAR the region came from. Thus, the mmap call would just be 
an allocator from BAR memory. If device-dax were used, userspace would 
need to lookup which device-dax instance corresponds to which nvme drive.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 18:39                                                       ` Logan Gunthorpe
@ 2016-12-05 18:48                                                         ` Dan Williams
  2016-12-05 19:14                                                           ` Jason Gunthorpe
  2016-12-06  8:06                                                           ` Stephen Bates
  0 siblings, 2 replies; 126+ messages in thread
From: Dan Williams @ 2016-12-05 18:48 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Stephen Bates, Haggai Eran, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Mon, Dec 5, 2016 at 10:39 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
> On 05/12/16 11:08 AM, Dan Williams wrote:
>>
>> I've already recommended that iopmem not be a block device and instead
>> be a device-dax instance. I also don't think it should claim the PCI
>> ID, rather the driver that wants to map one of its bars this way can
>> register the memory region with the device-dax core.
>>
>> I'm not sure there are enough device drivers that want to do this to
>> have it be a generic /sys/.../resource_dmableX capability. It still
>> seems to be an exotic one-off type of configuration.
>
>
> Yes, this is essentially my thinking. Except I think the userspace interface
> should really depend on the device itself. Device dax is a good  choice for
> many and I agree the block device approach wouldn't be ideal.
>
> Specifically for NVME CMB: I think it would make a lot of sense to just hand
> out these mappings with an mmap call on /dev/nvmeX. I expect CMB buffers
> would be volatile and thus you wouldn't need to keep track of where in the
> BAR the region came from. Thus, the mmap call would just be an allocator
> from BAR memory. If device-dax were used, userspace would need to lookup
> which device-dax instance corresponds to which nvme drive.
>

I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
to accomplish in sysfs through /sys/dev/char to find the sysfs path of
the device-dax instance under the nvme device, or if you already have
the nvme sysfs path the dax instance(s) will appear under the "dax"
sub-directory.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 18:48                                                         ` Dan Williams
@ 2016-12-05 19:14                                                           ` Jason Gunthorpe
  2016-12-05 19:27                                                             ` Logan Gunthorpe
  2016-12-06  8:06                                                           ` Stephen Bates
  1 sibling, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-12-05 19:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Stephen Bates, Haggai Eran, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Mon, Dec 05, 2016 at 10:48:58AM -0800, Dan Williams wrote:
> On Mon, Dec 5, 2016 at 10:39 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
> > On 05/12/16 11:08 AM, Dan Williams wrote:
> >>
> >> I've already recommended that iopmem not be a block device and instead
> >> be a device-dax instance. I also don't think it should claim the PCI
> >> ID, rather the driver that wants to map one of its bars this way can
> >> register the memory region with the device-dax core.
> >>
> >> I'm not sure there are enough device drivers that want to do this to
> >> have it be a generic /sys/.../resource_dmableX capability. It still
> >> seems to be an exotic one-off type of configuration.
> >
> >
> > Yes, this is essentially my thinking. Except I think the userspace interface
> > should really depend on the device itself. Device dax is a good  choice for
> > many and I agree the block device approach wouldn't be ideal.
> >
> > Specifically for NVME CMB: I think it would make a lot of sense to just hand
> > out these mappings with an mmap call on /dev/nvmeX. I expect CMB buffers
> > would be volatile and thus you wouldn't need to keep track of where in the
> > BAR the region came from. Thus, the mmap call would just be an allocator
> > from BAR memory. If device-dax were used, userspace would need to lookup
> > which device-dax instance corresponds to which nvme drive.
> 
> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> to accomplish in sysfs through /sys/dev/char to find the sysfs path
> of

But CMB sounds much more like the GPU case where there is a
specialized allocator handing out the BAR to consumers, so I'm not
sure a general purpose chardev makes a lot of sense?

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 19:14                                                           ` Jason Gunthorpe
@ 2016-12-05 19:27                                                             ` Logan Gunthorpe
  2016-12-05 19:46                                                               ` Jason Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-12-05 19:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Stephen Bates, Haggai Eran, linux-kernel, linux-rdma,
	linux-nvdimm, christian.koenig, Suravee.Suthikulpanit@amd.com,
	John.Bridgman@amd.com, Alexander.Deucher@amd.com,
	Linux-media@vger.kernel.org, dri-devel, Max Gurtovoy, linux-pci,
	serguei.sagalovitch, Paul.Blinzer@amd.com,
	Felix.Kuehling@amd.com, ben.sander



On 05/12/16 12:14 PM, Jason Gunthorpe wrote:
> But CMB sounds much more like the GPU case where there is a
> specialized allocator handing out the BAR to consumers, so I'm not
> sure a general purpose chardev makes a lot of sense?

I don't think it will ever need to be as complicated as the GPU case. 
There will probably only ever be a relatively small amount of memory 
behind the CMB and really the only users are those doing P2P work. Thus 
the specialized allocator could be pretty simple and I expect it would 
be fine to just return -ENOMEM if there is not enough memory.

Also, if it was implemented this way and a more complicated allocator were 
needed, that could easily be added later, since the userspace interface is 
just an mmap call to obtain a buffer.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 19:27                                                             ` Logan Gunthorpe
@ 2016-12-05 19:46                                                               ` Jason Gunthorpe
  2016-12-05 19:59                                                                 ` Logan Gunthorpe
  2016-12-05 20:06                                                                 ` Christoph Hellwig
  0 siblings, 2 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-12-05 19:46 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Stephen Bates, Haggai Eran, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Mon, Dec 05, 2016 at 12:27:20PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 05/12/16 12:14 PM, Jason Gunthorpe wrote:
> >But CMB sounds much more like the GPU case where there is a
> >specialized allocator handing out the BAR to consumers, so I'm not
> >sure a general purpose chardev makes a lot of sense?
> 
> I don't think it will ever need to be as complicated as the GPU case. There
> will probably only ever be a relatively small amount of memory behind the
> CMB and really the only users are those doing P2P work. Thus the specialized
> allocator could be pretty simple and I expect it would be fine to just
> return -ENOMEM if there is not enough memory.

NVMe might have to deal with pci-e hot-unplug, which is a similar
problem-class to the GPU case..

In any event the allocator still needs to track which regions are in
use and be able to hook 'free' from userspace. That does suggest it
should be integrated into the nvme driver and not a bolt on driver..

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 19:46                                                               ` Jason Gunthorpe
@ 2016-12-05 19:59                                                                 ` Logan Gunthorpe
  2016-12-05 20:06                                                                 ` Christoph Hellwig
  1 sibling, 0 replies; 126+ messages in thread
From: Logan Gunthorpe @ 2016-12-05 19:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Stephen Bates, Haggai Eran, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander



On 05/12/16 12:46 PM, Jason Gunthorpe wrote:
> NVMe might have to deal with pci-e hot-unplug, which is a similar
> problem-class to the GPU case..

Sure, but if the NVMe device gets hot-unplugged it means that all the 
CMB mappings are useless and need to be torn down. This probably means 
killing any process that has mappings open.

> In any event the allocator still needs to track which regions are in
> use and be able to hook 'free' from userspace. That does suggest it
> should be integrated into the nvme driver and not a bolt on driver..

Yup, that's correct. And yes, I've never suggested this to be a bolt-on 
driver -- I always expected it to get integrated into the nvme 
driver. (iopmem was not meant for this.)

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 19:46                                                               ` Jason Gunthorpe
  2016-12-05 19:59                                                                 ` Logan Gunthorpe
@ 2016-12-05 20:06                                                                 ` Christoph Hellwig
  1 sibling, 0 replies; 126+ messages in thread
From: Christoph Hellwig @ 2016-12-05 20:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Dan Williams, Stephen Bates, Haggai Eran,
	linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Mon, Dec 05, 2016 at 12:46:14PM -0700, Jason Gunthorpe wrote:
> In any event the allocator still needs to track which regions are in
> use and be able to hook 'free' from userspace. That does suggest it
> should be integrated into the nvme driver and not a bolt on driver..

Two totally different use cases:

 - a card that exposes directly byte addressable storage as a PCI-e
   bar.  Think of it as an nvdimm on a PCI-e card.  That's the iopmem
   case.
 - the NVMe CMB which exposes a byte addressable indirection buffer for
   I/O, but does not actually provide byte addressable persistent
   storage.  This is something that needs to be added to the NVMe driver
   (and the block layer for the abstraction probably).

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-05 18:48                                                         ` Dan Williams
  2016-12-05 19:14                                                           ` Jason Gunthorpe
@ 2016-12-06  8:06                                                           ` Stephen Bates
  2016-12-06 16:38                                                             ` Jason Gunthorpe
  1 sibling, 1 reply; 126+ messages in thread
From: Stephen Bates @ 2016-12-06  8:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Jason Gunthorpe, Haggai Eran, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

>>> I've already recommended that iopmem not be a block device and
>>> instead be a device-dax instance. I also don't think it should claim
>>> the PCI ID, rather the driver that wants to map one of its bars this
>>> way can register the memory region with the device-dax core.
>>>
>>> I'm not sure there are enough device drivers that want to do this to
>>> have it be a generic /sys/.../resource_dmableX capability. It still
>>> seems to be an exotic one-off type of configuration.
>>
>>
>> Yes, this is essentially my thinking. Except I think the userspace
>> interface should really depend on the device itself. Device dax is a
>> good  choice for many and I agree the block device approach wouldn't be
>> ideal.

I tend to agree here. The block device interface has seen quite a bit of
resistance and /dev/dax looks like a better approach for most. We can look
at doing it that way in v2.

>>
>> Specifically for NVME CMB: I think it would make a lot of sense to just
>> hand out these mappings with an mmap call on /dev/nvmeX. I expect CMB
>> buffers would be volatile and thus you wouldn't need to keep track of
>> where in the BAR the region came from. Thus, the mmap call would just be
>> an allocator from BAR memory. If device-dax were used, userspace would
>> need to lookup which device-dax instance corresponds to which nvme
>> drive.
>>
>
> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> device-dax instance under the nvme device, or if you already have the nvme
> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
>

Personally I think mapping the dax resource in the sysfs tree is a nice
way to do this and a bit more intuitive than mapping a /dev/nvmeX.
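
For reference, the lookup described above can be illustrated with a small
sketch like the one below; it only relies on /sys/dev/char existing, and
the exact layout of any "dax" sub-directory under the nvme device is an
assumption on my part:

/* Sketch: resolve a char device node (e.g. /dev/nvme0) to its sysfs
 * directory via /sys/dev/char; per the description above, any device-dax
 * instances would then show up under a "dax" sub-directory there. */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int sysfs_path_of(const char *chardev, char *out, size_t outlen)
{
	struct stat st;
	char link[64];
	ssize_t n;

	if (stat(chardev, &st) < 0 || !S_ISCHR(st.st_mode))
		return -1;
	snprintf(link, sizeof(link), "/sys/dev/char/%u:%u",
		 major(st.st_rdev), minor(st.st_rdev));

	n = readlink(link, out, outlen - 1);
	if (n < 0)
		return -1;
	out[n] = '\0';	/* relative path to the device's sysfs directory */
	return 0;
}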

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-06  8:06                                                           ` Stephen Bates
@ 2016-12-06 16:38                                                             ` Jason Gunthorpe
  2016-12-06 16:51                                                               ` Logan Gunthorpe
  2016-12-06 17:12                                                               ` Christoph Hellwig
  0 siblings, 2 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2016-12-06 16:38 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Dan Williams, Logan Gunthorpe, Haggai Eran, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

> > I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> > to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> > device-dax instance under the nvme device, or if you already have the nvme
> > sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> 
> Personally I think mapping the dax resource in the sysfs tree is a nice
> way to do this and a bit more intuitive than mapping a /dev/nvmeX.

It is still not at all clear to me what userspace is supposed to do
with this on nvme.. How is the CMB usable from userspace?

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-06 16:38                                                             ` Jason Gunthorpe
@ 2016-12-06 16:51                                                               ` Logan Gunthorpe
  2016-12-06 17:28                                                                 ` Jason Gunthorpe
  2016-12-06 17:12                                                               ` Christoph Hellwig
  1 sibling, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-12-06 16:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Stephen Bates
  Cc: Dan Williams, Haggai Eran, linux-kernel, linux-rdma,
	linux-nvdimm, christian.koenig, Suravee.Suthikulpanit@amd.com,
	John.Bridgman@amd.com, Alexander.Deucher@amd.com,
	Linux-media@vger.kernel.org, dri-devel, Max Gurtovoy, linux-pci,
	serguei.sagalovitch, Paul.Blinzer@amd.com,
	Felix.Kuehling@amd.com, ben.sander

Hey,

On 06/12/16 09:38 AM, Jason Gunthorpe wrote:
>>> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
>>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
>>> device-dax instance under the nvme device, or if you already have the nvme
>>> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
>>
>> Personally I think mapping the dax resource in the sysfs tree is a nice
>> way to do this and a bit more intuitive than mapping a /dev/nvmeX.
> 
> It is still not at all clear to me what userpsace is supposed to do
> with this on nvme.. How is the CMB usable from userspace?

The flow is pretty simple. For example to write to NVMe from an RDMA device:

1) Obtain a chunk of the CMB to use as a buffer (either by mmaping
/dev/nvmeX, the device-dax char device, or through a block layer interface,
which sounds like a good suggestion from Christoph, but I'm not really
sure how it would look).

2) Create an MR with the buffer and use an RDMA function to fill it with
data from a remote host. This will cause the RDMA hardware to write
directly to the memory in the NVMe card.

3) Using O_DIRECT, write the buffer to a file on a filesystem on the NVMe
device. When the address reaches the hardware, the NVMe controller will
recognize it as local memory and copy the data directly there.

Thus we are able to transfer data to any file on an NVMe device without
going through system memory. This has benefits on systems with lots of
activity in system memory but step 3 is likely to be slowish due to the
need to pin/unpin the memory for every transaction.
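
As a rough userspace sketch of that flow (the /dev/nvme0-cmb node and its
mmap semantics are made up for illustration; the verbs calls are standard
libibverbs, and error handling is omitted):

/* Hypothetical sketch of the three steps above; the CMB device node is
 * an assumption, not an existing interface. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

int cmb_flow(struct ibv_pd *pd, const char *path, size_t len)
{
	/* 1) carve a buffer out of the CMB by mmaping the (made up) char dev */
	int cmb_fd = open("/dev/nvme0-cmb", O_RDWR);
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 cmb_fd, 0);

	/* 2) register it as an RDMA MR; the remote peer then RDMA-writes into
	 * it, which lands directly in the NVMe card's BAR memory */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE);

	/* ... the remote host performs its RDMA write here ... */

	/* 3) O_DIRECT write of the CMB-backed buffer to a file on the NVMe
	 * device; the controller recognizes the address as local memory */
	int file_fd = open(path, O_WRONLY | O_DIRECT | O_CREAT, 0644);
	ssize_t ret = pwrite(file_fd, buf, len, 0);

	ibv_dereg_mr(mr);
	munmap(buf, len);
	close(file_fd);
	close(cmb_fd);
	return ret < 0 ? -1 : 0;
}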

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-06 16:38                                                             ` Jason Gunthorpe
  2016-12-06 16:51                                                               ` Logan Gunthorpe
@ 2016-12-06 17:12                                                               ` Christoph Hellwig
  1 sibling, 0 replies; 126+ messages in thread
From: Christoph Hellwig @ 2016-12-06 17:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Stephen Bates, Dan Williams, Logan Gunthorpe, Haggai Eran,
	linux-kernel, linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Tue, Dec 06, 2016 at 09:38:50AM -0700, Jason Gunthorpe wrote:
> > > I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> > > to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> > > device-dax instance under the nvme device, or if you already have the nvme
> > > sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> > 
> > Personally I think mapping the dax resource in the sysfs tree is a nice
> > way to do this and a bit more intuitive than mapping a /dev/nvmeX.
> 
> It is still not at all clear to me what userpsace is supposed to do
> with this on nvme.. How is the CMB usable from userspace?

I don't think trying to expose it to userspace makes any sense.
Exposing it to in-kernel storage targets on the other hand makes a lot
of sense.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-06 16:51                                                               ` Logan Gunthorpe
@ 2016-12-06 17:28                                                                 ` Jason Gunthorpe
  2016-12-06 21:47                                                                   ` Logan Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2016-12-06 17:28 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Stephen Bates, Dan Williams, Haggai Eran, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Tue, Dec 06, 2016 at 09:51:15AM -0700, Logan Gunthorpe wrote:
> Hey,
> 
> On 06/12/16 09:38 AM, Jason Gunthorpe wrote:
> >>> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> >>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> >>> device-dax instance under the nvme device, or if you already have the nvme
> >>> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> >>
> >> Personally I think mapping the dax resource in the sysfs tree is a nice
> >> way to do this and a bit more intuitive than mapping a /dev/nvmeX.
> > 
> > It is still not at all clear to me what userpsace is supposed to do
> > with this on nvme.. How is the CMB usable from userspace?
> 
> The flow is pretty simple. For example to write to NVMe from an RDMA device:
> 
> 1) Obtain a chunk of the CMB to use as a buffer(either by mmaping
> /dev/nvmx, the device dax char device or through a block layer interface
> (which sounds like a good suggestion from Christoph, but I'm not really
> sure how it would look).

Okay, so clearly this needs a kernel side NVMe specific allocator
and locking so users don't step on each other..

Or as Christoph says some kind of general mechanism to get these
bounce buffers..

> 2) Create an MR with the buffer and use an RDMA function to fill it with
> data from a remote host. This will cause the RDMA hardware to write
> directly to the memory in the NVMe card.
> 
> 3) Using O_DIRECT, write the buffer to a file on the NVMe filesystem.
> When the address reaches hardware the NVMe will recognize it as local
> memory and copy it directly there.

Ah, I see.

As a first draft I'd stick with some kind of API built into the
/dev/nvmeX that backs the filesystem. The user app would fstat the
target file, open /dev/block/MAJOR(st_dev):MINOR(st_dev), do some
ioctl to get a CMB mmap, and then proceed from there..
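
Something like this minimal sketch, where the ioctl number and its argument
struct are made-up placeholders for whatever the nvme driver would actually
define:

/* Hypothetical sketch of the draft flow above; NVME_IOCTL_CMB_ALLOC and
 * the request struct are placeholders, not a real interface. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

void *map_cmb_for_file(const char *path, size_t len)
{
	struct stat st;
	char blkdev[64];
	struct { uint64_t len; uint64_t offset; } req = { .len = len };

	/* stat the target file to learn which block device backs it */
	if (stat(path, &st) < 0)
		return NULL;
	snprintf(blkdev, sizeof(blkdev), "/dev/block/%u:%u",
		 major(st.st_dev), minor(st.st_dev));

	int fd = open(blkdev, O_RDWR);
	if (fd < 0)
		return NULL;

	/* made-up ioctl asking the driver to reserve 'len' bytes of CMB */
	if (ioctl(fd, /* NVME_IOCTL_CMB_ALLOC */ 0, &req) < 0)
		return NULL;

	/* mmap the reserved chunk of the CMB through the block device */
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, req.offset);
}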

When that is all working kernel-side, it would make sense to look at a
more general mechanism that could be used unprivileged??

> Thus we are able to transfer data to any file on an NVMe device without
> going through system memory. This has benefits on systems with lots of
> activity in system memory but step 3 is likely to be slowish due to the
> need to pin/unpin the memory for every transaction.

This is similar to the GPU issues too.. On NVMe you don't need to pin
the pages, you just need to lock that VMA so it doesn't get freed from
the NVMe CMB allocator while the IO is running...

Probably in the long run the get_user_pages is going to have to be
pushed down into drivers.. Future MMU coherent IO hardware also does
not need the pinning or other overheads.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-06 17:28                                                                 ` Jason Gunthorpe
@ 2016-12-06 21:47                                                                   ` Logan Gunthorpe
  2016-12-06 22:02                                                                     ` Dan Williams
  0 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2016-12-06 21:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Stephen Bates, Dan Williams, Haggai Eran, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

Hey,

> Okay, so clearly this needs a kernel side NVMe specific allocator
> and locking so users don't step on each other..

Yup, ideally. That's why device dax isn't ideal for this application: it
doesn't provide any way to prevent users from stepping on each other.

> Or as Christoph says some kind of general mechanism to get these
> bounce buffers..

Yeah, I imagine a general allocate from BAR/region system would be very
useful.

> Ah, I see.
> 
> As a first draft I'd stick with some kind of API built into the
> /dev/nvmeX that backs the filesystem. The user app would fstat the
> target file, open /dev/block/MAJOR(st_dev):MINOR(st_dev), do some
> ioctl to get a CMB mmap, and then proceed from there..
> 
> When that is all working kernel-side, it would make sense to look at a
> more general mechanism that could be used unprivileged??

That makes a lot of sense to me. I suggested mmapping the char device
because it's really easy, but I can see that an ioctl on the block
device does seem more general and device agnostic.

> This is similar to the GPU issues too.. On NVMe you don't need to pin
> the pages, you just need to lock that VMA so it doesn't get freed from
> the NVMe CMB allocator while the IO is running...
> Probably in the long run the get_user_pages is going to have to be
> pushed down into drivers.. Future MMU coherent IO hardware also does
> not need the pinning or other overheads.

Yup. Yup.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-12-06 21:47                                                                   ` Logan Gunthorpe
@ 2016-12-06 22:02                                                                     ` Dan Williams
  0 siblings, 0 replies; 126+ messages in thread
From: Dan Williams @ 2016-12-06 22:02 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Stephen Bates, Haggai Eran, linux-kernel,
	linux-rdma, linux-nvdimm, christian.koenig,
	Suravee.Suthikulpanit@amd.com, John.Bridgman@amd.com,
	Alexander.Deucher@amd.com, Linux-media@vger.kernel.org,
	dri-devel, Max Gurtovoy, linux-pci, serguei.sagalovitch,
	Paul.Blinzer@amd.com, Felix.Kuehling@amd.com, ben.sander

On Tue, Dec 6, 2016 at 1:47 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hey,
>
>> Okay, so clearly this needs a kernel side NVMe specific allocator
>> and locking so users don't step on each other..
>
> Yup, ideally. That's why device dax isn't ideal for this application: it
> doesn't provide any way to prevent users from stepping on each other.

On this particular point I'm in the process of posting patches that
allow device-dax sub-division, so you could carve up a bar into
multiple devices of various sizes.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-21 20:36 Enabling peer to peer device transactions for PCIe devices Deucher, Alexander
  2016-11-22 18:11 ` Dan Williams
@ 2017-01-05 18:39 ` Jerome Glisse
  2017-01-05 19:01   ` Jason Gunthorpe
  2017-10-20 12:36 ` Ludwig Petrosyan
  2 siblings, 1 reply; 126+ messages in thread
From: Jerome Glisse @ 2017-01-05 18:39 UTC (permalink / raw)
  To: Deucher, Alexander
  Cc: 'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, jgunthorpe,
	david1.zhou, qiang.yu

Sorry to revive this thread but it fell through my filters and I
missed it. I have been going through it and I think the discussion
has been hindered by the fact that distinct problems were merged when
they should be addressed separately.

First, for peer-to-peer we need to be clear on how this happens. There
are two cases here:
  1) peer-to-peer because of a userspace-specific API like NVidia GPU
    Direct (AMD is pushing its own similar API; I just can't remember
    the marketing name). This does not happen through a vma; it happens
    through specific device driver calls going through device-specific
    ioctls on both sides (GPU and RDMA). So both kernel drivers are aware
    of each other.
  2) peer-to-peer because the RDMA device is trying to access a regular
    vma (ie nothing special: either private anonymous or shared memory,
    or an mmap of a regular file, not a device file).

For 1) there is no need to over-complicate things. The device drivers must
have a back-channel between them and must be able to invalidate their
respective mappings (ie the GPU must be able to ask the RDMA device to
kill/stop its MR).
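
A minimal sketch of what such a back-channel could look like (entirely
hypothetical; the names and layout here are invented to illustrate the
idea, not a proposed interface):

/* Hypothetical back-channel: the importer (RDMA) registers an invalidate
 * callback so the exporter (GPU) can revoke the mapping at any time,
 * which is what forces the RDMA driver to kill/stop its MR. */
#include <linux/types.h>

struct p2p_peer_ops {
	/* called by the exporting driver when the memory moves or is freed;
	 * the importer must quiesce DMA and tear down its MR before returning */
	void (*invalidate)(void *importer_priv);
};

struct p2p_mapping {
	const struct p2p_peer_ops *ops;
	void *importer_priv;
	dma_addr_t addr;	/* bus address of the exported region */
	size_t len;
};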

So the remaining issue for 1) is how to enable effective peer-to-peer
mappings given that it might not work reliably on all platforms. Here
Alex was listing the existing proposals:
  A P2P DMA: DMA-API/PCI map_peer_resource support for peer-to-peer
    http://www.spinics.net/lists/linux-pci/msg44560.html
  B ZONE_DEVICE IO: Direct I/O and DMA for persistent memory
    https://lwn.net/Articles/672457/
  C DMA-BUF: RDMA subsystem DMA-BUF support
    http://www.spinics.net/lists/linux-rdma/msg38748.html
  D iopmem: A block device for PCIe memory
    https://lwn.net/Articles/703895/
  E HMM (not interesting for case 1)
  F Something new

Of the above, D is ill suited for GPUs as we do not want to pin GPU
memory, and D is designed around long-lived objects that do not move.
Also I do not think that exposing a device PCIe bar through a new
/dev/somefilename is a good idea for GPUs. So I think this should
be discarded.

HMM should be discarded in respect of case 1 too. It is useful for
case 2. I don't think dma-buf is the right path either.

So I think only A and B make sense. Now for use case 1 I think A is
the best solution. There is no need to have struct pages, and it requires
explicit knowledge in the device driver that it is mapping another
device's memory, which is a given in use case 1.


If we look at case 2 the situation is a bit more complex. Here RDMA
is just trying to access a regular VMA, but it might happen that
some memory inside that VMA resides in device memory. When that
happens we would like to avoid moving that memory back to system
memory, assuming that a peer mapping is doable.

Use case 2 assumes that the GPU is on a platform with CAPI or
CCIX (or something similar), in which case it is easy as device
memory will have struct pages, is always accessible by the CPU, and
is transparent for device-to-device access (AFAICT).

So we are left with platforms that do not have proper support for
device memory (ie the CPU cannot access it the same as DDR, or has
only limited access), which applies to x86 for the foreseeable future.

This is the problem HMM addresses, allowing device memory to be used
transparently inside a process even if direct CPU access is not
permitted. I plan to support peer-to-peer with HMM because
it is an important use case. The idea is to have the device driver
fault against the ZONE_DEVICE page and communicate through a common API
to establish the mapping. HMM will only handle keeping track of device-
to-device mappings and allowing such mappings to be invalidated at any
time so that memory can be migrated.

I do not intend to solve the IOMMU side of the problem, or the
PCI hierarchy issue where you can't do peer-to-peer between devices
across some PCI bridges. I believe this is an orthogonal problem
and that it is best solved inside the DMA API, ie with solution A.


I do not think we should try to solve all the problems with a
common solution. They are too disparate in capabilities (what
the hardware can and can't do).

From my point of view there are a few takeaways:
  - a device should only access regular vmas
  - a device should never try to access a vma that points to another
    device (an mmap of any file in /dev)
  - peer-to-peer access through a dedicated userspace API must
    involve a dedicated API between the kernel drivers taking part in
    the peer-to-peer access
  - peer-to-peer on a regular vma must involve a common API for
    drivers to interact so no driver can block the other


So I think the DMA-API proposal is the one to pursue, and the other
problems relating to handling GPU memory and how to use it are a
different kind of problem, one with either a hardware solution
(CAPI, CCIX, ...) or a software solution (HMM so far).

I don't think we should conflate the 2 problems into one. Anyway,
I think this would be worth discussing face to face
with the interested parties to flesh out a solution (it could be at
LSF/MM or in another forum).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-05 18:39 ` Jerome Glisse
@ 2017-01-05 19:01   ` Jason Gunthorpe
  2017-01-05 19:54     ` Jerome Glisse
  2017-01-06 15:08     ` Henrique Almeida
  0 siblings, 2 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2017-01-05 19:01 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Deucher, Alexander, 'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, david1.zhou,
	qiang.yu

On Thu, Jan 05, 2017 at 01:39:29PM -0500, Jerome Glisse wrote:

>   1) peer-to-peer because of userspace specific API like NVidia GPU
>     direct (AMD is pushing its own similar API i just can't remember
>     marketing name). This does not happen through a vma, this happens
>     through specific device driver call going through device specific
>     ioctl on both side (GPU and RDMA). So both kernel driver are aware
>     of each others.

Today you can only do user-initiated RDMA operations in conjunction
with a VMA.

We'd need a really big and strong reason to create an entirely new
non-VMA based memory handle scheme for RDMA.

So my inclination is to just completely push back on this idea. You
need a VMA to do RDMA.

GPUs need to create VMAs for the memory they want to RDMA from, even
if the VMA handle just causes SIGBUS for any CPU access.
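
To illustrate, a minimal kernel-side sketch of such a handle-only VMA
(names are made up; the fault handler signature shown is the pre-4.11 one
and varies across kernel versions):

/* Sketch: a driver mmap that only exists as a registration handle for the
 * GPU object; any CPU touch of the mapping gets SIGBUS. */
#include <linux/mm.h>
#include <linux/fs.h>

static int gpuobj_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	return VM_FAULT_SIGBUS;		/* no CPU access to the GPU object */
}

static const struct vm_operations_struct gpuobj_vm_ops = {
	.fault = gpuobj_vma_fault,
};

static int gpuobj_mmap(struct file *filp, struct vm_area_struct *vma)
{
	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
	vma->vm_ops = &gpuobj_vm_ops;
	return 0;
}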

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-05 19:01   ` Jason Gunthorpe
@ 2017-01-05 19:54     ` Jerome Glisse
  2017-01-05 20:07       ` Jason Gunthorpe
  2017-01-06 15:08     ` Henrique Almeida
  1 sibling, 1 reply; 126+ messages in thread
From: Jerome Glisse @ 2017-01-05 19:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, david1.zhou,
	qiang.yu

On Thu, Jan 05, 2017 at 12:01:13PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 01:39:29PM -0500, Jerome Glisse wrote:
> 
> >   1) peer-to-peer because of userspace specific API like NVidia GPU
> >     direct (AMD is pushing its own similar API i just can't remember
> >     marketing name). This does not happen through a vma, this happens
> >     through specific device driver call going through device specific
> >     ioctl on both side (GPU and RDMA). So both kernel driver are aware
> >     of each others.
> 
> Today you can only do user-initiated RDMA operations in conjection
> with a VMA.
> 
> We'd need a really big and strong reason to create an entirely new
> non-VMA based memory handle scheme for RDMA.
> 
> So my inclination is to just completely push back on this idea. You
> need a VMA to do RMA.
> 
> GPUs need to create VMAs for the memory they want to RDMA from, even
> if the VMA handle just causes SIGBUS for any CPU access.

Mellanox and NVidia support peer-to-peer with what they market as
GPUDirect. It only works without the IOMMU. It is probably not upstream:

https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg21402.html

I thought it was, but it seems it requires an out-of-tree driver to work.

Whether there is a vma or not isn't important to the issue anyway. If
you want to enforce a VMA rule for RDMA, that is an RDMA-specific discussion
in which I don't want to be involved; it is not my turf :)

What matters is the back-channel API between peer-to-peer devices. As
the above patchset points out, for GPUs we need to be able to invalidate
a mapping at any point in time. Pinning is not something we want to
live with.

So the VMA consideration does not change what I was saying; there are
2 cases:
  1) device vma (might be restricted to a specific userspace API)
  2) regular vma (!VM_MIXED and no special pte entry)

For 1) you need a back-channel; it can be per device driver, or we can
agree on some common API that can be added to vm_operations_struct.

For 2) the expectation is that you will have valid struct pages, but you
still need special handling at the DMA API level.

In 1) the peer-to-peer mapping is tracked at the vma level and mediated
there. For 2) it is per page and it is mediated at that level.

In both cases, once you have set up the mapping you need to handle the
IOMMU and the PCI bridge restrictions that might apply, and I believe
that the DMA API is the place where we want to solve that second side
of the problem.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-05 19:54     ` Jerome Glisse
@ 2017-01-05 20:07       ` Jason Gunthorpe
  2017-01-05 20:19         ` Jerome Glisse
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2017-01-05 20:07 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, david1.zhou,
	qiang.yu

On Thu, Jan 05, 2017 at 02:54:24PM -0500, Jerome Glisse wrote:

> Mellanox and NVidia support peer to peer with what they market a
> GPUDirect. It only works without IOMMU. It is probably not upstream :
> 
> https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg21402.html
> 
> I thought it was but it seems it require an out of tree driver to work.

Right, it is out of tree and not under consideration for mainline.

> Wether there is a vma or not isn't important to the issue anyway. If
> you want to enforce VMA rule for RDMA it is an RDMA specific discussion
> in which i don't want to be involve, it is not my turf :)

Always having a VMA changes the discussion - the question is how to
create a VMA that represents IO device memory, and how do DMA
consumers extract the correct information from that VMA to pass to the
kernel DMA API so it can set up peer-peer DMA.

> What matter is the back channel API between peer-to-peer device. Like
> the above patchset points out for GPU we need to be able to invalidate
> a mapping at any point in time. Pining is not something we want to
> live with.

We have MMU notifiers to handle this today in RDMA. Async RDMA MR
Invalidate like you see in the above out of tree patches is totally
crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-05 20:07       ` Jason Gunthorpe
@ 2017-01-05 20:19         ` Jerome Glisse
  2017-01-05 22:42           ` Jason Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Jerome Glisse @ 2017-01-05 20:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, david1.zhou,
	qiang.yu

On Thu, Jan 05, 2017 at 01:07:19PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 02:54:24PM -0500, Jerome Glisse wrote:
> 
> > Mellanox and NVidia support peer to peer with what they market a
> > GPUDirect. It only works without IOMMU. It is probably not upstream :
> > 
> > https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg21402.html
> > 
> > I thought it was but it seems it require an out of tree driver to work.
> 
> Right, it is out of tree and not under consideration for mainline.
> 
> > Wether there is a vma or not isn't important to the issue anyway. If
> > you want to enforce VMA rule for RDMA it is an RDMA specific discussion
> > in which i don't want to be involve, it is not my turf :)
> 
> Always having a VMA changes the discussion - the question is how to
> create a VMA that reprensents IO device memory, and how do DMA
> consumers extract the correct information from that VMA to pass to the
> kernel DMA API so it can setup peer-peer DMA.

Well, my point is that it can't be. In the HMM case, inside a single VMA
you can have one page in GPU memory at address A but the next page in
regular memory at A+4k. So handling this at the VMA level does not make
sense. In this case you would get the device from the struct page
and query through a common API to determine if you can do peer
to peer. If not, it would trigger migration back to regular memory.
If yes, then you still have to solve the IOMMU issue and hence the DMA
API changes that were proposed.

In the GPUDirect case, the idea is that you have a specific device vma
that you map for peer-to-peer. Here things can be handled at the vma level
and not at a page level. The expectation here is that the GPU userspace
exposes a special API to allow RDMA to happen directly on GPU objects
allocated through the GPU-specific API (ie it is not regular memory and
it is not accessible by the CPU).


Both cases are disjoint. Both need to solve the IOMMU issue, which
seems to be best solved at the DMA API level.


> > What matter is the back channel API between peer-to-peer device. Like
> > the above patchset points out for GPU we need to be able to invalidate
> > a mapping at any point in time. Pining is not something we want to
> > live with.
> 
> We have MMU notifiers to handle this today in RDMA. Async RDMA MR
> Invalidate like you see in the above out of tree patches is totally
> crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.

Well, there is still a large base of hardware that does not have such a
feature, and some people would like to be able to keep using it.
I believe allowing direct access to GPU objects that are otherwise
hidden from regular kernel memory management is still meaningful.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-05 20:19         ` Jerome Glisse
@ 2017-01-05 22:42           ` Jason Gunthorpe
  2017-01-05 23:23             ` Jerome Glisse
  0 siblings, 1 reply; 126+ messages in thread
From: Jason Gunthorpe @ 2017-01-05 22:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, david1.zhou,
	qiang.yu

On Thu, Jan 05, 2017 at 03:19:36PM -0500, Jerome Glisse wrote:

> > Always having a VMA changes the discussion - the question is how to
> > create a VMA that reprensents IO device memory, and how do DMA
> > consumers extract the correct information from that VMA to pass to the
> > kernel DMA API so it can setup peer-peer DMA.
> 
> Well my point is that it can't be. In HMM case inside a single VMA
> you
[..]

> In the GPUDirect case the idea is that you have a specific device vma
> that you map for peer to peer.

[..]

I still don't understand what you are driving at - you've said in both
cases a user VMA exists.

From my perspective in RDMA, all I want is a core kernel flow to
convert a '__user *' into a scatter list of DMA addresses that works no
matter what is backing that VMA, be it HMM, a 'hidden' GPU object, or
struct page memory.

A '__user *' pointer is the only way to setup a RDMA MR, and I see no
reason to have another API at this time.

The details of how to translate to a scatter list are an MM subject,
and the MM folks need to get that sorted out.

I just don't care if that routine works at a page level, or a whole
VMA level, or some combination of both, that is up to the MM team to
figure out :)

> a page level. Expectation here is that the GPU userspace expose a special
> API to allow RDMA to directly happen on GPU object allocated through
> GPU specific API (ie it is not regular memory and it is not accessible
> by CPU).

So, how do you identify these GPU objects? How do you expect RDMA to
convert them to scatter lists? How will ODP work?

> > We have MMU notifiers to handle this today in RDMA. Async RDMA MR
> > Invalidate like you see in the above out of tree patches is totally
> > crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.
> 
> Well there is still a large base of hardware that do not have such
> feature and some people would like to be able to keep using those.

Hopefully someone will figure out how to do that without the crazy
async MR invalidation.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-05 22:42           ` Jason Gunthorpe
@ 2017-01-05 23:23             ` Jerome Glisse
  2017-01-06  0:30               ` Jason Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Jerome Glisse @ 2017-01-05 23:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, david1.zhou,
	qiang.yu

On Thu, Jan 05, 2017 at 03:42:15PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 03:19:36PM -0500, Jerome Glisse wrote:
> 
> > > Always having a VMA changes the discussion - the question is how to
> > > create a VMA that reprensents IO device memory, and how do DMA
> > > consumers extract the correct information from that VMA to pass to the
> > > kernel DMA API so it can setup peer-peer DMA.
> > 
> > Well my point is that it can't be. In HMM case inside a single VMA
> > you
> [..]
> 
> > In the GPUDirect case the idea is that you have a specific device vma
> > that you map for peer to peer.
> 
> [..]
> 
> I still don't understand what you driving at - you've said in both
> cases a user VMA exists.

In the former case, no, there is no VMA directly, but if you want one then
a device can provide one. Such a VMA is useless though, as CPU access is
not expected.

> 
> From my perspective in RDMA, all I want is a core kernel flow to
> convert a '__user *' into a scatter list of DMA addresses, that works no
> matter what is backing that VMA, be it HMM, a 'hidden' GPU object, or
> struct page memory.
> 
> A '__user *' pointer is the only way to setup a RDMA MR, and I see no
> reason to have another API at this time.
> 
> The details of how to translate to a scatter list are a MM subject,
> and the MM folks need to get 
> 
> I just don't care if that routine works at a page level, or a whole
> VMA level, or some combination of both, that is up to the MM team to
> figure out :)

And that's what I am trying to get across. There are 2 cases here:
what exists on today's hardware, things like GPUDirect that work at the
VMA level, versus where some new hardware is going, where we want to do
things at the page level. Both require different APIs at different levels.

What I was trying to get across is that no matter what level you
consider, in the end you still need something at the DMA API level,
and that the 2 different use cases (device vma or regular vma) mean
2 different APIs for the device driver.

> 
> > a page level. Expectation here is that the GPU userspace expose a special
> > API to allow RDMA to directly happen on GPU object allocated through
> > GPU specific API (ie it is not regular memory and it is not accessible
> > by CPU).
> 
> So, how do you identify these GPU objects? How do you expect RDMA
> convert them to scatter lists? How will ODP work?

No ODP on those. If you want a vma, the GPU device driver can provide
one. GPU objects are disjoint from regular memory (which comes from some
form of mmap). They are created through ioctls and in many cases are
never exposed to the CPU. They only exist inside the GPU driver realm.

Nonetheless there are use cases where exchanging those objects across
computers over a network makes sense. I am not an end user here :)


> > > We have MMU notifiers to handle this today in RDMA. Async RDMA MR
> > > Invalidate like you see in the above out of tree patches is totally
> > > crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.
> > 
> > Well there is still a large base of hardware that do not have such
> > feature and some people would like to be able to keep using those.
> 
> Hopefully someone will figure out how to do that without the crazy
> async MR invalidation.

Personally I don't care too much about this old hardware and thus I am
fine without supporting it. The open source userspace is playing
catch-up, and doing features for old hardware probably does not make sense.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-05 23:23             ` Jerome Glisse
@ 2017-01-06  0:30               ` Jason Gunthorpe
  2017-01-06  0:41                 ` Serguei Sagalovitch
  2017-01-06  1:58                 ` Jerome Glisse
  0 siblings, 2 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2017-01-06  0:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, david1.zhou,
	qiang.yu

On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:

> > I still don't understand what you driving at - you've said in both
> > cases a user VMA exists.
> 
> In the former case no, there is no VMA directly but if you want one than
> a device can provide one. But such VMA is useless as CPU access is not
> expected.

I disagree that it is useless; the VMA is going to be necessary to support
upcoming things like CAPI, and you need it to support O_DIRECT from the
filesystem, DPDK, etc. This is why I am opposed to any model that is
not VMA based for setting up RDMA - that is short-sighted and does
not seem to reflect where the industry is going.

So focus on having VMA backed by actual physical memory that covers
your GPU objects and ask how do we wire up the '__user *' to the DMA
API in the best way so the DMA API still has enough information to
setup IOMMUs and whatnot.

> What i was trying to get accross is that no matter what level you
> consider in the end you still need something at the DMA API level.
> And that the 2 different use case (device vma or regular vma) means
> 2 differents API for the device driver.

I agree we need new stuff at the DMA API level, but I am opposed to
the idea we need two API paths that the *driver* has to figure out.
That is fundamentally not what I want as a driver developer.

Give me a common API to convert '__user *' to a scatter list and pin
the pages. This needs to figure out your two cases. And Huge
Pages. And ZONE_DEVICE.. (a better get_user_pages)

Give me an API to take the scatter list and DMA map it, handling all
the stuff associated with peer-peer. (a better dma_map_sg)

Give me a notifier scheme to rework my scatter list when physical
pages need to change (mmu notifiers)

Use the scatter list memory to convey needed information from the
first step to the second.

Do not bother the driver with distinctions on what kind of memory is
behind that VMA. Don't ask me to use get_user_pages or
gpu_get_user_pages, do not ask me to use dma_map_sg or
dma_map_sg_peer_direct. The Driver Doesn't Need To Know.
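
For reference, this is roughly the per-driver dance that exists today for
ordinary struct page memory, and that the wish-list above would fold behind
a single memory-type-agnostic API (a sketch against the circa-2017 kernel
interfaces, with error unwinding trimmed):

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Sketch: '__user *' -> pinned pages -> scatterlist -> DMA mapping. */
static int map_user_buf(struct device *dev, unsigned long uaddr, size_t len,
			struct sg_table *sgt, struct page **pages)
{
	unsigned int npages = DIV_ROUND_UP(len + (uaddr & ~PAGE_MASK),
					   PAGE_SIZE);
	int pinned, nents;

	/* pin the user pages (this is what fails for device memory today) */
	pinned = get_user_pages_fast(uaddr, npages, 1 /* write */, pages);
	if (pinned < 0)
		return pinned;

	/* build a scatterlist over the pinned pages */
	if (sg_alloc_table_from_pages(sgt, pages, pinned,
				      uaddr & ~PAGE_MASK, len, GFP_KERNEL))
		return -ENOMEM;

	/* map it for DMA; peer-to-peer awareness would have to live here */
	nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_BIDIRECTIONAL);
	return nents ? 0 : -EIO;
}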

IMHO this is why GPU direct is not mergeable - it creates a crazy
parallel mini-mm subsystem inside RDMA and uses that to connect to a
GPU driver; everything is expected to have parallel paths for GPU
direct and normal MM. No good at all.

> > So, how do you identify these GPU objects? How do you expect RDMA
> > convert them to scatter lists? How will ODP work?
> 
> No ODP on those. If you want vma, the GPU device driver can provide

You said you needed invalidate, that has to be done via ODP.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-06  0:30               ` Jason Gunthorpe
@ 2017-01-06  0:41                 ` Serguei Sagalovitch
  2017-01-06  1:58                 ` Jerome Glisse
  1 sibling, 0 replies; 126+ messages in thread
From: Serguei Sagalovitch @ 2017-01-06  0:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse
  Cc: Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Blinzer, Paul, Koenig, Christian, Suthikulpanit,
	Suravee, Sander, Ben, hch, david1.zhou, qiang.yu

On 2017-01-05 07:30 PM, Jason Gunthorpe wrote:
> ........ but I am opposed to
> the idea we need two API paths that the *driver* has to figure out.
> That is fundamentally not what I want as a driver developer.
>
> Give me a common API to convert '__user *' to a scatter list and pin
> the pages.
Completely agreed. IMHO there is no sense in duplicating the same logic
everywhere, as well as trying to find the places where it is missing.

Sincerely yours,
Serguei Sagalovitch

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-06  0:30               ` Jason Gunthorpe
  2017-01-06  0:41                 ` Serguei Sagalovitch
@ 2017-01-06  1:58                 ` Jerome Glisse
  2017-01-06 16:56                   ` Serguei Sagalovitch
  1 sibling, 1 reply; 126+ messages in thread
From: Jerome Glisse @ 2017-01-06  1:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, david1.zhou,
	qiang.yu

On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> 
> > > I still don't understand what you driving at - you've said in both
> > > cases a user VMA exists.
> > 
> > In the former case no, there is no VMA directly but if you want one than
> > a device can provide one. But such VMA is useless as CPU access is not
> > expected.
> 
> I disagree it is useless, the VMA is going to be necessary to support
> upcoming things like CAPI, you need it to support O_DIRECT from the
> filesystem, DPDK, etc. This is why I am opposed to any model that is
> not VMA based for setting up RDMA - that is shorted sighted and does
> not seem to reflect where the industry is going.
> 
> So focus on having VMA backed by actual physical memory that covers
> your GPU objects and ask how do we wire up the '__user *' to the DMA
> API in the best way so the DMA API still has enough information to
> setup IOMMUs and whatnot.

I am talking about 2 different things. Existing hardware and APIs where
you _do not_ have a vma and you do not need one: this is just existing
stuff. Some closed drivers provide functionality on top of this design.
The question is, do we want to do the same? If yes, and you insist on
having a vma, we could provide one, but this does not apply and is
useless for where we are going with new hardware.

With new hardware you just use malloc or mmap to allocate memory and then
you use it directly with the device. The device driver can migrate any part
of the process address space to device memory. In this scheme you have your
usual VMAs, but there is nothing special about them.

Now when you try to do get_user_pages() on any page that is inside the
device it will fail, because we do not allow any device memory to be
pinned. There are various reasons for that, and they are not going away
in any hw in the planning (so for the next few years).

Still, we do want to support peer-to-peer mappings. The plan is to only
do so with ODP-capable hardware. We still need to solve the IOMMU issue,
and it needs special handling inside the RDMA device. The way it works is
that RDMA asks for a GPU page, the GPU checks if it has room inside its
PCI bar to map this page for the device, and this can fail. If it succeeds,
then you need the IOMMU to let the RDMA device access the GPU PCI bar.

So here we have 2 orthogonal problems. The first one is how to make 2
drivers talk to each other to set up mappings to allow peer-to-peer, and
the second is about the IOMMU.


> > What i was trying to get accross is that no matter what level you
> > consider in the end you still need something at the DMA API level.
> > And that the 2 different use case (device vma or regular vma) means
> > 2 differents API for the device driver.
> 
> I agree we need new stuff at the DMA API level, but I am opposed to
> the idea we need two API paths that the *driver* has to figure out.
> That is fundamentally not what I want as a driver developer.
> 
> Give me a common API to convert '__user *' to a scatter list and pin
> the pages. This needs to figure out your two cases. And Huge
> Pages. And ZONE_DIRECT.. (a better get_user_pages)

Pinning is not gonna happen; like I said, it would hinder the GPU to the
point it would become useless.


> Give me an API to take the scatter list and DMA map it, handling all
> the stuff associated with peer-peer. (a better dma_map_sg)
> 
> Give me a notifier scheme to rework my scatter list when physical
> pages need to change (mmu notifiers)
> 
> Use the scatter list memory to convey needed information from the
> first step to the second.
> 
> Do not bother the driver with distinctions on what kind of memory is
> behind that VMA. Don't ask me to use get_user_pages or
> gpu_get_user_pages, do not ask me to use dma_map_sg or
> dma_map_sg_peer_direct. The Driver Doesn't Need To Know.

I understand you want it easy, but there must be a part that is aware,
at the very least the ODP logic. Creating a peer-to-peer mapping is a
multi-step process and some of those steps can fail. The fallback is
always to migrate back to system memory as a default path that cannot
fail, except if we are out of memory.


> IMHO this is why GPU direct is not mergable - it creates a crazy
> parallel mini-mm subsystem inside RDMA and uses that to connect to a
> GPU driver, everything is expected to have parallel paths for GPU
> direct and normal MM. No good at all.

Existing hardware and new hardware work differently. I am trying to
explain the two different designs needed for each one. You understandably
dislike the existing hardware, which has more stringent requirements,
cannot be supported transparently, and needs dedicated communication
between the two drivers.

New hardware has a completely different API in userspace. We can
decide to only support the latter and forget about the former.


> > > So, how do you identify these GPU objects? How do you expect RDMA
> > > convert them to scatter lists? How will ODP work?
> > 
> > No ODP on those. If you want vma, the GPU device driver can provide
> 
> You said you needed invalidate, that has to be done via ODP.

Invalidate is needed for both old and new hardware. With new hardware the
mmu_notifier is good enough, but you still need special handling when
trying to establish a mapping in the HMM case, where not all of the GPU
memory can be accessed through the bar. So no matter what, it will need
special handling, but this can happen in the common infrastructure code
(in the ODP fault path).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-05 19:01   ` Jason Gunthorpe
  2017-01-05 19:54     ` Jerome Glisse
@ 2017-01-06 15:08     ` Henrique Almeida
  1 sibling, 0 replies; 126+ messages in thread
From: Henrique Almeida @ 2017-01-06 15:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Deucher, Alexander, linux-kernel, linux-rdma,
	linux-nvdimm@lists.01.org, Linux-media, dri-devel, linux-pci,
	Kuehling, Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig,
	Christian, Suthikulpanit, Suravee, Sander, Ben, hch, david1.zhou,
	qiang.yu

 Hello, I've been watching this thread not as a kernel developer, but
as a user interested in doing peer-to-peer access between a network
card and a GPU. I believe that merging raw direct access with vma
overcomplicates things for our use case. We'll have a very large
camera streaming data at high throughput (up to 100 Gbps) to the GPU,
which will operate in soft real-time mode and write back the results
to RDMA-enabled network storage. The CPU will only arrange the
connection between the GPU and the network card. Having things like
paging or memory overcommit is possible, but they are not required and
they might consistently decrease the quality of the data acquisition.

 I see my use case as something likely to exist for others and a strong
reason to split the implementation in two.


2017-01-05 16:01 GMT-03:00 Jason Gunthorpe <jgunthorpe@obsidianresearch.com>:
> On Thu, Jan 05, 2017 at 01:39:29PM -0500, Jerome Glisse wrote:
>
>>   1) peer-to-peer because of userspace specific API like NVidia GPU
>>     direct (AMD is pushing its own similar API i just can't remember
>>     marketing name). This does not happen through a vma, this happens
>>     through specific device driver call going through device specific
>>     ioctl on both side (GPU and RDMA). So both kernel driver are aware
>>     of each others.
>
> Today you can only do user-initiated RDMA operations in conjection
> with a VMA.
>
> We'd need a really big and strong reason to create an entirely new
> non-VMA based memory handle scheme for RDMA.
>
> So my inclination is to just completely push back on this idea. You
> need a VMA to do RMA.
>
> GPUs need to create VMAs for the memory they want to RDMA from, even
> if the VMA handle just causes SIGBUS for any CPU access.
>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-06  1:58                 ` Jerome Glisse
@ 2017-01-06 16:56                   ` Serguei Sagalovitch
  2017-01-06 17:37                     ` Jerome Glisse
  0 siblings, 1 reply; 126+ messages in thread
From: Serguei Sagalovitch @ 2017-01-06 16:56 UTC (permalink / raw)
  To: Jerome Glisse, Jason Gunthorpe
  Cc: Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Blinzer, Paul, Koenig, Christian, Suthikulpanit,
	Suravee, Sander, Ben, hch, david1.zhou, qiang.yu

On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
>> On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
>>
>>>> I still don't understand what you driving at - you've said in both
>>>> cases a user VMA exists.
>>> In the former case no, there is no VMA directly but if you want one than
>>> a device can provide one. But such VMA is useless as CPU access is not
>>> expected.
>> I disagree it is useless, the VMA is going to be necessary to support
>> upcoming things like CAPI, you need it to support O_DIRECT from the
>> filesystem, DPDK, etc. This is why I am opposed to any model that is
>> not VMA based for setting up RDMA - that is shorted sighted and does
>> not seem to reflect where the industry is going.
>>
>> So focus on having VMA backed by actual physical memory that covers
>> your GPU objects and ask how do we wire up the '__user *' to the DMA
>> API in the best way so the DMA API still has enough information to
>> setup IOMMUs and whatnot.
> I am talking about 2 different thing. Existing hardware and API where you
> _do not_ have a vma and you do not need one. This is just existing stuff.
I do not understand why you assume that the existing API doesn't need one.
I would say that a lot of __existing__ user-level APIs and their support
in the kernel (especially outside of the graphics domain) assume that we
have a vma and deal with __user * pointers.
> Some close driver provide a functionality on top of this design. Question
> is do we want to do the same ? If yes and you insist on having a vma we
> could provide one but this is does not apply and is useless for where we
> are going with new hardware.
>
> With new hardware you just use malloc or mmap to allocate memory and then
> you use it directly with the device. Device driver can migrate any part of
> the process address space to device memory. In this scheme you have your
> usual VMAs but there is nothing special about them.
Assuming that the whole device memory is CPU accessible, which looks
like the direction where we are going:
- You forgot about the use case when we want or need to allocate memory
directly on the device (why do we need to migrate anything if not needed?).
- We may want to use the CPU to access such memory on the device to avoid
any unnecessary migration back.
- We may have more device memory than system memory.
E.g. if you have 12 GPUs w/64GB each it will already give us ~0.7 TB,
not mentioning NVDIMM cards which could also be used as memory
storage for other device access.
- We also may want/need to share GPU memory between different
processes.
> Now when you try to do get_user_page() on any page that is inside the
> device it will fails because we do not allow any device memory to be pin.
> There is various reasons for that and they are not going away in any hw
> in the planing (so for next few years).
>
> Still we do want to support peer to peer mapping. Plan is to only do so
> with ODP capable hardware. Still we need to solve the IOMMU issue and
> it needs special handling inside the RDMA device. The way it works is
> that RDMA ask for a GPU page, GPU check if it has place inside its PCI
> bar to map this page for the device, this can fail. If it succeed then
> you need the IOMMU to let the RDMA device access the GPU PCI bar.
>
> So here we have 2 orthogonal problem. First one is how to make 2 drivers
> talks to each other to setup mapping to allow peer to peer and second is
> about IOMMU.
>
I think that there is a third problem: a lot of existing user-level APIs
(MPI, IB Verbs, file i/o, etc.) deal with pointers to the buffers.
Potentially it would be ideal to support use cases where those buffers
are located in device memory, avoiding any unnecessary migration /
double-buffering.
Currently a lot of infrastructure in the kernel assumes that this is a
user pointer and calls "get_user_pages" to get an s/g list. What is your
opinion on how it should be changed to deal with cases when the "buffer"
is in device memory?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-06 16:56                   ` Serguei Sagalovitch
@ 2017-01-06 17:37                     ` Jerome Glisse
  2017-01-06 18:26                       ` Jason Gunthorpe
  0 siblings, 1 reply; 126+ messages in thread
From: Jerome Glisse @ 2017-01-06 17:37 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: Jerome Glisse, Jason Gunthorpe, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Blinzer, Paul, Koenig, Christian, Suthikulpanit,
	Suravee, Sander, Ben, hch, david1.zhou, qiang.yu

On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote:
> On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> > > 
> > > > > I still don't understand what you driving at - you've said in both
> > > > > cases a user VMA exists.
> > > > In the former case no, there is no VMA directly but if you want one than
> > > > a device can provide one. But such VMA is useless as CPU access is not
> > > > expected.
> > > I disagree it is useless, the VMA is going to be necessary to support
> > > upcoming things like CAPI, you need it to support O_DIRECT from the
> > > filesystem, DPDK, etc. This is why I am opposed to any model that is
> > > not VMA based for setting up RDMA - that is shorted sighted and does
> > > not seem to reflect where the industry is going.
> > > 
> > > So focus on having VMA backed by actual physical memory that covers
> > > your GPU objects and ask how do we wire up the '__user *' to the DMA
> > > API in the best way so the DMA API still has enough information to
> > > setup IOMMUs and whatnot.
> > I am talking about 2 different thing. Existing hardware and API where you
> > _do not_ have a vma and you do not need one. This is just existing stuff.
> I do not understand why you assume that existing API doesn't  need one.
> I would say that a lot of __existing__ user level API and their support in
> kernel (especially outside of graphics domain) assumes that we have vma and
> deal with __user * pointers.

Well I am thinking of GPUDirect here. Some GPUDirect use cases do not have a
vma (struct vm_area_struct) associated with them; they directly apply to GPU
objects that aren't exposed to the CPU. Yes, some use cases have a vma for
shared buffers.

In the open source drivers it is true that we have a vma more often than not.

> > Some close driver provide a functionality on top of this design. Question
> > is do we want to do the same ? If yes and you insist on having a vma we
> > could provide one but this is does not apply and is useless for where we
> > are going with new hardware.
> > 
> > With new hardware you just use malloc or mmap to allocate memory and then
> > you use it directly with the device. Device driver can migrate any part of
> > the process address space to device memory. In this scheme you have your
> > usual VMAs but there is nothing special about them.
>
> Assuming that the whole device memory is CPU accessible and it looks
> like the direction where we are going:
> - You forgot about use case when we want or need to allocate memory
> directly on device (why we need to migrate anything if not needed?).
> - We may want to use CPU to access such memory on device to avoid
> any unnecessary migration back.
> - We may have more device memory than the system one.
> E.g. if you have 12 GPUs w/64GB each it will already give us ~0.7 TB
> not mentioning NVDIMM cards which could also be used as memory
> storage for other device access.
> - We also may want/need to share GPU memory between different
> processes.

Here I am talking about platforms where GPU memory is not accessible at
all by the CPU (because of PCIe restrictions, think CPU atomic operations
on IO memory).

So I really distinguish between CAPI/CCIX and PCIe. Some platforms will
have CAPI/CCIX, others won't. HMM applies mostly to the latter. Some HMM
functionalities are still useful with CAPI/CCIX.

Note that HMM does support allocation on the GPU first. In the current design
this can happen when the GPU is the first to access an unpopulated virtual
address.


For platforms where GPU memory is accessible the plan is either something
like CDM (Coherent Device Memory) or relying on ZONE_DEVICE. So all GPU
memory has struct pages and those are like ordinary pages. CDM still
wants some restrictions, like avoiding CPU allocations happening on the GPU
when there is memory pressure ... For all intents and purposes this
will work transparently with respect to RDMA because we assume on those
systems that the RDMA device is CAPI/CCIX and that it can peer to other devices.
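
As a rough illustration of the ZONE_DEVICE side of this (the
devm_memremap_pages() prototype has changed several times between kernel
versions; the form below is roughly the 4.x-era one and is only a sketch):

#include <linux/memremap.h>
#include <linux/percpu-refcount.h>
#include <linux/pci.h>

/*
 * Sketch: give a PCI BAR struct pages via ZONE_DEVICE so the rest of the
 * kernel can treat it much like ordinary memory.
 */
static void *expose_bar_as_zone_device(struct pci_dev *pdev, int bar,
                                       struct percpu_ref *ref)
{
        struct resource *res = &pdev->resource[bar];

        /* Creates struct pages covering 'res' and returns a kernel
         * mapping of the device memory. */
        return devm_memremap_pages(&pdev->dev, res, ref, NULL);
}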


> > Now when you try to do get_user_page() on any page that is inside the
> > device it will fails because we do not allow any device memory to be pin.
> > There is various reasons for that and they are not going away in any hw
> > in the planing (so for next few years).
> > 
> > Still we do want to support peer to peer mapping. Plan is to only do so
> > with ODP capable hardware. Still we need to solve the IOMMU issue and
> > it needs special handling inside the RDMA device. The way it works is
> > that RDMA ask for a GPU page, GPU check if it has place inside its PCI
> > bar to map this page for the device, this can fail. If it succeed then
> > you need the IOMMU to let the RDMA device access the GPU PCI bar.
> > 
> > So here we have 2 orthogonal problem. First one is how to make 2 drivers
> > talks to each other to setup mapping to allow peer to peer and second is
> > about IOMMU.
> > 
> I think that there is the third problem:  A lot of existing user level API
> (MPI, IB Verbs, file i/o, etc.) deal with pointers to the buffers.
> Potentially it would be ideally to support use cases when those buffers are
> located in device memory avoiding any unnecessary migration /
> double-buffering.
> Currently a lot of infrastructure in kernel assumes that this is the user
> pointer and call "get_user_pages"  to get s/g.   What is your opinion
> how it should be changed to deal with cases when "buffer" is in
> device memory?

For HMM the plan is to restrict to ODP and either replace ODP with HMM or change
ODP to not use get_user_pages_remote() but directly fetch information from the
CPU page table. Everything else stays as it is. I posted a patchset to replace
ODP with HMM in the past.
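
A sketch of the ODP-style flow being referred to (get_user_pages_remote() and
the HMM helpers have changed prototypes across kernel versions, so treat this
as pseudocode for the idea rather than a drop-in implementation):

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * On a device page fault, resolve the faulting user address through the
 * owning mm instead of taking a long-term pin.
 */
static int odp_style_fault(struct mm_struct *mm, unsigned long addr,
                           bool write, struct page **page)
{
        long ret;

        down_read(&mm->mmap_sem);
        ret = get_user_pages_remote(NULL, mm, addr, 1,
                                    write ? FOLL_WRITE : 0, page, NULL);
        up_read(&mm->mmap_sem);

        /*
         * An HMM-based variant would instead walk/fault the CPU page table
         * through HMM's own helpers and mirror the result into the device
         * page table, with invalidations driven by mmu_notifiers.
         */
        return ret == 1 ? 0 : -EFAULT;
}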

For the older kind of API (GPUDirect on yesterday's hardware) a special
userspace API is used. If we don't care about supporting those I don't mind much,
but some people see benefit in not having to deal with a vma (struct vm_area_struct).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-06 17:37                     ` Jerome Glisse
@ 2017-01-06 18:26                       ` Jason Gunthorpe
  2017-01-06 19:12                         ` Deucher, Alexander
  2017-01-06 22:10                         ` Logan Gunthorpe
  0 siblings, 2 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2017-01-06 18:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Serguei Sagalovitch, Jerome Glisse, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Blinzer, Paul, Koenig, Christian, Suthikulpanit,
	Suravee, Sander, Ben, hch, david1.zhou, qiang.yu

On Fri, Jan 06, 2017 at 12:37:22PM -0500, Jerome Glisse wrote:
> On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote:
> > On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> > > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> > > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> > > > 
> > > > > > I still don't understand what you driving at - you've said in both
> > > > > > cases a user VMA exists.
> > > > > In the former case no, there is no VMA directly but if you want one than
> > > > > a device can provide one. But such VMA is useless as CPU access is not
> > > > > expected.
> > > > I disagree it is useless, the VMA is going to be necessary to support
> > > > upcoming things like CAPI, you need it to support O_DIRECT from the
> > > > filesystem, DPDK, etc. This is why I am opposed to any model that is
> > > > not VMA based for setting up RDMA - that is shorted sighted and does
> > > > not seem to reflect where the industry is going.
> > > > 
> > > > So focus on having VMA backed by actual physical memory that covers
> > > > your GPU objects and ask how do we wire up the '__user *' to the DMA
> > > > API in the best way so the DMA API still has enough information to
> > > > setup IOMMUs and whatnot.
> > > I am talking about 2 different thing. Existing hardware and API where you
> > > _do not_ have a vma and you do not need one. This is just
> > > > existing stuff.

> > I do not understand why you assume that existing API doesn't  need one.
> > I would say that a lot of __existing__ user level API and their support in
> > kernel (especially outside of graphics domain) assumes that we have vma and
> > deal with __user * pointers.

+1

> Well i am thinking to GPUDirect here. Some of GPUDirect use case do not have
> vma (struct vm_area_struct) associated with them they directly apply to GPU
> object that aren't expose to CPU. Yes some use case have vma for share buffer.

Let's stop talking about GPUDirect. Today we can't even make a VMA
pointing at a PCI BAR work properly in the kernel - let's start there
please. People can argue over other options once that is done.

> For HMM plan is to restrict to ODP and either to replace ODP with HMM or change
> ODP to not use get_user_pages_remote() but directly fetch informations from
> CPU page table. Everything else stay as it is. I posted patchset to replace
> ODP with HMM in the past.

Make a generic API for all of this and you'd have my vote..

IMHO, you must support basic pinning semantics - that is necessary to
support generic short lived DMA (eg filesystem, etc). That hardware
can clearly do that if it can support ODP.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* RE: Enabling peer to peer device transactions for PCIe devices
  2017-01-06 18:26                       ` Jason Gunthorpe
@ 2017-01-06 19:12                         ` Deucher, Alexander
  2017-01-06 22:10                         ` Logan Gunthorpe
  1 sibling, 0 replies; 126+ messages in thread
From: Deucher, Alexander @ 2017-01-06 19:12 UTC (permalink / raw)
  To: 'Jason Gunthorpe', Jerome Glisse
  Cc: Sagalovitch, Serguei, Jerome Glisse,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org',
	Kuehling, Felix, Blinzer, Paul, Koenig, Christian, Suthikulpanit,
	Suravee, Sander, Ben, hch, Zhou, David(ChunMing),
	Yu, Qiang

> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgunthorpe@obsidianresearch.com]
> Sent: Friday, January 06, 2017 1:26 PM
> To: Jerome Glisse
> Cc: Sagalovitch, Serguei; Jerome Glisse; Deucher, Alexander; 'linux-
> kernel@vger.kernel.org'; 'linux-rdma@vger.kernel.org'; 'linux-
> nvdimm@lists.01.org'; 'Linux-media@vger.kernel.org'; 'dri-
> devel@lists.freedesktop.org'; 'linux-pci@vger.kernel.org'; Kuehling, Felix;
> Blinzer, Paul; Koenig, Christian; Suthikulpanit, Suravee; Sander, Ben;
> hch@infradead.org; Zhou, David(ChunMing); Yu, Qiang
> Subject: Re: Enabling peer to peer device transactions for PCIe devices
> 
> On Fri, Jan 06, 2017 at 12:37:22PM -0500, Jerome Glisse wrote:
> > On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote:
> > > On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> > > > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> > > > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> > > > >
> > > > > > > I still don't understand what you driving at - you've said in both
> > > > > > > cases a user VMA exists.
> > > > > > In the former case no, there is no VMA directly but if you want one
> than
> > > > > > a device can provide one. But such VMA is useless as CPU access is
> not
> > > > > > expected.
> > > > > I disagree it is useless, the VMA is going to be necessary to support
> > > > > upcoming things like CAPI, you need it to support O_DIRECT from the
> > > > > filesystem, DPDK, etc. This is why I am opposed to any model that is
> > > > > not VMA based for setting up RDMA - that is shorted sighted and
> does
> > > > > not seem to reflect where the industry is going.
> > > > >
> > > > > So focus on having VMA backed by actual physical memory that
> covers
> > > > > your GPU objects and ask how do we wire up the '__user *' to the
> DMA
> > > > > API in the best way so the DMA API still has enough information to
> > > > > setup IOMMUs and whatnot.
> > > > I am talking about 2 different thing. Existing hardware and API where
> you
> > > > _do not_ have a vma and you do not need one. This is just
> > > > > existing stuff.
> 
> > > I do not understand why you assume that existing API doesn't  need one.
> > > I would say that a lot of __existing__ user level API and their support in
> > > kernel (especially outside of graphics domain) assumes that we have vma
> and
> > > deal with __user * pointers.
> 
> +1
> 
> > Well i am thinking to GPUDirect here. Some of GPUDirect use case do not
> have
> > vma (struct vm_area_struct) associated with them they directly apply to
> GPU
> > object that aren't expose to CPU. Yes some use case have vma for share
> buffer.
> 
> Lets stop talkind about GPU direct. Today we can't even make VMA
> pointing at a PCI bar work properly in the kernel - lets start there
> please. People can argue over other options once that is done.
> 
> > For HMM plan is to restrict to ODP and either to replace ODP with HMM or
> change
> > ODP to not use get_user_pages_remote() but directly fetch informations
> from
> > CPU page table. Everything else stay as it is. I posted patchset to replace
> > ODP with HMM in the past.
> 
> Make a generic API for all of this and you'd have my vote..
> 
> IMHO, you must support basic pinning semantics - that is necessary to
> support generic short lived DMA (eg filesystem, etc). That hardware
> can clearly do that if it can support ODP.

We would definitely like to have support for hardware that can't handle page faults gracefully.

Alex

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-06 18:26                       ` Jason Gunthorpe
  2017-01-06 19:12                         ` Deucher, Alexander
@ 2017-01-06 22:10                         ` Logan Gunthorpe
  2017-01-12  4:54                           ` Stephen Bates
  1 sibling, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2017-01-06 22:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse
  Cc: david1.zhou, qiang.yu, 'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	Kuehling, Felix, Serguei Sagalovitch,
	'linux-kernel@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	Koenig, Christian, hch, Deucher, Alexander, Sander, Ben,
	Suthikulpanit, Suravee, 'linux-pci@vger.kernel.org',
	Jerome Glisse, Blinzer, Paul,
	'Linux-media@vger.kernel.org'



On 06/01/17 11:26 AM, Jason Gunthorpe wrote:

> Make a generic API for all of this and you'd have my vote..
> 
> IMHO, you must support basic pinning semantics - that is necessary to
> support generic short lived DMA (eg filesystem, etc). That hardware
> can clearly do that if it can support ODP.

I agree completely.

What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
(ie. at least those backed with ZONE_DEVICE memory). Then
GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
(using whatever interface is most appropriate) and userspace can do what
it pleases with them. This makes _so_ much sense and actually largely
already works today (as demonstrated by iopmem).

Though, of course, there are many aspects that could still be improved,
like denying CPU access to special VMAs and having get_user_pages avoid
pinning device memory, etc, etc. But all this would just be enhancements
to how VMAs work and would not be affected by the basic design described above.
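
As an illustration of the kind of interface being described, a driver mmap()
handler handing a BAR-backed region to userspace could look roughly like the
sketch below. The BAR number, the private_data layout and the use of
io_remap_pfn_range() are assumptions for the example; a ZONE_DEVICE-backed
variant would typically insert the struct-page-backed pfns from a fault
handler instead, so that get_user_pages() can find them.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pci.h>

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct pci_dev *pdev = file->private_data;          /* assumed */
        size_t len = vma->vm_end - vma->vm_start;
        phys_addr_t start = pci_resource_start(pdev, 2);    /* example BAR */

        if (len > pci_resource_len(pdev, 2))
                return -EINVAL;

        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        return io_remap_pfn_range(vma, vma->vm_start, start >> PAGE_SHIFT,
                                  len, vma->vm_page_prot);
}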

We experimented with GPU Direct and the peer memory patchset for IB and
they were broken by design. They were just a very specific hack into the
IB core and thus didn't help to support O_DIRECT or any other possible
DMA user. And the invalidation thing was completely nuts. We had to pray
an invalidation would never occur because, if it did, our application
would just break.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-06 22:10                         ` Logan Gunthorpe
@ 2017-01-12  4:54                           ` Stephen Bates
  2017-01-12 15:11                             ` Jerome Glisse
  2017-01-12 22:35                             ` Logan Gunthorpe
  0 siblings, 2 replies; 126+ messages in thread
From: Stephen Bates @ 2017-01-12  4:54 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Jerome Glisse, david1.zhou, qiang.yu,
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	Kuehling, Felix, Serguei Sagalovitch,
	'linux-kernel@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	Koenig, Christian, hch, Deucher, Alexander, Sander, Ben,
	Suthikulpanit, Suravee, 'linux-pci@vger.kernel.org',
	Jerome Glisse, Blinzer, Paul,
	'Linux-media@vger.kernel.org'

On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
>
>
> On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
>
>
>> Make a generic API for all of this and you'd have my vote..
>>
>>
>> IMHO, you must support basic pinning semantics - that is necessary to
>> support generic short lived DMA (eg filesystem, etc). That hardware can
>> clearly do that if it can support ODP.
>
> I agree completely.
>
>
> What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> (ie. at least those backed with ZONE_DEVICE memory). Then
> GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> (using whatever interface is most appropriate) and userspace can do what
> it pleases with them. This makes _so_ much sense and actually largely
> already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to
see something upstream that enables P2P even if it is only the
minimum viable useful functionality to begin with. I think aiming for the moon
(which is what HMM and things like it are) is simply going to take more
time, if it ever gets there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe
devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
BARs on the NIC). I am even seeing users who now want to move data P2P
between FPGAs and NVMe SSDs and the upstream kernel should be able to
support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above and while it is not
an in kernel API it could have been modified to be one reasonably easily.
As Logan states the driver can then choose to pass the VMAs to user-space
in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a
proposal to discuss this topic so if you are interested please respond to
the email letting the committee know this topic is of interest to you [1].

Also earlier in the thread someone discussed the issues around the IOMMU.
Given the known issues around P2P transfers in certain CPU root complexes
[2] it might just be a case of only allowing P2P when a PCIe switch
connects the two EPs. Another option is just to use CONFIG_EXPERT and make
sure people are aware of the pitfalls if they invoke the P2P option.
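
As a simplified sketch of such a "common upstream switch" check (not the exact
logic of any particular patchset), something along these lines is possible
with the existing PCI core helpers:

#include <linux/pci.h>

static bool shares_pcie_switch(struct pci_dev *a, struct pci_dev *b)
{
        struct pci_dev *up, *up_b;

        for (up = pci_upstream_bridge(a); up; up = pci_upstream_bridge(up)) {
                /* Stop once we reach a root port: above it sits the host
                 * bridge, where the problematic root complexes live. */
                if (pci_pcie_type(up) == PCI_EXP_TYPE_ROOT_PORT)
                        break;
                /* Is this bridge also upstream of 'b'? */
                for (up_b = pci_upstream_bridge(b); up_b;
                     up_b = pci_upstream_bridge(up_b))
                        if (up_b == up)
                                return true;
        }
        return false;
}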

Finally, as Jason noted, we could all just wait until
CAPI/OpenCAPI/CCIX/GenZ comes along. However given that these interfaces
are the remit of the CPU vendors I think it behooves us to solve this
problem before then. Also some of the above mentioned protocols are not
even switchable and may not be amenable to a P2P topology...

Stephen

[1] http://marc.info/?l=linux-mm&m=148156541804940&w=2
[2] https://community.mellanox.com/docs/DOC-1119

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-12  4:54                           ` Stephen Bates
@ 2017-01-12 15:11                             ` Jerome Glisse
  2017-01-12 17:17                               ` Jason Gunthorpe
  2017-01-13 13:04                               ` Christian König
  2017-01-12 22:35                             ` Logan Gunthorpe
  1 sibling, 2 replies; 126+ messages in thread
From: Jerome Glisse @ 2017-01-12 15:11 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Logan Gunthorpe, Jason Gunthorpe, david1.zhou, qiang.yu,
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	Kuehling, Felix, Serguei Sagalovitch,
	'linux-kernel@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	Koenig, Christian, hch, Deucher, Alexander, Sander, Ben,
	Suthikulpanit, Suravee, 'linux-pci@vger.kernel.org',
	Jerome Glisse, Blinzer, Paul,
	'Linux-media@vger.kernel.org'

On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:
> On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
> >
> >
> > On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
> >
> >
> >> Make a generic API for all of this and you'd have my vote..
> >>
> >>
> >> IMHO, you must support basic pinning semantics - that is necessary to
> >> support generic short lived DMA (eg filesystem, etc). That hardware can
> >> clearly do that if it can support ODP.
> >
> > I agree completely.
> >
> >
> > What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> > (ie. at least those backed with ZONE_DEVICE memory). Then
> > GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> > (using whatever interface is most appropriate) and userspace can do what
> > it pleases with them. This makes _so_ much sense and actually largely
> > already works today (as demonstrated by iopmem).
> 
> +1 for iopmem ;-)
> 
> I feel like we are going around and around on this topic. I would like to
> see something that is upstream that enables P2P even if it is only the
> minimum viable useful functionality to begin. I think aiming for the moon
> (which is what HMM and things like it are) are simply going to take more
> time if they ever get there.
> 
> There is a use case for in-kernel P2P PCIe transfers between two NVMe
> devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
> BARs on the NIC). I am even seeing users who now want to move data P2P
> between FPGAs and NVMe SSDs and the upstream kernel should be able to
> support these users or they will look elsewhere.
> 
> The iopmem patchset addressed all the use cases above and while it is not
> an in kernel API it could have been modified to be one reasonably easily.
> As Logan states the driver can then choose to pass the VMAs to user-space
> in a manner that makes sense.
> 
> Earlier in the thread someone mentioned LSF/MM. There is already a
> proposal to discuss this topic so if you are interested please respond to
> the email letting the committee know this topic is of interest to you [1].
> 
> Also earlier in the thread someone discussed the issues around the IOMMU.
> Given the known issues around P2P transfers in certain CPU root complexes
> [2] it might just be a case of only allowing P2P when a PCIe switch
> connects the two EPs. Another option is just to use CONFIG_EXPERT and make
> sure people are aware of the pitfalls if they invoke the P2P option.


iopmem is not applicable to GPUs. What I propose is to split the issue in 2,
so that everyone can reuse the part that needs to be common, namely the DMA
API part where you have to create an IOMMU mapping for one device to point
to the other device's memory.

We can have a DMA API that is agnostic to how the device memory is managed
(so it does not matter whether device memory has struct pages or not). This is
what I have been arguing in this thread. To make progress on this issue we need
to stop conflating different use cases.

So I say let's solve the IOMMU issue first and let everyone use it in their
own way with their device. I do not think we can share much more than
that.
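
As a rough sketch of that direction, dma_map_resource() (merged around v4.9)
already gives an initiating device a dma_addr_t for a physical address that
lives in another device's BAR, with the IOMMU programmed accordingly; how the
two drivers agree on that address is exactly the negotiation part left open
here:

#include <linux/dma-mapping.h>

static dma_addr_t map_peer_bar_for_dma(struct device *initiator,
                                       phys_addr_t bar_phys, size_t size)
{
        dma_addr_t addr = dma_map_resource(initiator, bar_phys, size,
                                           DMA_BIDIRECTIONAL, 0);

        /* 0 is used as "failed" here only for brevity of the sketch. */
        return dma_mapping_error(initiator, addr) ? 0 : addr;
}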

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-12 15:11                             ` Jerome Glisse
@ 2017-01-12 17:17                               ` Jason Gunthorpe
  2017-01-13 13:04                               ` Christian König
  1 sibling, 0 replies; 126+ messages in thread
From: Jason Gunthorpe @ 2017-01-12 17:17 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Stephen Bates, Logan Gunthorpe, david1.zhou, qiang.yu,
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	Kuehling, Felix, Serguei Sagalovitch,
	'linux-kernel@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	Koenig, Christian, hch, Deucher, Alexander, Sander, Ben,
	Suthikulpanit, Suravee, 'linux-pci@vger.kernel.org',
	Jerome Glisse, Blinzer, Paul,
	'Linux-media@vger.kernel.org'

On Thu, Jan 12, 2017 at 10:11:29AM -0500, Jerome Glisse wrote:
> On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:
> > > What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> > > (ie. at least those backed with ZONE_DEVICE memory). Then
> > > GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> > > (using whatever interface is most appropriate) and userspace can do what
> > > it pleases with them. This makes _so_ much sense and actually largely
> > > already works today (as demonstrated by iopmem).

> So i say let solve the IOMMU issue first and let everyone use it in their
> own way with their device. I do not think we can share much more than
> that.

Solve it for the easy ZONE_DEVICE/etc case then.

Jason

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-12  4:54                           ` Stephen Bates
  2017-01-12 15:11                             ` Jerome Glisse
@ 2017-01-12 22:35                             ` Logan Gunthorpe
  1 sibling, 0 replies; 126+ messages in thread
From: Logan Gunthorpe @ 2017-01-12 22:35 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Jason Gunthorpe, Jerome Glisse, david1.zhou, qiang.yu,
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	Kuehling, Felix, Serguei Sagalovitch,
	'linux-kernel@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	Koenig, Christian, hch, Deucher, Alexander, Sander, Ben,
	Suthikulpanit, Suravee, 'linux-pci@vger.kernel.org',
	Jerome Glisse, Blinzer, Paul,
	'Linux-media@vger.kernel.org'



On 11/01/17 09:54 PM, Stephen Bates wrote:
> The iopmem patchset addressed all the use cases above and while it is not
> an in kernel API it could have been modified to be one reasonably easily.
> As Logan states the driver can then choose to pass the VMAs to user-space
> in a manner that makes sense.

Just to clarify: the iopmem patchset had one patch that allowed for
slightly more flexible zone device mappings which ought to be useful for
everyone.

The other patch (which was iopmem proper) was more of an example of how
the zone_device memory _could_ be exposed to userspace with "iopmem"
hardware that looks similar to nvdimm hardware. Iopmem was not really
useful, in itself, for NVMe devices and it was never expected to be
useful for GPUs.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-01-12 15:11                             ` Jerome Glisse
  2017-01-12 17:17                               ` Jason Gunthorpe
@ 2017-01-13 13:04                               ` Christian König
  1 sibling, 0 replies; 126+ messages in thread
From: Christian König @ 2017-01-13 13:04 UTC (permalink / raw)
  To: Jerome Glisse, Stephen Bates
  Cc: Logan Gunthorpe, Jason Gunthorpe, david1.zhou, qiang.yu,
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	Kuehling, Felix, Serguei Sagalovitch,
	'linux-kernel@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	hch, Deucher, Alexander, Sander, Ben, Suthikulpanit, Suravee,
	'linux-pci@vger.kernel.org',
	Jerome Glisse, Blinzer, Paul,
	'Linux-media@vger.kernel.org'

Am 12.01.2017 um 16:11 schrieb Jerome Glisse:
> On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:
>> On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
>>>
>>> On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
>>>
>>>
>>>> Make a generic API for all of this and you'd have my vote..
>>>>
>>>>
>>>> IMHO, you must support basic pinning semantics - that is necessary to
>>>> support generic short lived DMA (eg filesystem, etc). That hardware can
>>>> clearly do that if it can support ODP.
>>> I agree completely.
>>>
>>>
>>> What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
>>> (ie. at least those backed with ZONE_DEVICE memory). Then
>>> GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
>>> (using whatever interface is most appropriate) and userspace can do what
>>> it pleases with them. This makes _so_ much sense and actually largely
>>> already works today (as demonstrated by iopmem).
>> +1 for iopmem ;-)
>>
>> I feel like we are going around and around on this topic. I would like to
>> see something that is upstream that enables P2P even if it is only the
>> minimum viable useful functionality to begin. I think aiming for the moon
>> (which is what HMM and things like it are) are simply going to take more
>> time if they ever get there.
>>
>> There is a use case for in-kernel P2P PCIe transfers between two NVMe
>> devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
>> BARs on the NIC). I am even seeing users who now want to move data P2P
>> between FPGAs and NVMe SSDs and the upstream kernel should be able to
>> support these users or they will look elsewhere.
>>
>> The iopmem patchset addressed all the use cases above and while it is not
>> an in kernel API it could have been modified to be one reasonably easily.
>> As Logan states the driver can then choose to pass the VMAs to user-space
>> in a manner that makes sense.
>>
>> Earlier in the thread someone mentioned LSF/MM. There is already a
>> proposal to discuss this topic so if you are interested please respond to
>> the email letting the committee know this topic is of interest to you [1].
>>
>> Also earlier in the thread someone discussed the issues around the IOMMU.
>> Given the known issues around P2P transfers in certain CPU root complexes
>> [2] it might just be a case of only allowing P2P when a PCIe switch
>> connects the two EPs. Another option is just to use CONFIG_EXPERT and make
>> sure people are aware of the pitfalls if they invoke the P2P option.
>
> iopmem is not applicable to GPU what i propose is to split the issue in 2
> so that everyone can reuse the part that needs to be common namely the DMA
> API part where you have to create IOMMU mapping for one device to point
> to the other device memory.
>
> We can have a DMA API that is agnostic to how the device memory is manage
> (so does not matter if device memory have struct page or not). This what
> i have been arguing in this thread. To make progress on this issue we need
> to stop conflicting different use case.
>
> So i say let solve the IOMMU issue first and let everyone use it in their
> own way with their device. I do not think we can share much more than
> that.

Yeah, exactly what I said from the very beginning as well. Just hacking 
together quick solutions doesn't really solve the problem in the long term.

What we need is proper adjusting of the DMA API towards handling of P2P 
and then build solutions for the different use cases on top of that.

We should also avoid falling into the trap of trying to just handle the 
existing get_user_pages and co interfaces so that the existing code 
doesn't need to change. P2P needs to be validated for each use case 
individually and not implemented in workarounds with fingers crossed and 
hope for the best.

Regards,
Christian.

>
> Cheers,
> Jérôme

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2016-11-21 20:36 Enabling peer to peer device transactions for PCIe devices Deucher, Alexander
  2016-11-22 18:11 ` Dan Williams
  2017-01-05 18:39 ` Jerome Glisse
@ 2017-10-20 12:36 ` Ludwig Petrosyan
  2017-10-20 15:48   ` Logan Gunthorpe
  2 siblings, 1 reply; 126+ messages in thread
From: Ludwig Petrosyan @ 2017-10-20 12:36 UTC (permalink / raw)
  To: Deucher, Alexander, 'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org'
  Cc: Koenig, Christian, Sagalovitch, Serguei, Blinzer, Paul, Kuehling,
	Felix, Sander, Ben, Suthikulpanit, Suravee, Bridgman, John

Dear Linux kernel group

my name is Ludwig Petrosyan, and I am working at DESY (Germany).

We are responsible for the control system of all accelerators at DESY.

For the past 7-8 years we have switched to MTCA.4 systems and are using PCIe 
as a central bus.

I am mostly responsible for the Linux drivers of the AMC cards (PCIe 
endpoints).

The idea is to start using peer to peer transactions between PCIe endpoints 
(DMA and/or plain reads/writes).

Could you please advise me where to start? Is there some documentation on 
how to do it?


with best regards


Ludwig


On 11/21/2016 09:36 PM, Deucher, Alexander wrote:
> This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward.  Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory.  Also in cases where both devices are behind a switch, it avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based.  Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc.  Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
>   
> Some use cases:
> 1. Storage devices streaming directly to GPU device memory
> 2. GPU device memory to GPU device memory streaming
> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
> 4. DVB/V4L/SDI devices streaming directly to storage devices
>   
> Here is a relatively simple example of how this could work for testing.  This is obviously not a complete solution.
> - Device memory will be registered with Linux memory sub-system by created corresponding struct page structures for device memory
> - get_user_pages_fast() will  return corresponding struct pages when CPU address points to the device memory
> - put_page() will deal with struct pages for device memory
>   
> Previously proposed solutions and related proposals:
> 1.P2P DMA
> DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
> Pros: Low impact, already largely reviewed.
> Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.
>   
> 2. ZONE_DEVICE IO
> Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
> Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
> Pro: Doesn't waste system memory for ZONE metadata
> Cons: CPU access to ZONE metadata slow, may be lost, corrupted on device reset.
>   
> 3. DMA-BUF
> RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
> Pros: uses existing dma-buf interface
> Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.
>
> 4. iopmem
> iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
>   
> 5. HMM
> Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
>
> 6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
>   
> Alex
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-10-20 12:36 ` Ludwig Petrosyan
@ 2017-10-20 15:48   ` Logan Gunthorpe
  2017-10-22  6:13     ` Petrosyan, Ludwig
  0 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2017-10-20 15:48 UTC (permalink / raw)
  To: Ludwig Petrosyan, Deucher, Alexander,
	'linux-kernel@vger.kernel.org',
	'linux-rdma@vger.kernel.org',
	'linux-nvdimm@lists.01.org',
	'Linux-media@vger.kernel.org',
	'dri-devel@lists.freedesktop.org',
	'linux-pci@vger.kernel.org'
  Cc: Bridgman, John, Kuehling, Felix, Sagalovitch, Serguei, Blinzer,
	Paul, Koenig, Christian, Suthikulpanit, Suravee, Sander, Ben

Hi Ludwig,

P2P transactions are still *very* experimental at the moment and take a 
lot of expertise to get working in a general setup. It will definitely 
require changes to the kernel, including the drivers of all the devices 
you are trying to make talk to each other. If you're up for it you can 
take a look at:

https://github.com/sbates130272/linux-p2pmem/

Which has our current rough work making NVMe fabrics use p2p transactions.

Logan

On 10/20/2017 6:36 AM, Ludwig Petrosyan wrote:
> Dear Linux kernel group
> 
> my name is Ludwig Petrosyan I am working in DESY (Germany)
> 
> we are responsible for the control system of  all accelerators in DESY.
> 
> For a 7-8 years we have switched to MTCA.4 systems and using PCIe as a 
> central Bus.
> 
> I am mostly responsible for the Linux drivers of the AMC Cards (PCIe 
> endpoints).
> 
> The idea is start to use peer to peer transaction for PCIe endpoint (DMA 
> and/or usual Read/Write)
> 
> Could You please advise me where to start, is there some Documentation 
> how to do it.
> 
> 
> with best regards
> 
> 
> Ludwig
> 
> 
> On 11/21/2016 09:36 PM, Deucher, Alexander wrote:
>> This is certainly not the first time this has been brought up, but I'd 
>> like to try and get some consensus on the best way to move this 
>> forward.  Allowing devices to talk directly improves performance and 
>> reduces latency by avoiding the use of staging buffers in system 
>> memory.  Also in cases where both devices are behind a switch, it 
>> avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, 
>> CUDA, HSA) that deal with this are pointer based.  Ideally we'd be 
>> able to take a CPU virtual address and be able to get to a physical 
>> address taking into account IOMMUs, etc.  Having struct pages for the 
>> memory would allow it to work more generally and wouldn't require as 
>> much explicit support in drivers that wanted to use it.
>> Some use cases:
>> 1. Storage devices streaming directly to GPU device memory
>> 2. GPU device memory to GPU device memory streaming
>> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
>> 4. DVB/V4L/SDI devices streaming directly to storage devices
>> Here is a relatively simple example of how this could work for 
>> testing.  This is obviously not a complete solution.
>> - Device memory will be registered with Linux memory sub-system by 
>> created corresponding struct page structures for device memory
>> - get_user_pages_fast() will  return corresponding struct pages when 
>> CPU address points to the device memory
>> - put_page() will deal with struct pages for device memory
>> Previously proposed solutions and related proposals:
>> 1.P2P DMA
>> DMA-API/PCI map_peer_resource support for peer-to-peer 
>> (http://www.spinics.net/lists/linux-pci/msg44560.html)
>> Pros: Low impact, already largely reviewed.
>> Cons: requires explicit support in all drivers that want to support 
>> it, doesn't handle S/G in device memory.
>> 2. ZONE_DEVICE IO
>> Direct I/O and DMA for persistent memory 
>> (https://lwn.net/Articles/672457/)
>> Add support for ZONE_DEVICE IO memory with struct pages. 
>> (https://patchwork.kernel.org/patch/8583221/)
>> Pro: Doesn't waste system memory for ZONE metadata
>> Cons: CPU access to ZONE metadata slow, may be lost, corrupted on 
>> device reset.
>> 3. DMA-BUF
>> RDMA subsystem DMA-BUF support 
>> (http://www.spinics.net/lists/linux-rdma/msg38748.html)
>> Pros: uses existing dma-buf interface
>> Cons: dma-buf is handle based, requires explicit dma-buf support in 
>> drivers.
>>
>> 4. iopmem
>> iopmem : A block device for PCIe memory 
>> (https://lwn.net/Articles/703895/)
>> 5. HMM
>> Heterogeneous Memory Management 
>> (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
>>
>> 6. Some new mmap-like interface that takes a userptr and a length and 
>> returns a dma-buf and offset?
>> Alex
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-10-20 15:48   ` Logan Gunthorpe
@ 2017-10-22  6:13     ` Petrosyan, Ludwig
  2017-10-22 17:19       ` Logan Gunthorpe
  2017-10-23 16:08       ` David Laight
  0 siblings, 2 replies; 126+ messages in thread
From: Petrosyan, Ludwig @ 2017-10-22  6:13 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Deucher, Alexander, linux-kernel, linux-rdma, linux-nvdimm,
	Linux-media, dri-devel, linux-pci, Bridgman, John, Kuehling,
	Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig, Christian,
	Suthikulpanit, Suravee, Sander, Ben

Hello Logan

Thank you very much for responding.
It could be that what I have done is stupid...
But at first sight it has to be simple:
PCIe write transactions are address routed, so if the other endpoint's address is written in the packet header, the TLP has to be routed (by the PCIe switch) to that endpoint. A DMA read, seen from the endpoint, is really a series of write transactions issued by the endpoint; usually (with the Xilinx core) to start a DMA one writes the destination address into the endpoint's DMA control registers. So I have changed the device driver to set in this register the physical address of the other endpoint (the resource start of the other endpoint, which is the same address I can see in lspci -vvvv -s <bus address of the switch port> under "memory behind bridge"), so now the endpoint should start sending write TLPs with the other endpoint's address in the TLP header.
But this is not working (I want to understand why ...); I can see that the first address of the destination endpoint is changed, but with the wrong value (0xFF).
Now I want to try to prepare the DMA buffer in the driver of one endpoint, but using the physical address of the other endpoint.
It could be that it will never work, but I want to understand why and where my error is ...
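
For reference, the approach described above amounts to something like the
sketch below; the register offsets are made-up placeholders rather than the
real Xilinx core layout, and on systems with an IOMMU enabled the raw BAR
address is generally not the right value to program (see the rest of the
thread):

#include <linux/pci.h>
#include <linux/io.h>
#include <linux/bitops.h>
#include <linux/kernel.h>

#define DMA_SRC_ADDR_REG   0x00   /* hypothetical register offsets */
#define DMA_DST_ADDR_REG   0x08
#define DMA_LEN_REG        0x10
#define DMA_CTRL_REG       0x18
#define DMA_CTRL_START     BIT(0)

static void start_p2p_write(void __iomem *dma_regs,  /* BAR of endpoint A */
                            struct pci_dev *peer,    /* endpoint B */
                            u32 src_offset, u32 peer_offset, u32 len)
{
        resource_size_t dst = pci_resource_start(peer, 0) + peer_offset;

        iowrite32(src_offset, dma_regs + DMA_SRC_ADDR_REG);
        /* Only the low 32 bits are programmed here for brevity. */
        iowrite32(lower_32_bits(dst), dma_regs + DMA_DST_ADDR_REG);
        iowrite32(len, dma_regs + DMA_LEN_REG);
        iowrite32(DMA_CTRL_START, dma_regs + DMA_CTRL_REG);
}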

with best regards

Ludwig

----- Original Message -----
From: "Logan Gunthorpe" <logang@deltatee.com>
To: "Ludwig Petrosyan" <ludwig.petrosyan@desy.de>, "Deucher, Alexander" <Alexander.Deucher@amd.com>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>, "linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>, "Linux-media@vger.kernel.org" <Linux-media@vger.kernel.org>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>, "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Cc: "Bridgman, John" <John.Bridgman@amd.com>, "Kuehling, Felix" <Felix.Kuehling@amd.com>, "Sagalovitch, Serguei" <Serguei.Sagalovitch@amd.com>, "Blinzer, Paul" <Paul.Blinzer@amd.com>, "Koenig, Christian" <Christian.Koenig@amd.com>, "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>, "Sander, Ben" <ben.sander@amd.com>
Sent: Friday, 20 October, 2017 17:48:58
Subject: Re: Enabling peer to peer device transactions for PCIe devices

Hi Ludwig,

P2P transactions are still *very* experimental at the moment and take a 
lot of expertise to get working in a general setup. It will definitely 
require changes to the kernel, including the drivers of all the devices 
you are trying to make talk to eachother. If you're up for it you can 
take a look at:

https://github.com/sbates130272/linux-p2pmem/

Which has our current rough work making NVMe fabrics use p2p transactions.

Logan

On 10/20/2017 6:36 AM, Ludwig Petrosyan wrote:
> Dear Linux kernel group
> 
> my name is Ludwig Petrosyan I am working in DESY (Germany)
> 
> we are responsible for the control system of  all accelerators in DESY.
> 
> For a 7-8 years we have switched to MTCA.4 systems and using PCIe as a 
> central Bus.
> 
> I am mostly responsible for the Linux drivers of the AMC Cards (PCIe 
> endpoints).
> 
> The idea is start to use peer to peer transaction for PCIe endpoint (DMA 
> and/or usual Read/Write)
> 
> Could You please advise me where to start, is there some Documentation 
> how to do it.
> 
> 
> with best regards
> 
> 
> Ludwig
> 
> 
> On 11/21/2016 09:36 PM, Deucher, Alexander wrote:
>> This is certainly not the first time this has been brought up, but I'd 
>> like to try and get some consensus on the best way to move this 
>> forward.  Allowing devices to talk directly improves performance and 
>> reduces latency by avoiding the use of staging buffers in system 
>> memory.  Also in cases where both devices are behind a switch, it 
>> avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, 
>> CUDA, HSA) that deal with this are pointer based.  Ideally we'd be 
>> able to take a CPU virtual address and be able to get to a physical 
>> address taking into account IOMMUs, etc.  Having struct pages for the 
>> memory would allow it to work more generally and wouldn't require as 
>> much explicit support in drivers that wanted to use it.
>> Some use cases:
>> 1. Storage devices streaming directly to GPU device memory
>> 2. GPU device memory to GPU device memory streaming
>> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
>> 4. DVB/V4L/SDI devices streaming directly to storage devices
>> Here is a relatively simple example of how this could work for 
>> testing.  This is obviously not a complete solution.
>> - Device memory will be registered with Linux memory sub-system by 
>> created corresponding struct page structures for device memory
>> - get_user_pages_fast() will  return corresponding struct pages when 
>> CPU address points to the device memory
>> - put_page() will deal with struct pages for device memory
>> Previously proposed solutions and related proposals:
>> 1.P2P DMA
>> DMA-API/PCI map_peer_resource support for peer-to-peer 
>> (http://www.spinics.net/lists/linux-pci/msg44560.html)
>> Pros: Low impact, already largely reviewed.
>> Cons: requires explicit support in all drivers that want to support 
>> it, doesn't handle S/G in device memory.
>> 2. ZONE_DEVICE IO
>> Direct I/O and DMA for persistent memory 
>> (https://lwn.net/Articles/672457/)
>> Add support for ZONE_DEVICE IO memory with struct pages. 
>> (https://patchwork.kernel.org/patch/8583221/)
>> Pro: Doesn't waste system memory for ZONE metadata
>> Cons: CPU access to ZONE metadata slow, may be lost, corrupted on 
>> device reset.
>> 3. DMA-BUF
>> RDMA subsystem DMA-BUF support 
>> (http://www.spinics.net/lists/linux-rdma/msg38748.html)
>> Pros: uses existing dma-buf interface
>> Cons: dma-buf is handle based, requires explicit dma-buf support in 
>> drivers.
>>
>> 4. iopmem
>> iopmem : A block device for PCIe memory 
>> (https://lwn.net/Articles/703895/)
>> 5. HMM
>> Heterogeneous Memory Management 
>> (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
>>
>> 6. Some new mmap-like interface that takes a userptr and a length and 
>> returns a dma-buf and offset?
>> Alex
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-10-22  6:13     ` Petrosyan, Ludwig
@ 2017-10-22 17:19       ` Logan Gunthorpe
  2017-10-23 16:08       ` David Laight
  1 sibling, 0 replies; 126+ messages in thread
From: Logan Gunthorpe @ 2017-10-22 17:19 UTC (permalink / raw)
  To: Petrosyan, Ludwig
  Cc: Deucher, Alexander, linux-kernel, linux-rdma, linux-nvdimm,
	Linux-media, dri-devel, linux-pci, Bridgman, John, Kuehling,
	Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig, Christian,
	Suthikulpanit, Suravee, Sander, Ben


On 22/10/17 12:13 AM, Petrosyan, Ludwig wrote:
> But at first sight it has to be simple:
> The PCIe Write transactions are address routed, so if in the packet header the other endpoint address is written the TLP has to be routed (by PCIe Switch to the endpoint), the DMA reading from the end point is really write transactions from the endpoint, usually (Xilinx core) to start DMA one has to write to the DMA control register of the endpoint the destination address. So I have change the device driver to set in this register the physical address of the other endpoint (get_resource start called to other endpoint, and it is the same address which I could see in lspci -vvvv -s bus-address of the switch port, memories behind bridge), so now the endpoint has to start send writes TLP with the other endpoint address in the TLP header.
> But this is not working (I want to understand why ...), but I could see the first address of the destination endpoint is changed (with the wrong value 0xFF),
> now I want to try prepare in the driver of one endpoint the DMA buffer , but using physical address of the other endpoint,
> Could be it will never work, but I want to understand why, there is my error ...

Hmm, well if I understand you correctly it sounds like, in theory, it
should work. But there could be any number of reasons why it does not.
You may need to get a hold of a PCIe analyzer to figure out what's
actually going on.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* RE: Enabling peer to peer device transactions for PCIe devices
  2017-10-22  6:13     ` Petrosyan, Ludwig
  2017-10-22 17:19       ` Logan Gunthorpe
@ 2017-10-23 16:08       ` David Laight
  2017-10-23 22:04         ` Logan Gunthorpe
  1 sibling, 1 reply; 126+ messages in thread
From: David Laight @ 2017-10-23 16:08 UTC (permalink / raw)
  To: 'Petrosyan, Ludwig', Logan Gunthorpe
  Cc: Deucher, Alexander, linux-kernel, linux-rdma, linux-nvdimm,
	Linux-media, dri-devel, linux-pci, Bridgman, John, Kuehling,
	Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig, Christian,
	Suthikulpanit, Suravee, Sander, Ben

From: Petrosyan Ludwig
> Sent: 22 October 2017 07:14
> Could be I have done is stupid...
> But at first sight it has to be simple:
> The PCIe Write transactions are address routed, so if in the packet header the other endpoint address
> is written the TLP has to be routed (by PCIe Switch to the endpoint), the DMA reading from the end
> point is really write transactions from the endpoint, usually (Xilinx core) to start DMA one has to
> write to the DMA control register of the endpoint the destination address. So I have change the device
> driver to set in this register the physical address of the other endpoint (get_resource start called
> to other endpoint, and it is the same address which I could see in lspci -vvvv -s bus-address of the
> switch port, memories behind bridge), so now the endpoint has to start send writes TLP with the other
> endpoint address in the TLP header.
> But this is not working (I want to understand why ...), but I could see the first address of the
> destination endpoint is changed (with the wrong value 0xFF),
> now I want to try prepare in the driver of one endpoint the DMA buffer , but using physical address of
> the other endpoint,
> Could be it will never work, but I want to understand why, there is my error ...

It is also worth checking that the hardware actually supports p2p transfers.
Writes are more likely to be supported than reads.
ISTR that some Intel CPUs support some p2p writes, but there could easily
be errata against them.

I'd certainly test a single word write to a read/write memory location.
First verify against kernel memory, then against a 'slave' board.
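
A sketch of that first step, assuming a board-specific DMA-start helper that
is not shown here:

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* Assumed to exist in the board's driver: DMA-write 4 bytes of 'val' to
 * bus address 'dst'. */
extern void board_dma_write_word(struct pci_dev *pdev, dma_addr_t dst, u32 val);

static int single_word_to_host_test(struct pci_dev *pdev)
{
        dma_addr_t dma;
        u32 *buf = dma_alloc_coherent(&pdev->dev, PAGE_SIZE, &dma, GFP_KERNEL);
        int ret = -EIO;

        if (!buf)
                return -ENOMEM;

        buf[0] = 0;
        board_dma_write_word(pdev, dma, 0x12345678);
        /* Wait for the engine's completion interrupt or poll its status
         * before checking the result. */

        if (buf[0] == 0x12345678)
                ret = 0;

        dma_free_coherent(&pdev->dev, PAGE_SIZE, buf, dma);
        return ret;
}

Once that works, the same single-word write can be retargeted at the peer
endpoint's BAR address and verified by reading the peer's memory back.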

I don't know about Xilinx FPGAs, but we've had 'fun' getting Altera FPGAs
to do sensible PCIe cycles (I ended up writing a simple DMA controller 
that would generate long TLPs).
We also found a bug in the Altera logic that processed interleaved
read completions.

	David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-10-23 16:08       ` David Laight
@ 2017-10-23 22:04         ` Logan Gunthorpe
  2017-10-24  5:58           ` Petrosyan, Ludwig
  0 siblings, 1 reply; 126+ messages in thread
From: Logan Gunthorpe @ 2017-10-23 22:04 UTC (permalink / raw)
  To: David Laight, 'Petrosyan, Ludwig'
  Cc: Deucher, Alexander, linux-kernel, linux-rdma, linux-nvdimm,
	Linux-media, dri-devel, linux-pci, Bridgman, John, Kuehling,
	Felix, Sagalovitch, Serguei, Blinzer, Paul, Koenig, Christian,
	Suthikulpanit, Suravee, Sander, Ben



On 23/10/17 10:08 AM, David Laight wrote:
> It is also worth checking that the hardware actually supports p2p transfers.
> Writes are more likely to be supported then reads.
> ISTR that some intel cpus support some p2p writes, but there could easily
> be errata against them.

Ludwig mentioned a PCIe switch. The few switches I'm aware of support 
P2P transfers. So if everything is set up correctly, the TLPs shouldn't 
even touch the CPU.

But, yes, generally it's a good idea to start with writes and see if 
they work first.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-10-23 22:04         ` Logan Gunthorpe
@ 2017-10-24  5:58           ` Petrosyan, Ludwig
  2017-10-24 14:58             ` David Laight
  0 siblings, 1 reply; 126+ messages in thread
From: Petrosyan, Ludwig @ 2017-10-24  5:58 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: David Laight, Alexander Deucher, linux-kernel, linux-rdma,
	linux-nvdimm, Linux-media, dri-devel, linux-pci, John Bridgman,
	Felix Kuehling, Serguei Sagalovitch, Paul Blinzer,
	Christian Koenig, Suravee Suthikulpanit, Ben Sander

Yes, I agree it has to be started with write transactions; according to the PCIe standard all write transactions are address routed, and I agree with Logan:
if the other endpoint's address is written in the header of a write TLP, the TLP should not touch the CPU; the PCIe switch has to route it to the endpoint.
The idea was: in an MTCA system there is a PCIe switch on the MCH (MTCA crate hub); this switch connects the CPU to the other crate slots, so one port is an upstream port and the others are downstream ports. A DMA read, seen from the CPU, is a plain write on the endpoint side. The Xilinx DMA core has two registers, Destination Address and Source Address,
and the device driver has to set up these registers to start a DMA.
Usually, to start a DMA read from the board, the driver sets the Source address to an FPGA memory address and the Destination address to a DMA-prepared system address;
in the p2p test I set the Destination address to the physical address of the other endpoint.
In more detail:
we have a so-called PCIe universal driver; the idea behind it is:
1. all the PCIe configuration stuff (finding enabled BARs, mapping BARs, the usual read/write and common ioctls such as get slot number, get driver version ...) is implemented in the universal driver and EXPORTed;
2. if some system function changes in a new kernel we change it only in the universal part (making it easy to support a big number of drivers),
so the universal driver is something like a PCIe driver API;
3. the universal driver provides read/write functions, so we have the same device access API for any PCIe device and can use the same user application with any PCIe device.

Now, during BAR discovery and mapping the universal driver keeps each PCIe endpoint's physical address in some internal structures, and any top driver may get the physical address
of another PCIe endpoint by slot number.
In my case during get_resource the physical address is 0xB2000000; I checked lspci -H1 -vvvv -s <PCIe switch port bus address> (the endpoint connected to this port, checked by lspci -H1 -t) and the same address (0xB2000000) is the memory behind the bridge.
I want to make p2p writes to offset 0x40000, so I set the DMA destination address to 0xB2400000.
Is something wrong?

thanks for help
regards

Ludwig

----- Original Message -----
From: "Logan Gunthorpe" <logang@deltatee.com>
To: "David Laight" <David.Laight@ACULAB.COM>, "Petrosyan, Ludwig" <ludwig.petrosyan@desy.de>
Cc: "Alexander Deucher" <Alexander.Deucher@amd.com>, "linux-kernel" <linux-kernel@vger.kernel.org>, "linux-rdma" <linux-rdma@vger.kernel.org>, "linux-nvdimm" <linux-nvdimm@lists.01.org>, "Linux-media" <Linux-media@vger.kernel.org>, "dri-devel" <dri-devel@lists.freedesktop.org>, "linux-pci" <linux-pci@vger.kernel.org>, "John Bridgman" <John.Bridgman@amd.com>, "Felix Kuehling" <Felix.Kuehling@amd.com>, "Serguei Sagalovitch" <Serguei.Sagalovitch@amd.com>, "Paul Blinzer" <Paul.Blinzer@amd.com>, "Christian Koenig" <Christian.Koenig@amd.com>, "Suravee Suthikulpanit" <Suravee.Suthikulpanit@amd.com>, "Ben Sander" <ben.sander@amd.com>
Sent: Tuesday, 24 October, 2017 00:04:26
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 23/10/17 10:08 AM, David Laight wrote:
> It is also worth checking that the hardware actually supports p2p transfers.
> Writes are more likely to be supported than reads.
> ISTR that some Intel CPUs support some p2p writes, but there could easily
> be errata against them.

Ludwig mentioned a PCIe switch. The few switches I'm aware of support 
P2P transfers. So if everything is set up correctly, the TLPs shouldn't 
even touch the CPU.

But, yes, generally it's a good idea to start with writes and see if 
they work first.

Logan

^ permalink raw reply	[flat|nested] 126+ messages in thread

* RE: Enabling peer to peer device transactions for PCIe devices
  2017-10-24  5:58           ` Petrosyan, Ludwig
@ 2017-10-24 14:58             ` David Laight
  2017-10-26 13:28               ` Petrosyan, Ludwig
  0 siblings, 1 reply; 126+ messages in thread
From: David Laight @ 2017-10-24 14:58 UTC (permalink / raw)
  To: 'Petrosyan, Ludwig', Logan Gunthorpe
  Cc: Alexander Deucher, linux-kernel, linux-rdma, linux-nvdimm,
	Linux-media, dri-devel, linux-pci, John Bridgman, Felix Kuehling,
	Serguei Sagalovitch, Paul Blinzer, Christian Koenig,
	Suravee Suthikulpanit, Ben Sander

Please don't top post, write shorter lines, and add the odd blank line.
Big blocks of text are hard to read quickly.

> From: Petrosyan, Ludwig [mailto:ludwig.petrosyan@desy.de]
> Yes, I agree it has to be started with write transactions; according to the PCIe standard all write
> transactions are address routed, and I agree with Logan:
> if the endpoint address is written in the header of the write TLP, the TLP should not touch the CPU;
> the PCIe switch has to route it to the endpoint.

That depends; IIRC there is a feature (PCIe ACS) that can force switches
to send all peer-to-peer transactions upstream to the root complex.
This is there so that the host can enforce rules to stop p2p transfers.
It might be enabled on the switch you have.
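
A quick way to check for that from a driver (a minimal sketch, assuming
"port" is the switch downstream port above the target endpoint):

#include <linux/pci.h>

static bool port_redirects_p2p(struct pci_dev *port)
{
	u16 ctrl;
	int pos = pci_find_ext_capability(port, PCI_EXT_CAP_ID_ACS);

	if (!pos)
		return false;	/* no ACS capability: nothing redirected */

	pci_read_config_word(port, pos + PCI_ACS_CTRL, &ctrl);

	/* Request/Completion Redirect send peer TLPs up to the root
	 * complex instead of routing them to the other downstream port. */
	return ctrl & (PCI_ACS_RR | PCI_ACS_CR);
}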

> The idea was: in an MTCA system there is a PCIe switch on the MCH (MTCA crate hub) which connects the
> CPU to the other crate slots, so one port is upstream and the others are downstream ports. A "DMA
> read" from the CPU's point of view is an ordinary write issued by the endpoint side. The Xilinx DMA
> core has two registers, Destination Address and Source Address,
> and the device driver has to set up these registers to start a DMA.
> Usually, to start a DMA read from the board, the driver sets the Source address to an FPGA memory
> address and the Destination address to a DMA-prepared system address.
> For testing p2p I set the Destination address to the physical address of the other endpoint.

Unnecessary detail...

> In more detail:
> we have a so-called PCIe universal driver; the idea behind it is:
> 1. all the PCIe configuration work (finding the enabled BARs, mapping the BARs, the usual read/write
> and common ioctls such as get slot number, get driver version ...) is implemented in the universal
> driver and EXPORTed.
> 2. if some system function changes in a new kernel, we change it only in the universal part (making
> it easy to support a large number of drivers), so the universal driver is something like a PCIe
> driver API.
> 3. the universal driver provides read/write functions, so we have the same device access API for any
> PCIe device and can use the same user application with any PCIe device.

More crap...

> Now, while finding and mapping the BARs the universal driver keeps each PCIe endpoint's physical
> address in internal structures, and any top driver may get the physical address of another PCIe
> endpoint by slot number.
> In my case the physical address reported by get_resource is 0xB2000000. I checked with
> lspci -H1 -vvvv -s <PCIe switch port bus address> (the endpoint connected to that port, found with
> lspci -H1 -t), and the same address (0xB2000000) is the memory behind the bridge,

Overly verbose...

> I want to make p2p writes to offset 0x40000, so I set the DMA destination address to 0xB2400000.
> Is something wrong?

Possibly.

You almost certainly need the address that is written into the BAR of the
target endpoint.
This could well be different from the physical address that the cpu uses
to write to the endpoint (as well as the cpu virtual address).

lspci lies [1]; run lspci -x (or hexdump the config space through /sys/devices)
to see what is actually in the BAR.

[1] The addresses lspci normally prints come from the kernel's view of the
resources rather than from reading the BAR itself. If the endpoint resets
the BAR, lspci will still report the old addresses.
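
For example, something like this (a userspace sketch; the device path is a
placeholder for the target endpoint) reads BAR0 straight out of config space:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* placeholder BDF -- substitute the endpoint you are targeting */
	const char *path = "/sys/bus/pci/devices/0000:05:00.0/config";
	uint32_t bar0;
	int fd = open(path, O_RDONLY);

	/* BAR0 lives at config offset 0x10; a 64-bit BAR also uses 0x14. */
	if (fd < 0 || pread(fd, &bar0, sizeof(bar0), 0x10) != sizeof(bar0)) {
		perror("read BAR0");
		return 1;
	}

	/* The low 4 bits of a memory BAR are type/prefetch flags. */
	printf("BAR0 raw 0x%08x, address bits 0x%08x\n", bar0, bar0 & ~0xfu);
	close(fd);
	return 0;
}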

	David

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Enabling peer to peer device transactions for PCIe devices
  2017-10-24 14:58             ` David Laight
@ 2017-10-26 13:28               ` Petrosyan, Ludwig
  0 siblings, 0 replies; 126+ messages in thread
From: Petrosyan, Ludwig @ 2017-10-26 13:28 UTC (permalink / raw)
  To: David Laight
  Cc: Logan Gunthorpe, Alexander Deucher, linux-kernel, linux-rdma,
	linux-nvdimm, Linux-media, dri-devel, linux-pci, John Bridgman,
	Felix Kuehling, Serguei Sagalovitch, Paul Blinzer,
	Christian Koenig, Suravee Suthikulpanit, Ben Sander



----- Original Message -----
> From: "David Laight" <David.Laight@ACULAB.COM>
> To: "Petrosyan, Ludwig" <ludwig.petrosyan@desy.de>, "Logan Gunthorpe" <logang@deltatee.com>
> Cc: "Alexander Deucher" <Alexander.Deucher@amd.com>, "linux-kernel" <linux-kernel@vger.kernel.org>, "linux-rdma"
> <linux-rdma@vger.kernel.org>, "linux-nvdimm" <linux-nvdimm@lists.01.org>, "Linux-media" <Linux-media@vger.kernel.org>,
> "dri-devel" <dri-devel@lists.freedesktop.org>, "linux-pci" <linux-pci@vger.kernel.org>, "John Bridgman"
> <John.Bridgman@amd.com>, "Felix Kuehling" <Felix.Kuehling@amd.com>, "Serguei Sagalovitch"
> <Serguei.Sagalovitch@amd.com>, "Paul Blinzer" <Paul.Blinzer@amd.com>, "Christian Koenig" <Christian.Koenig@amd.com>,
> "Suravee Suthikulpanit" <Suravee.Suthikulpanit@amd.com>, "Ben Sander" <ben.sander@amd.com>
> Sent: Tuesday, 24 October, 2017 16:58:24
> Subject: RE: Enabling peer to peer device transactions for PCIe devices

> Please don't top post, write shorter lines, and add the odd blank line.
> Big blocks of text are hard to read quickly.
> 

OK, this time I am very short:
peer2peer works.

Ludwig

^ permalink raw reply	[flat|nested] 126+ messages in thread

end of thread, other threads:[~2017-10-26 13:28 UTC | newest]

Thread overview: 126+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-21 20:36 Enabling peer to peer device transactions for PCIe devices Deucher, Alexander
2016-11-22 18:11 ` Dan Williams
     [not found]   ` <75a1f44f-c495-7d1e-7e1c-17e89555edba@amd.com>
2016-11-22 20:01     ` Dan Williams
2016-11-22 20:10       ` Daniel Vetter
2016-11-22 20:24         ` Dan Williams
2016-11-22 20:35         ` Serguei Sagalovitch
2016-11-22 21:03           ` Daniel Vetter
2016-11-22 21:21             ` Dan Williams
2016-11-22 22:21               ` Sagalovitch, Serguei
2016-11-23  7:49               ` Daniel Vetter
2016-11-23  8:51                 ` Christian König
2016-11-23 19:27                   ` Serguei Sagalovitch
2016-11-23 17:03                 ` Dave Hansen
2016-11-23 17:13     ` Logan Gunthorpe
2016-11-23 17:27       ` Bart Van Assche
2016-11-23 18:40         ` Dan Williams
2016-11-23 19:12           ` Jason Gunthorpe
2016-11-23 19:24             ` Serguei Sagalovitch
2016-11-23 19:06         ` Serguei Sagalovitch
2016-11-23 19:05       ` Jason Gunthorpe
2016-11-23 19:14         ` Serguei Sagalovitch
2016-11-23 19:32           ` Jason Gunthorpe
     [not found]             ` <c2c88376-5ba7-37d1-4d3e-592383ebb00a@amd.com>
2016-11-23 20:33               ` Jason Gunthorpe
2016-11-23 21:11                 ` Logan Gunthorpe
2016-11-23 21:55                   ` Jason Gunthorpe
2016-11-23 22:42                     ` Dan Williams
2016-11-23 23:25                       ` Jason Gunthorpe
2016-11-24  9:45                         ` Christian König
2016-11-24 16:26                           ` Jason Gunthorpe
2016-11-24 17:00                             ` Serguei Sagalovitch
2016-11-24 17:55                           ` Logan Gunthorpe
2016-11-25 13:06                             ` Christian König
2016-11-25 16:45                               ` Logan Gunthorpe
2016-11-25 17:20                                 ` Serguei Sagalovitch
2016-11-25 20:26                                   ` Felix Kuehling
2016-11-25 20:48                                     ` Serguei Sagalovitch
2016-11-24  0:40                     ` Sagalovitch, Serguei
2016-11-24 16:24                       ` Jason Gunthorpe
2016-11-24  1:25                     ` Logan Gunthorpe
2016-11-24 16:42                       ` Jason Gunthorpe
2016-11-24 18:11                         ` Logan Gunthorpe
2016-11-25  7:58                           ` Christoph Hellwig
2016-11-25 19:41                             ` Jason Gunthorpe
2016-11-25 17:59                           ` Serguei Sagalovitch
2016-11-25 13:22                         ` Christian König
2016-11-25 17:16                           ` Serguei Sagalovitch
2016-11-25 19:34                             ` Jason Gunthorpe
2016-11-25 19:49                               ` Serguei Sagalovitch
2016-11-25 20:19                                 ` Jason Gunthorpe
2016-11-25 23:41                               ` Alex Deucher
2016-11-25 19:32                           ` Jason Gunthorpe
2016-11-25 20:40                             ` Christian König
2016-11-25 20:51                               ` Felix Kuehling
2016-11-25 21:18                               ` Jason Gunthorpe
2016-11-27  8:16                             ` Haggai Eran
2016-11-27 14:02                             ` Haggai Eran
2016-11-27 14:07                               ` Christian König
2016-11-28  5:31                                 ` zhoucm1
2016-11-28 14:48                               ` Serguei Sagalovitch
2016-11-28 18:36                                 ` Haggai Eran
2016-11-28 16:57                               ` Jason Gunthorpe
2016-11-28 18:19                                 ` Haggai Eran
2016-11-28 19:02                                   ` Jason Gunthorpe
2016-11-30 10:45                                     ` Haggai Eran
2016-11-30 16:23                                       ` Jason Gunthorpe
2016-11-30 17:28                                         ` Serguei Sagalovitch
2016-12-04  7:33                                           ` Haggai Eran
2016-11-30 18:01                                         ` Logan Gunthorpe
2016-12-04  7:42                                           ` Haggai Eran
2016-12-04 13:06                                             ` Stephen Bates
2016-12-04 13:23                                             ` Stephen Bates
2016-12-05 17:18                                               ` Jason Gunthorpe
2016-12-05 17:40                                                 ` Dan Williams
2016-12-05 18:02                                                   ` Jason Gunthorpe
2016-12-05 18:08                                                     ` Dan Williams
2016-12-05 18:39                                                       ` Logan Gunthorpe
2016-12-05 18:48                                                         ` Dan Williams
2016-12-05 19:14                                                           ` Jason Gunthorpe
2016-12-05 19:27                                                             ` Logan Gunthorpe
2016-12-05 19:46                                                               ` Jason Gunthorpe
2016-12-05 19:59                                                                 ` Logan Gunthorpe
2016-12-05 20:06                                                                 ` Christoph Hellwig
2016-12-06  8:06                                                           ` Stephen Bates
2016-12-06 16:38                                                             ` Jason Gunthorpe
2016-12-06 16:51                                                               ` Logan Gunthorpe
2016-12-06 17:28                                                                 ` Jason Gunthorpe
2016-12-06 21:47                                                                   ` Logan Gunthorpe
2016-12-06 22:02                                                                     ` Dan Williams
2016-12-06 17:12                                                               ` Christoph Hellwig
2016-12-04  7:53                                         ` Haggai Eran
2016-11-30 17:10                                       ` Deucher, Alexander
2016-11-28 18:20                                 ` Logan Gunthorpe
2016-11-28 19:35                                   ` Serguei Sagalovitch
2016-11-28 21:36                                     ` Logan Gunthorpe
2016-11-28 21:55                                       ` Serguei Sagalovitch
2016-11-28 22:24                                         ` Jason Gunthorpe
2017-01-05 18:39 ` Jerome Glisse
2017-01-05 19:01   ` Jason Gunthorpe
2017-01-05 19:54     ` Jerome Glisse
2017-01-05 20:07       ` Jason Gunthorpe
2017-01-05 20:19         ` Jerome Glisse
2017-01-05 22:42           ` Jason Gunthorpe
2017-01-05 23:23             ` Jerome Glisse
2017-01-06  0:30               ` Jason Gunthorpe
2017-01-06  0:41                 ` Serguei Sagalovitch
2017-01-06  1:58                 ` Jerome Glisse
2017-01-06 16:56                   ` Serguei Sagalovitch
2017-01-06 17:37                     ` Jerome Glisse
2017-01-06 18:26                       ` Jason Gunthorpe
2017-01-06 19:12                         ` Deucher, Alexander
2017-01-06 22:10                         ` Logan Gunthorpe
2017-01-12  4:54                           ` Stephen Bates
2017-01-12 15:11                             ` Jerome Glisse
2017-01-12 17:17                               ` Jason Gunthorpe
2017-01-13 13:04                               ` Christian König
2017-01-12 22:35                             ` Logan Gunthorpe
2017-01-06 15:08     ` Henrique Almeida
2017-10-20 12:36 ` Ludwig Petrosyan
2017-10-20 15:48   ` Logan Gunthorpe
2017-10-22  6:13     ` Petrosyan, Ludwig
2017-10-22 17:19       ` Logan Gunthorpe
2017-10-23 16:08       ` David Laight
2017-10-23 22:04         ` Logan Gunthorpe
2017-10-24  5:58           ` Petrosyan, Ludwig
2017-10-24 14:58             ` David Laight
2017-10-26 13:28               ` Petrosyan, Ludwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).