* [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
@ 2020-02-07 18:24 Jason Gunthorpe
  2020-02-07 19:46 ` Matthew Wilcox
  2020-02-08 13:10 ` Christian König
  0 siblings, 2 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2020-02-07 18:24 UTC
  To: lsf-pc
  Cc: linux-mm, linux-pci, linux-rdma, Christian König,
	Daniel Vetter, Logan Gunthorpe, Stephen Bates,
	Jérôme Glisse, Ira Weiny, Christoph Hellwig,
	John Hubbard, Ralph Campbell, Dan Williams, Don Dutile

Many systems can now support direct DMA between two PCI devices, for
instance between an RDMA NIC and an NVMe CMB, or an RDMA NIC and GPU
graphics memory. In many system architectures this peer-to-peer PCI-E
DMA transfer is critical to achieving performance as there is simply
not enough system memory/PCI-E bandwidth for data traffic to go
through the CPU socket.

For many years various out-of-tree solutions have existed to serve
this need. Recently some components have been accepted into mainline,
such as the p2pdma system, which allows co-operating drivers to set up
P2P DMA transfers at the PCI level. This has allowed some kernel P2P
DMA transfers related to NVMe CMB and RDMA to become supported.

A major next step is to enable P2P transfers under userspace
control. This is a very broad topic, but for this session I propose to
focus on the initial case where supporting drivers can set up a P2P
transfer from a PCI BAR page mmap'd to userspace. This is the basic
starting point for future discussions on how to adapt get_user_pages()
IO paths (e.g. O_DIRECT, net zero copy TX, RDMA, etc) to support PCI
BAR memory.

As all current drivers doing DMA from user space must go through
get_user_pages() (or its new sibling hmm_range_fault()), some
extension of the get_user_pages() API is needed to allow drivers
supporting P2P to see the pages.

get_user_pages() will require some 'struct page' and 'struct
vm_area_struct' representation of the BAR memory beyond what today's
io_remap_pfn_range()/etc produces.
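
To make that constraint concrete, here is a toy user-space model in
Python (not kernel code). It assumes a much-simplified VMA and page
structure; the flag values, the class names and toy_gup() are all
invented for illustration. It only shows the shape of the problem: a
mapping that carries PFNs but no struct page gives a get_user_pages()
style caller nothing to take a reference on.

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy flag values for illustration only; they are not the kernel's constants.
VM_IO = 0x1
VM_PFNMAP = 0x2
PAGE_SIZE = 4096


@dataclass
class Page:
    """Stand-in for 'struct page': the object a caller can hold a reference on."""
    pfn: int
    refcount: int = 0


@dataclass
class Vma:
    """Stand-in for 'struct vm_area_struct' covering npages pages from start."""
    start: int
    npages: int
    flags: int = 0
    pages: Optional[List[Page]] = None  # None models a raw PFN mapping


def toy_gup(vma: Vma, addr: int, nr_pages: int) -> List[Page]:
    """Pin nr_pages starting at addr, or fail the way a PFN-only mapping would."""
    if vma.flags & (VM_IO | VM_PFNMAP) or vma.pages is None:
        raise OSError("EFAULT: mapping has no struct page to return")
    first = (addr - vma.start) // PAGE_SIZE
    if addr < vma.start or first + nr_pages > vma.npages:
        raise OSError("EFAULT: range not covered by this VMA")
    pinned = vma.pages[first:first + nr_pages]
    for page in pinned:
        page.refcount += 1  # the reference the DMA consumer later drops
    return pinned


# Ordinary memory: every PTE is backed by a struct page.
ram = Vma(start=0x10000, npages=4, pages=[Page(pfn=100 + i) for i in range(4)])
# BAR mapped the io_remap_pfn_range() way: PFNs only, flagged as special.
bar = Vma(start=0x80000, npages=4, flags=VM_IO | VM_PFNMAP)

print([p.pfn for p in toy_gup(ram, 0x11000, 2)])  # [101, 102]
try:
    toy_gup(bar, 0x80000, 1)
except OSError as err:
    print(err)  # EFAULT: mapping has no struct page to return
```

In this model, giving the BAR mapping a list of page objects is what
lets toy_gup() succeed, which is the "some 'struct page'
representation" part of the requirement above.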

This topic has been discussed in small groups at various conferences
over the last year (Plumbers, ALPSS, LSF/MM 2019, etc). Having a
larger group together would be productive, especially as the direction
has a notable impact on the general mm.

For patch sets, we've seen a number of attempts so far, but little has
been merged yet. Common elements of past discussions have been:
 - Building struct page for BAR memory
 - Stuffing BAR memory into scatter/gather lists, bios and skbs
 - DMA mapping BAR memory
 - Referencing BAR memory without a struct page
 - Managing lifetime of BAR memory across multiple drivers

Based on past work, the people in the CC list would be recommended
participants:

 Christian König <christian.koenig@amd.com>
 Daniel Vetter <daniel.vetter@ffwll.ch>
 Logan Gunthorpe <logang@deltatee.com>
 Stephen Bates <sbates@raithlin.com>
 Jérôme Glisse <jglisse@redhat.com>
 Ira Weiny <iweiny@intel.com>
 Christoph Hellwig <hch@lst.de>
 John Hubbard <jhubbard@nvidia.com>
 Ralph Campbell <rcampbell@nvidia.com>
 Dan Williams <dan.j.williams@intel.com>
 Don Dutile <ddutile@redhat.com>

Regards,
Jason

Description of the p2pdma work:
 https://lwn.net/Articles/767281/

Discussion slot at Plumbers:
 https://linuxplumbersconf.org/event/4/contributions/369/

DRM work on DMABUF as a user facing object for P2P:
 https://www.spinics.net/lists/amd-gfx/msg32469.html



* Re: [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
  2020-02-07 18:24 [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory Jason Gunthorpe
@ 2020-02-07 19:46 ` Matthew Wilcox
  2020-02-07 20:13   ` Jason Gunthorpe
  2020-02-08 13:10 ` Christian König
  1 sibling, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2020-02-07 19:46 UTC
  To: Jason Gunthorpe
  Cc: lsf-pc, linux-mm, linux-pci, linux-rdma, Christian König,
	Daniel Vetter, Logan Gunthorpe, Stephen Bates,
	Jérôme Glisse, Ira Weiny, Christoph Hellwig,
	John Hubbard, Ralph Campbell, Dan Williams, Don Dutile,
	Thomas Hellström (VMware),
	Joao Martins

On Fri, Feb 07, 2020 at 02:24:57PM -0400, Jason Gunthorpe wrote:
> Many systems can now support direct DMA between two PCI devices, for
> instance between a RDMA NIC and a NVMe CMB, or a RDMA NIC and GPU
> graphics memory. In many system architectures this peer-to-peer PCI-E
> DMA transfer is critical to achieving performance as there is simply
> not enough system memory/PCI-E bandwidth for data traffic to go
> through the CPU socket.
> 
> For many years various out of tree solutions have existed to serve
> this need. Recently some components have been accpeted into mainline,
> such as the p2pdma system, which allows co-operating drivers to setup
> P2P DMA transfers at the PCI level. This has allowed some kernel P2P
> DMA transfers related to NVMe CMB and RDMA to become supported.
> 
> A major next step is to enable P2P transfers under userspace
> control. This is a very broad topic, but for this session I propose to
> focus on initial cases of supporting drivers can setup a P2P transfer
> from a PCI BAR page mmap'd to userspace. This is the basic starting
> point for future discussions on how to adapt get_user_pages() IO paths
> (ie O_DIRECT, net zero copy TX, RDMA, etc) to support PCI BAR memory.
> 
> As all current drivers doing DMA from user space must go through
> get_user_pages() (or its new sibling hmm_range_fault()), some
> extension of the get_user_pages() API is needed to allow drivers
> supporting P2P to see the pages.
> 
> get_user_pages() will require some 'struct page' and 'struct
> vm_area_struct' representation of the BAR memory beyond what today's
> io_remap_pfn_range()/etc produces.
> 
> This topic has been discussed in small groups in various conferences
> over the last year, (plumbers, ALPSS, LSF/MM 2019, etc). Having a
> larger group together would be productive, especially as the direction
> has a notable impact on the general mm.
> 
> For patch sets, we've seen a number of attempts so far, but little has
> been merged yet. Common elements of past discussions have been:
>  - Building struct page for BAR memory
>  - Stuffing BAR memory into scatter/gather lists, bios and skbs
>  - DMA mapping BAR memory
>  - Referencing BAR memory without a struct page
>  - Managing lifetime of BAR memory across multiple drivers
> 
> Based on past work, the people in the CC list would be recommended
> participants:
> 
>  Christian König <christian.koenig@amd.com>
>  Daniel Vetter <daniel.vetter@ffwll.ch>
>  Logan Gunthorpe <logang@deltatee.com>
>  Stephen Bates <sbates@raithlin.com>
>  Jérôme Glisse <jglisse@redhat.com>
>  Ira Weiny <iweiny@intel.com>
>  Christoph Hellwig <hch@lst.de>
>  John Hubbard <jhubbard@nvidia.com>
>  Ralph Campbell <rcampbell@nvidia.com>
>  Dan Williams <dan.j.williams@intel.com>
>  Don Dutile <ddutile@redhat.com>

That's a long list, and you're missing 

"Thomas Hellström (VMware)" <thomas_os@shipmail.org>
Joao Martins <joao.m.martins@oracle.com>

both of whom have been working on related projects (for PFNs without pages).
Hey, you missed me too!  ;-)




* Re: [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
  2020-02-07 19:46 ` Matthew Wilcox
@ 2020-02-07 20:13   ` Jason Gunthorpe
  2020-02-07 20:42     ` Matthew Wilcox
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2020-02-07 20:13 UTC
  To: Matthew Wilcox
  Cc: lsf-pc, linux-mm, linux-pci, linux-rdma, Christian König,
	Daniel Vetter, Logan Gunthorpe, Stephen Bates,
	Jérôme Glisse, Ira Weiny, Christoph Hellwig,
	John Hubbard, Ralph Campbell, Dan Williams, Don Dutile,
	Thomas Hellström (VMware),
	Joao Martins

On Fri, Feb 07, 2020 at 11:46:20AM -0800, Matthew Wilcox wrote:
> > 
> >  Christian König <christian.koenig@amd.com>
> >  Daniel Vetter <daniel.vetter@ffwll.ch>
> >  Logan Gunthorpe <logang@deltatee.com>
> >  Stephen Bates <sbates@raithlin.com>
> >  Jérôme Glisse <jglisse@redhat.com>
> >  Ira Weiny <iweiny@intel.com>
> >  Christoph Hellwig <hch@lst.de>
> >  John Hubbard <jhubbard@nvidia.com>
> >  Ralph Campbell <rcampbell@nvidia.com>
> >  Dan Williams <dan.j.williams@intel.com>
> >  Don Dutile <ddutile@redhat.com>
> 
> That's a long list, and you're missing 
> 
> "Thomas Hellström (VMware)" <thomas_os@shipmail.org>
> Joao Martins <joao.m.martins@oracle.com>

Great, thanks, I'm not really aware of what the related work is
though?

> both of whom have been working on related projects (for PFNs without pages).
> Hey, you missed me too!  ;-)

Ah I was not daring to propose a discussion on 'PFNs without pages'
again :)

The early exploratory work here has been creating ZONE_DEVICE pages as
is already done for P2P and now moving to also mmap them to userspace.

Jason



* Re: [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
  2020-02-07 20:13   ` Jason Gunthorpe
@ 2020-02-07 20:42     ` Matthew Wilcox
  2020-02-14 10:35       ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2020-02-07 20:42 UTC
  To: Jason Gunthorpe
  Cc: lsf-pc, linux-mm, linux-pci, linux-rdma, Christian König,
	Daniel Vetter, Logan Gunthorpe, Stephen Bates,
	Jérôme Glisse, Ira Weiny, Christoph Hellwig,
	John Hubbard, Ralph Campbell, Dan Williams, Don Dutile,
	Thomas Hellström (VMware),
	Joao Martins

On Fri, Feb 07, 2020 at 04:13:51PM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 07, 2020 at 11:46:20AM -0800, Matthew Wilcox wrote:
> > > 
> > >  Christian König <christian.koenig@amd.com>
> > >  Daniel Vetter <daniel.vetter@ffwll.ch>
> > >  Logan Gunthorpe <logang@deltatee.com>
> > >  Stephen Bates <sbates@raithlin.com>
> > >  Jérôme Glisse <jglisse@redhat.com>
> > >  Ira Weiny <iweiny@intel.com>
> > >  Christoph Hellwig <hch@lst.de>
> > >  John Hubbard <jhubbard@nvidia.com>
> > >  Ralph Campbell <rcampbell@nvidia.com>
> > >  Dan Williams <dan.j.williams@intel.com>
> > >  Don Dutile <ddutile@redhat.com>
> > 
> > That's a long list, and you're missing 
> > 
> > "Thomas Hellström (VMware)" <thomas_os@shipmail.org>
> > Joao Martins <joao.m.martins@oracle.com>
> 
> Great, thanks, I'm not really aware of what the related work is
> though?

Thomas has been working on huge pages for graphics BARs, so that's involved
touching 'special' (ie pageless) VMAs:
https://lore.kernel.org/linux-mm/20200205125353.2760-1-thomas_os@shipmail.org/

Joao has been working on removing the need for KVM hosts to have struct pages
that cover the memory of their guests:
https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/

> > both of whom have been working on related projects (for PFNs without pages).
> > Hey, you missed me too!  ;-)
> 
> Ah I was not daring to propose a discussion on 'PFNs without pages'
> again :)
> 
> The early exploratory work here has been creating ZONE_DEVICE pages as
> is already done for P2P and now moving to also mmap them to userspace.

Dynamically allocating struct pages interests me too ;-)



* Re: [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
  2020-02-07 18:24 [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory Jason Gunthorpe
  2020-02-07 19:46 ` Matthew Wilcox
@ 2020-02-08 13:10 ` Christian König
  2020-02-08 13:54   ` Jason Gunthorpe
  1 sibling, 1 reply; 10+ messages in thread
From: Christian König @ 2020-02-08 13:10 UTC
  To: Jason Gunthorpe, lsf-pc
  Cc: linux-mm, linux-pci, linux-rdma, Daniel Vetter, Logan Gunthorpe,
	Stephen Bates, Jérôme Glisse, Ira Weiny,
	Christoph Hellwig, John Hubbard, Ralph Campbell, Dan Williams,
	Don Dutile

On 07.02.20 at 19:24, Jason Gunthorpe wrote:
> Many systems can now support direct DMA between two PCI devices, for
> instance between a RDMA NIC and a NVMe CMB, or a RDMA NIC and GPU
> graphics memory. In many system architectures this peer-to-peer PCI-E
> DMA transfer is critical to achieving performance as there is simply
> not enough system memory/PCI-E bandwidth for data traffic to go
> through the CPU socket.
>
> For many years various out of tree solutions have existed to serve
> this need. Recently some components have been accpeted into mainline,
> such as the p2pdma system, which allows co-operating drivers to setup
> P2P DMA transfers at the PCI level. This has allowed some kernel P2P
> DMA transfers related to NVMe CMB and RDMA to become supported.
>
> A major next step is to enable P2P transfers under userspace
> control. This is a very broad topic, but for this session I propose to
> focus on initial cases of supporting drivers can setup a P2P transfer
> from a PCI BAR page mmap'd to userspace. This is the basic starting
> point for future discussions on how to adapt get_user_pages() IO paths
> (ie O_DIRECT, net zero copy TX, RDMA, etc) to support PCI BAR memory.
>
> As all current drivers doing DMA from user space must go through
> get_user_pages() (or its new sibling hmm_range_fault()), some
> extension of the get_user_pages() API is needed to allow drivers
> supporting P2P to see the pages.
>
> get_user_pages() will require some 'struct page' and 'struct
> vm_area_struct' representation of the BAR memory beyond what today's
> io_remap_pfn_range()/etc produces.
>
> This topic has been discussed in small groups in various conferences
> over the last year, (plumbers, ALPSS, LSF/MM 2019, etc). Having a
> larger group together would be productive, especially as the direction
> has a notable impact on the general mm.
>
> For patch sets, we've seen a number of attempts so far, but little has
> been merged yet. Common elements of past discussions have been:
>   - Building struct page for BAR memory
>   - Stuffing BAR memory into scatter/gather lists, bios and skbs
>   - DMA mapping BAR memory
>   - Referencing BAR memory without a struct page
>   - Managing lifetime of BAR memory across multiple drivers

I can only repeat what Jérôme said: this most likely will never work
correctly with get_user_pages().

One of the main issues is that if you want to cover all use cases you
also need to take into account P2P operations which are hidden from the CPU.

E.g. you have memory which is not even CPU addressable, but can be
shared between GPUs using XGMI, NVLink, SLI etc.

Since you can't get a struct page for something the CPU can't even have
an address for, the whole idea of using get_user_pages() fails from the
very beginning.

That's also the reason why for GPUs we opted to use DMA-buf based
sharing of buffers between drivers instead.

So we need to figure out how to express DMA addresses outside of the CPU
address space first before we can even think about something like
extending get_user_pages() for P2P in an HMM scenario.

Regards,
Christian.

>
> Based on past work, the people in the CC list would be recommended
> participants:
>
>   Christian König <christian.koenig@amd.com>
>   Daniel Vetter <daniel.vetter@ffwll.ch>
>   Logan Gunthorpe <logang@deltatee.com>
>   Stephen Bates <sbates@raithlin.com>
>   Jérôme Glisse <jglisse@redhat.com>
>   Ira Weiny <iweiny@intel.com>
>   Christoph Hellwig <hch@lst.de>
>   John Hubbard <jhubbard@nvidia.com>
>   Ralph Campbell <rcampbell@nvidia.com>
>   Dan Williams <dan.j.williams@intel.com>
>   Don Dutile <ddutile@redhat.com>
>
> Regards,
> Jason
>
> Description of the p2pdma work:
>   https://lwn.net/Articles/767281/
>
> Discussion slot at Plumbers:
>   https://linuxplumbersconf.org/event/4/contributions/369/
>
> DRM work on DMABUF as a user facing object for P2P:
>   https://www.spinics.net/lists/amd-gfx/msg32469.html




* Re: [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
  2020-02-08 13:10 ` Christian König
@ 2020-02-08 13:54   ` Jason Gunthorpe
  2020-02-08 16:38     ` Christian König
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2020-02-08 13:54 UTC
  To: Christian König
  Cc: lsf-pc, linux-mm, linux-pci, linux-rdma, Daniel Vetter,
	Logan Gunthorpe, Stephen Bates, Jérôme Glisse,
	Ira Weiny, Christoph Hellwig, John Hubbard, Ralph Campbell,
	Dan Williams, Don Dutile

On Sat, Feb 08, 2020 at 02:10:59PM +0100, Christian König wrote:
> > For patch sets, we've seen a number of attempts so far, but little has
> > been merged yet. Common elements of past discussions have been:
> >   - Building struct page for BAR memory
> >   - Stuffing BAR memory into scatter/gather lists, bios and skbs
> >   - DMA mapping BAR memory
> >   - Referencing BAR memory without a struct page
> >   - Managing lifetime of BAR memory across multiple drivers
> 
> I can only repeat Jérôme that this most likely will never work correctly
> with get_user_pages().

I suppose I'm using 'get_user_pages()' as something of a placeholder
here to refer to the existing family of kernel DMA consumers that call
get_user_pages to work on VMA backed process visible memory.

We have to have something like get_user_pages() because the kernel
call-sites are fundamentally only dealing with userspace VA. That is
how their uAPIs are designed, and we want to keep them working.

So, if something doesn't fit into get_user_pages(), ie because it
doesn't have a VMA in the first place, then that is some other
discussion. DMA buf seems like a pretty good answer.

> E.g. you have memory which is not even CPU addressable, but can be shared
> between GPUs using XGMI, NVLink, SLI etc....

For this kind of memory if it is mapped into a VMA with
DEVICE_PRIVATE, as Jerome has imagined, then it would be part of this
discussion.

> So we need to figure out how express DMA addresses outside of the CPU
> address space first before we can even think about something like extending
> get_user_pages() for P2P in an HMM scenario.

Why? This discussion is not exclusively for GPU. We have many use
cases that do not have CPU invisible memory to worry about, and I
don't think defining how DMA mapping works for cpu-invisible
interconnect overlaps with figuring out how to make get_user_pages
work with existing ZONE_DEVICE memory types.

ie the challenge here is how to deliver the required information to
the p2pdma subsystem so a get_user_pages() call site can do a DMA map.
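
As a rough illustration of "the required information", the following
toy Python model (user space, not kernel code) assumes each BAR page
points back to a per-BAR pagemap naming its provider device. Every
class, field and function name is invented, and the "same switch"
rule is a deliberately simplified stand-in for whatever policy the
p2pdma subsystem actually applies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PciDev:
    name: str
    upstream_switch: Optional[str]  # None: attached directly to the root complex


@dataclass(frozen=True)
class Pagemap:
    """Per-BAR metadata that a ZONE_DEVICE-style page points back to."""
    provider: PciDev
    cpu_base: int   # where the CPU sees the BAR
    bus_base: int   # where PCI peers see the same BAR


@dataclass(frozen=True)
class Page:
    phys: int
    pgmap: Optional[Pagemap] = None  # None for ordinary system RAM


def dma_map_page(client: PciDev, page: Page) -> int:
    """Return the address 'client' should program into its DMA engine."""
    if page.pgmap is None:
        return page.phys  # toy assumption: no IOMMU, RAM is identity mapped
    provider = page.pgmap.provider
    same_switch = (client.upstream_switch is not None
                   and client.upstream_switch == provider.upstream_switch)
    if not same_switch:
        raise PermissionError("no known-good P2P path between "
                              f"{client.name} and {provider.name}")
    # Peers address the BAR by its bus address, not the CPU physical address.
    return page.phys - page.pgmap.cpu_base + page.pgmap.bus_base


nvme = PciDev("nvme0", "sw0")
nic = PciDev("nic0", "sw0")
far_nic = PciDev("nic1", None)
cmb = Pagemap(nvme, cpu_base=0x9000_0000, bus_base=0xC000_0000)
page = Page(phys=0x9000_2000, pgmap=cmb)

print(hex(dma_map_page(nic, page)))  # 0xc0002000
try:
    dma_map_page(far_nic, page)
except PermissionError as err:
    print(err)
```

The point of the sketch is that the call site only ever holds a page;
everything else (provider, topology, bus address) has to be reachable
from it.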

Improving the p2pdma subsystem to handle more complex cases like CPU
invisible memory and interconnect is a different topic, I think :)

Regards,
Jason



* Re: [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
  2020-02-08 13:54   ` Jason Gunthorpe
@ 2020-02-08 16:38     ` Christian König
  2020-02-08 17:43       ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Christian König @ 2020-02-08 16:38 UTC
  To: Jason Gunthorpe
  Cc: lsf-pc, linux-mm, linux-pci, linux-rdma, Daniel Vetter,
	Logan Gunthorpe, Stephen Bates, Jérôme Glisse,
	Ira Weiny, Christoph Hellwig, John Hubbard, Ralph Campbell,
	Dan Williams, Don Dutile

On 08.02.20 at 14:54, Jason Gunthorpe wrote:
> On Sat, Feb 08, 2020 at 02:10:59PM +0100, Christian König wrote:
>>> For patch sets, we've seen a number of attempts so far, but little has
>>> been merged yet. Common elements of past discussions have been:
>>>    - Building struct page for BAR memory
>>>    - Stuffing BAR memory into scatter/gather lists, bios and skbs
>>>    - DMA mapping BAR memory
>>>    - Referencing BAR memory without a struct page
>>>    - Managing lifetime of BAR memory across multiple drivers
>> I can only repeat Jérôme that this most likely will never work correctly
>> with get_user_pages().
> I suppose I'm using 'get_user_pages()' as something of a placeholder
> here to refer to the existing family of kernel DMA consumers that call
> get_user_pages to work on VMA backed process visible memory.
>
> We have to have something like get_user_pages() because the kernel
> call-sites are fundamentally only dealing with userspace VA. That is
> how their uAPIs are designed, and we want to keep them working.
>
> So, if something doesn't fit into get_user_pages(), ie because it
> doesn't have a VMA in the first place, then that is some other
> discussion. DMA buf seems like a pretty good answer.

Well we do have a VMA, but I strongly think that get_user_pages() is the 
wrong approach for the job.

What we should do instead is to grab the VMA for the addresses and then
say through the vm_operations_struct: "Hello I'm driver X and want to do
P2P with you. Who are you? What are your capabilities? Should we use
PCIe or shortcut through some other interconnect? etc. etc.".
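
A minimal sketch of what such a conversation could look like, written
as a toy Python model rather than kernel code. The op name
'p2p_query', the capability record and the preference order are all
assumptions made up for illustration; nothing like this exists in
vm_operations_struct today.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class P2PCaps:
    """Hypothetical answer an exporting driver gives to 'who are you?'."""
    driver: str
    interconnects: FrozenSet[str]   # e.g. {"pcie", "xgmi"}
    cpu_addressable: bool


@dataclass
class Vma:
    start: int
    end: int
    # Stand-in for vm_operations_struct; 'p2p_query' is an invented op name.
    ops: Dict[str, Callable[[], P2PCaps]] = field(default_factory=dict)


# Preference order is an assumption: private links first, PCIe as fallback.
PREFERENCE = ("xgmi", "nvlink", "pcie")


def negotiate(vma: Vma, importer_links: FrozenSet[str]) -> Optional[str]:
    """Return the interconnect both sides support, or None if P2P is impossible."""
    query = vma.ops.get("p2p_query")
    if query is None:
        return None  # ordinary VMA: the owner does not speak P2P
    common = query().interconnects & importer_links
    for link in PREFERENCE:
        if link in common:
            return link
    return None


gpu_vma = Vma(0x1000, 0x2000, ops={
    "p2p_query": lambda: P2PCaps("gpu-a", frozenset({"pcie", "xgmi"}), False),
})

print(negotiate(gpu_vma, frozenset({"pcie"})))              # pcie
print(negotiate(gpu_vma, frozenset({"xgmi", "pcie"})))      # xgmi
print(negotiate(Vma(0x3000, 0x4000), frozenset({"pcie"})))  # None
```

Note that nothing here involves a struct page or a CPU address, which
is what would let it cover memory the CPU cannot see.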

>> E.g. you have memory which is not even CPU addressable, but can be shared
>> between GPUs using XGMI, NVLink, SLI etc....
> For this kind of memory if it is mapped into a VMA with
> DEVICE_PRIVATE, as Jerome has imagined, then it would be part of this
> discussion.

I think what Jerome had in mind with his P2P ideas around HMM was that
we could do this with anonymous memory which was migrated to a GPU
device. That turned out to be rather complicated because you would need
to be able to figure out which driver you need to talk to for the
migrated address, which in turn wasn't related to the VMA in any way.

What you have here is probably a rather different use case since the
whole VMA belongs to a driver. That makes things quite a bit easier
to handle.

>> So we need to figure out how express DMA addresses outside of the CPU
>> address space first before we can even think about something like extending
>> get_user_pages() for P2P in an HMM scenario.
> Why?

Because that's how get_user_pages() works. IIRC you call it with a
userspace address+length and get filled arrays of struct pages and VMAs
in return.

When you don't have CPU addresses for your memory the whole idea of that
interface falls apart. So I think we need to get away from
get_user_pages() and work at a higher level here.

> This is discussion is not exclusively for GPU. We have many use
> cases that do not have CPU invisible memory to worry about, and I
> don't think defining how DMA mapping works for cpu-invisible
> interconnect overlaps with figuring out how to make get_user_pages
> work with existing ZONE_DEVICE memory types.
>
> ie the challenge here is how to deliver the required information to
> the p2pdma subsystem so a get_user_pages() call site can do a DMA map.
>
> Improving the p2pdma subsystem to handle more complex cases like CPU
> invisible memory and interconnect is a different topic, I think :)

Well you can of course ignore those, but P2P over PCIe is actually only 
a rather specific use case and I would say when we start to tackle this 
we should come up with something that works in all areas.

Regards,
Christian.

>
> Regards,
> Jason




* Re: [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
  2020-02-08 16:38     ` Christian König
@ 2020-02-08 17:43       ` Jason Gunthorpe
  2020-02-10 18:39         ` Logan Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2020-02-08 17:43 UTC
  To: Christian König
  Cc: lsf-pc, linux-mm, linux-pci, linux-rdma, Daniel Vetter,
	Logan Gunthorpe, Stephen Bates, Jérôme Glisse,
	Ira Weiny, Christoph Hellwig, John Hubbard, Ralph Campbell,
	Dan Williams, Don Dutile

On Sat, Feb 08, 2020 at 05:38:51PM +0100, Christian König wrote:
> On 08.02.20 at 14:54, Jason Gunthorpe wrote:
> > On Sat, Feb 08, 2020 at 02:10:59PM +0100, Christian König wrote:
> > > > For patch sets, we've seen a number of attempts so far, but little has
> > > > been merged yet. Common elements of past discussions have been:
> > > >    - Building struct page for BAR memory
> > > >    - Stuffing BAR memory into scatter/gather lists, bios and skbs
> > > >    - DMA mapping BAR memory
> > > >    - Referencing BAR memory without a struct page
> > > >    - Managing lifetime of BAR memory across multiple drivers
> > > I can only repeat Jérôme that this most likely will never work correctly
> > > with get_user_pages().
> > I suppose I'm using 'get_user_pages()' as something of a placeholder
> > here to refer to the existing family of kernel DMA consumers that call
> > get_user_pages to work on VMA backed process visible memory.
> > 
> > We have to have something like get_user_pages() because the kernel
> > call-sites are fundamentally only dealing with userspace VA. That is
> > how their uAPIs are designed, and we want to keep them working.
> > 
> > So, if something doesn't fit into get_user_pages(), ie because it
> > doesn't have a VMA in the first place, then that is some other
> > discussion. DMA buf seems like a pretty good answer.
> 
> Well we do have a VMA, but I strongly think that get_user_pages() is the
> wrong approach for the job.
> 
> What we should do instead is to grab the VMA for the addresses and then say
> through the vm_operations_struct: "Hello I'm driver X and want to do P2P
> with you. Who are you? What are your capabilities? Should we use PCIe or
> shortcut through some other interconnect? etc etc ect...".

This is a very topical discussion. So far all the non-struct page
approaches have fallen down in some way or another.

The big problem with a VMA centric scheme is that the VMA is ephemeral
relative to the DMA mapping, so when it comes time to unmap it is not
so clear what to do to 'put' the reference. There has also been
resistance to adding new ops to a VMA.

For instance a 'get dma buf' VMA op would solve the lifetime problems,
but significantly complicates most of the existing get_user_pages()
users as they now have to track lists of dma buf pointers so they can
de-ref the dma bufs that covered the user VA range during 'get'.

FWIW, if the outcome of the discussion was to have some 'get dma buf'
VMA op that would probably be reasonable. I've talked about this
before with various people, it isn't quite as good as struct pages,
but some subsystems like RDMA can probably make it work.
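
The bookkeeping burden can be sketched with a toy Python model (user
space, not kernel code). It assumes a hypothetical get_dma_buf() VMA
op and a simple refcounted buffer; all names are invented. The caller
takes one reference per VMA overlapping the pinned range and must
keep its own list, because the VMAs may be gone by the time it
unmaps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DmaBuf:
    """Stand-in for a refcounted dma-buf; the exporter frees memory at zero."""
    name: str
    refcount: int = 1  # the exporter's own reference

    def get(self) -> "DmaBuf":
        self.refcount += 1
        return self

    def put(self) -> None:
        assert self.refcount > 0
        self.refcount -= 1


@dataclass
class Vma:
    start: int
    end: int
    buf: DmaBuf

    def get_dma_buf(self) -> DmaBuf:  # the hypothetical 'get dma buf' VMA op
        return self.buf.get()


def pin_range(vmas: List[Vma], start: int, end: int) -> List[DmaBuf]:
    """Take one dma-buf reference per VMA overlapping [start, end)."""
    return [v.get_dma_buf() for v in vmas if v.start < end and start < v.end]


def unpin(held: List[DmaBuf]) -> None:
    """Drop the references using only the caller's list, never the VMAs."""
    for buf in held:
        buf.put()
    held.clear()


a, b = DmaBuf("bar0"), DmaBuf("bar2")
mm = [Vma(0x1000, 0x3000, a), Vma(0x3000, 0x5000, b)]
held = pin_range(mm, 0x2000, 0x4000)  # the range spans both VMAs
mm.clear()                            # munmap(): the VMAs are gone mid-DMA
print(a.refcount, b.refcount)         # 2 2
unpin(held)
print(a.refcount, b.refcount)         # 1 1
```

With struct pages the equivalent of 'held' is the page array the
caller already has, which is why this variant costs existing users
extra state.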

> > > E.g. you have memory which is not even CPU addressable, but can be shared
> > > between GPUs using XGMI, NVLink, SLI etc....
> > For this kind of memory if it is mapped into a VMA with
> > DEVICE_PRIVATE, as Jerome has imagined, then it would be part of this
> > discussion.
> 
> I think what Jerome had in mind with its P2P ideas around HMM was that we
> could do this with anonymous memory which was migrated to a GPU device. That
> turned out to be rather complicated because you would need to be able to
> figure out to which driver you need to talk to for the migrated address,
> which in turn wasn't related to the VMA in any way.

Jerome's VMA proposal explicitly tied the lifetime of the VMA to the
lifetime of the DMA map by forcing the use of 'shared virtual memory'
(ie mmu notifiers, etc) techniques which have a very narrow usability
with HW. This is how the lifetime problem was solved in those patches.

This path has huge drawbacks for everything that is not a GPU use
case. Ie we can't fit it into virtio to solve its current P2P DMA
problem.

> > > So we need to figure out how express DMA addresses outside of the CPU
> > > address space first before we can even think about something like extending
> > > get_user_pages() for P2P in an HMM scenario.
> > Why?
> 
> Because that's how get_user_pages() works. IIRC you call it with userspace
> address+length and get a filled struct pages and VMAs array in return.
> 
> When you don't have CPU addresses for you memory the whole idea of that
> interface falls apart. So I think we need to get away from get_user_pages()
> and work more high level here.

get_user_pages() is struct page focused, and there is some general
expectation that GPUs will have to create DEVICE_PRIVATE struct pages
for their entire hidden memory so that they can do all the HMM tricks
with anonymous memory. They also have to recognize the DEVICE_PRIVATE
pages during hmm driven page faults.

Removing the DEVICE_PRIVATE from the anonymous page setup seems
impossible at the current moment - thus it seems like we are stuck
with struct pages, may as well use them?

Literally nobody likes this, but all the non-struct-page proposals have
failed to get traction so far.

> > Improving the p2pdma subsystem to handle more complex cases like CPU
> > invisible memory and interconnect is a different topic, I think :)
> 
> Well you can of course ignore those, but P2P over PCIe is actually only a
> rather specific use case and I would say when we start to tackle this we
> should come up with something that works in all areas.

Well, it is the general 'standard based' problem.

Frankly, I don't think the general kernel community can tackle the
undocumented proprietary interconnect problem, as nobody really knows
what these things are. The people building these things need to lead
that forward somehow.

Today, the closest we have is the DEVICE_PRIVATE 'loopback' that
things like nouveau attempt to implement. They are supposed to be
able to translate the DEVICE_PRIVATE pages into some 'device internal'
address. Presumably that can reach through the device internal
interconnect, but I don't know if nouveau has gone that far.

Jason



* Re: [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
  2020-02-08 17:43       ` Jason Gunthorpe
@ 2020-02-10 18:39         ` Logan Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Logan Gunthorpe @ 2020-02-10 18:39 UTC
  To: Jason Gunthorpe, Christian König
  Cc: lsf-pc, linux-mm, linux-pci, linux-rdma, Daniel Vetter,
	Stephen Bates, Jérôme Glisse, Ira Weiny,
	Christoph Hellwig, John Hubbard, Ralph Campbell, Dan Williams,
	Don Dutile



On 2020-02-08 10:43 a.m., Jason Gunthorpe wrote:
> On Sat, Feb 08, 2020 at 05:38:51PM +0100, Christian König wrote:
>> On 08.02.20 at 14:54, Jason Gunthorpe wrote:
>>> On Sat, Feb 08, 2020 at 02:10:59PM +0100, Christian König wrote:
>>>>> For patch sets, we've seen a number of attempts so far, but little has
>>>>> been merged yet. Common elements of past discussions have been:
>>>>>    - Building struct page for BAR memory
>>>>>    - Stuffing BAR memory into scatter/gather lists, bios and skbs
>>>>>    - DMA mapping BAR memory
>>>>>    - Referencing BAR memory without a struct page
>>>>>    - Managing lifetime of BAR memory across multiple drivers
>>>> I can only repeat Jérôme that this most likely will never work correctly
>>>> with get_user_pages().
>>> I suppose I'm using 'get_user_pages()' as something of a placeholder
>>> here to refer to the existing family of kernel DMA consumers that call
>>> get_user_pages to work on VMA backed process visible memory.
>>>
>>> We have to have something like get_user_pages() because the kernel
>>> call-sites are fundamentally only dealing with userspace VA. That is
>>> how their uAPIs are designed, and we want to keep them working.
>>>
>>> So, if something doesn't fit into get_user_pages(), ie because it
>>> doesn't have a VMA in the first place, then that is some other
>>> discussion. DMA buf seems like a pretty good answer.
>>
>> Well we do have a VMA, but I strongly think that get_user_pages() is the
>> wrong approach for the job.
>>
>> What we should do instead is to grab the VMA for the addresses and then say
>> through the vm_operations_struct: "Hello I'm driver X and want to do P2P
>> with you. Who are you? What are your capabilities? Should we use PCIe or
>> shortcut through some other interconnect? etc etc etc...".
> 
> This is a very topical discussion. So far all the non-struct page
> approaches have fallen down in some way or another.
> 
> The big problem with a VMA centric scheme is that the VMA is ephemeral
> relative to the DMA mapping, so when it comes time to unmap it is not
> so clear what to do to 'put' the reference. There has also been
> resistance to adding new ops to a VMA.
> 
> For instance a 'get dma buf' VMA op would solve the lifetime problems,
> but significantly complicates most of the existing get_user_pages()
> users as they now have to track lists of dma buf pointers so they can
> de-ref the dma bufs that covered the user VA range during 'get'.
> 
> FWIW, if the outcome of the discussion was to have some 'get dma buf'
> VMA op that would probably be reasonable. I've talked about this
> before with various people, it isn't quite as good as struct pages,
> but some subsystems like RDMA can probably make it work.
> 
>>>> E.g. you have memory which is not even CPU addressable, but can be shared
>>>> between GPUs using XGMI, NVLink, SLI etc....
>>> For this kind of memory if it is mapped into a VMA with
>>> DEVICE_PRIVATE, as Jerome has imagined, then it would be part of this
>>> discussion.
>>
>> I think what Jerome had in mind with his P2P ideas around HMM was that we
>> could do this with anonymous memory which was migrated to a GPU device. That
>> turned out to be rather complicated because you would need to be able to
>> figure out which driver you need to talk to for the migrated address,
>> which in turn wasn't related to the VMA in any way.
> 
> Jerome's VMA proposal tied explicitly the lifetime of the VMA to the
> lifetime of the DMA map by forcing the use of 'shared virtual memory'
> (ie mmu notifiers, etc) techniques which have a very narrow usability
> with HW. This is how the lifetime problem was solved in those patches.
> 
> This path has huge drawbacks for everything that is not a GPU use
> case. Ie we can't fit it into virtio to solve its current P2P DMA
> problem.
> 
>>>> So we need to figure out how to express DMA addresses outside of the CPU
>>>> address space first before we can even think about something like extending
>>>> get_user_pages() for P2P in an HMM scenario.
>>> Why?
>>
>> Because that's how get_user_pages() works. IIRC you call it with userspace
>> address+length and get filled struct page and VMA arrays in return.
>>
>> When you don't have CPU addresses for your memory the whole idea of that
>> interface falls apart. So I think we need to get away from get_user_pages()
>> and work more high level here.
> 
> get_user_pages() is struct page focused, and there is some general
> expectation that GPUs will have to create DEVICE_PRIVATE struct pages
> for their entire hidden memory so that they can do all the HMM tricks
> with anonymous memory. They also have to recognize the DEVICE_PRIVATE
> pages during hmm driven page faults.
> 
> Removing the DEVICE_PRIVATE from the anonymous page setup seems
> impossible at the current moment - thus it seems like we are stuck
> with struct pages, may as well use them?
> 
> Literally nobody likes this, but all the non-struct-page proposals have
> failed to get traction so far.

Yes, I agree with Jason. We need to be able to make incremental progress
on things that we can do today. We can't be stuck making no progress
because every time something is proposed someone pops up and says it
won't work for some significantly more complicated use case. Supporting
existing P2P DMA functionality in userspace is a use case that people
care about, and we shouldn't be that far away from supporting it with the
existing struct page infrastructure we have today.

Supporting buses the CPU has no visibility into is a separate discussion
for after we've made more progress on the easier cases. Or, until more
cleanup has gone into making struct page more replaceable with something
else.

Logan

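The bookkeeping cost Jason describes for a 'get dma buf' VMA op (each
get_user_pages()-style caller tracking a list of buffer references so it
can drop them later) can be illustrated with a small userspace model. This
is only a sketch under stated assumptions: toy_buf, toy_vma, toy_pin_list,
toy_get_range(), toy_put_range() and toy_demo() are hypothetical names
invented for illustration. None of them are kernel APIs, and nothing here
reflects an actual proposed patch. It does not check that the requested
range is fully covered by VMAs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dma-buf: just a refcounted object. */
struct toy_buf {
	int refcount;
};

/* Hypothetical stand-in for a VMA: VA range [start, end) backed by one buf. */
struct toy_vma {
	unsigned long start, end;
	struct toy_buf *buf;
};

#define TOY_MAX_PINS 8

/* What a DMA consumer must keep between 'get' and 'put'. */
struct toy_pin_list {
	struct toy_buf *bufs[TOY_MAX_PINS];
	size_t count;
};

/* 'put': drop every reference recorded at 'get' time; no VMA is consulted. */
static void toy_put_range(struct toy_pin_list *pins)
{
	for (size_t i = 0; i < pins->count; i++) {
		assert(pins->bufs[i]->refcount > 0);
		pins->bufs[i]->refcount--;
	}
	pins->count = 0;
}

/*
 * 'get': take a reference on the buffer behind each VMA overlapping
 * [addr, addr + len) and remember it. Returns 0, or -1 if the list is full.
 */
static int toy_get_range(const struct toy_vma *vmas, size_t nr_vmas,
			 unsigned long addr, unsigned long len,
			 struct toy_pin_list *pins)
{
	pins->count = 0;
	for (size_t i = 0; i < nr_vmas; i++) {
		if (vmas[i].end <= addr || vmas[i].start >= addr + len)
			continue;
		if (pins->count == TOY_MAX_PINS) {
			toy_put_range(pins);
			return -1;
		}
		vmas[i].buf->refcount++;
		pins->bufs[pins->count++] = vmas[i].buf;
	}
	return 0;
}

/* Returns 0 if every step behaves as expected, else the failing step. */
int toy_demo(void)
{
	struct toy_buf a = { 1 }, b = { 1 }, c = { 1 }; /* 1 = mapping's ref */
	struct toy_vma vmas[] = {
		{ 0x1000, 0x2000, &a },
		{ 0x2000, 0x3000, &b },
		{ 0x5000, 0x6000, &c },
	};
	struct toy_pin_list pins;

	/* [0x1800, 0x2800) overlaps the first two VMAs only. */
	if (toy_get_range(vmas, 3, 0x1800, 0x1000, &pins) != 0)
		return 1;
	if (pins.count != 2 || a.refcount != 2 || b.refcount != 2 ||
	    c.refcount != 1)
		return 2;

	/* Model munmap(): the VMAs vanish and the mapping drops its refs. */
	memset(vmas, 0, sizeof(vmas));
	a.refcount--; b.refcount--; c.refcount--;
	if (a.refcount != 1 || b.refcount != 1 || c.refcount != 0)
		return 3;

	/* The pinned buffers survive until the consumer puts them. */
	toy_put_range(&pins);
	if (a.refcount != 0 || b.refcount != 0 || pins.count != 0)
		return 4;
	return 0;
}
```

In this model the pin list is the only thing that survives the simulated
munmap(), which is why 'put' can still find what to release once the VMA
is gone. The price is that every caller has to carry such a list.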


* Re: [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory
  2020-02-07 20:42     ` Matthew Wilcox
@ 2020-02-14 10:35       ` Michal Hocko
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2020-02-14 10:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, lsf-pc, linux-mm, linux-pci, linux-rdma,
	Christian König, Daniel Vetter, Logan Gunthorpe,
	Stephen Bates, Jérôme Glisse, Ira Weiny,
	Christoph Hellwig, John Hubbard, Ralph Campbell, Dan Williams,
	Don Dutile, Thomas Hellström (VMware),
	Joao Martins

On Fri 07-02-20 12:42:01, Matthew Wilcox wrote:
> On Fri, Feb 07, 2020 at 04:13:51PM -0400, Jason Gunthorpe wrote:
> > On Fri, Feb 07, 2020 at 11:46:20AM -0800, Matthew Wilcox wrote:
> > > > 
> > > >  Christian König <christian.koenig@amd.com>
> > > >  Daniel Vetter <daniel.vetter@ffwll.ch>
> > > >  Logan Gunthorpe <logang@deltatee.com>
> > > >  Stephen Bates <sbates@raithlin.com>
> > > >  Jérôme Glisse <jglisse@redhat.com>
> > > >  Ira Weiny <iweiny@intel.com>
> > > >  Christoph Hellwig <hch@lst.de>
> > > >  John Hubbard <jhubbard@nvidia.com>
> > > >  Ralph Campbell <rcampbell@nvidia.com>
> > > >  Dan Williams <dan.j.williams@intel.com>
> > > >  Don Dutile <ddutile@redhat.com>
> > > 
> > > That's a long list, and you're missing 
> > > 
> > > "Thomas Hellström (VMware)" <thomas_os@shipmail.org>
> > > Joao Martins <joao.m.martins@oracle.com>
> > 
> > Great, thanks, I'm not really aware of what the related work is
> > though?
> 
> Thomas has been working on huge pages for graphics BARs, so that's involved
> touching 'special' (ie pageless) VMAs:
> https://lore.kernel.org/linux-mm/20200205125353.2760-1-thomas_os@shipmail.org/
> 
> Joao has been working on removing the need for KVM hosts to have struct pages
> that cover the memory of their guests:
> https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/

I do not see those people requesting attendance. Please note that the
deadline is approaching. Hint hint...

-- 
Michal Hocko
SUSE Labs



Thread overview: 10+ messages
2020-02-07 18:24 [LSF/MM TOPIC] get_user_pages() for PCI BAR Memory Jason Gunthorpe
2020-02-07 19:46 ` Matthew Wilcox
2020-02-07 20:13   ` Jason Gunthorpe
2020-02-07 20:42     ` Matthew Wilcox
2020-02-14 10:35       ` Michal Hocko
2020-02-08 13:10 ` Christian König
2020-02-08 13:54   ` Jason Gunthorpe
2020-02-08 16:38     ` Christian König
2020-02-08 17:43       ` Jason Gunthorpe
2020-02-10 18:39         ` Logan Gunthorpe
