* [LSF/MM/BPF proposal]: Physr discussion
@ 2023-01-21 15:03 ` Jason Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2023-01-21 15:03 UTC (permalink / raw)
  To: lsf-pc, linux-mm, iommu, linux-rdma
  Cc: Matthew Wilcox, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

I would like to have a session at LSF to talk about Matthew's
physr discussion starter:

 https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/

I have become interested in this with some immediacy because of
IOMMUFD and this other discussion with Christoph:

 https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
    
Which results in, more or less, we have no way to do P2P DMA
operations without struct page - and from the RDMA side solving this
well at the DMA API means advancing at least some part of the physr
idea.

So - my objective is to enable the DMA API to "DMA map" something that
is not a scatterlist, may or may not contain struct pages, but can
still contain P2P DMA data. From there I would move RDMA MR's to use
this new API, modify DMABUF to export it, complete the above VFIO
series, and finally, use all of this to add back P2P support to VFIO
when working with IOMMUFD by allowing IOMMUFD to obtain a safe
reference to the VFIO memory using DMABUF. From there we'd want to see
pin_user_pages optimized, and that also will need some discussion how
best to structure it.
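
As a purely illustrative sketch (none of these names or fields are from
an actual patch), the basic unit such an API would consume could be as
simple as an address range plus enough side information for the P2P
checks, with no requirement that a struct page exists behind it:

struct phys_range {
	phys_addr_t	start;		/* CPU physical address */
	size_t		len;		/* length in bytes */
	struct device	*p2p_provider;	/* device exposing the MMIO for
					 * P2P ranges, NULL for plain RAM */
};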

I also have several ideas on how something like physr can optimize the
iommu driver ops when working with dma-iommu.c and IOMMUFD.

I've been working on an implementation and hope to have something
draft to show on the lists in a few weeks. It is pretty clear there
are several interesting decisions to make that I think will benefit
from a live discussion.

Providing a kernel-wide alternative to scatterlist is something that
has general interest across all the driver subsystems. I've started to
view the general problem rather like xarray where the main focus is to
create the appropriate abstraction and then go about transforming
users to take advantage of the cleaner abstraction. scatterlist
suffers here because it has an incredibly leaky API, a huge number of
(often sketchy driver) users, and has historically been very difficult
to improve.

The session would quickly go over the current state of whatever the
mailing list discussion evolves into and an open discussion around the
different ideas.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF proposal]: Physr discussion
  2023-01-21 15:03 ` Jason Gunthorpe
@ 2023-01-23  4:36   ` Matthew Wilcox
  -1 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2023-01-23  4:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: lsf-pc, linux-mm, iommu, linux-rdma, Christoph Hellwig,
	Joao Martins, John Hubbard, Logan Gunthorpe, Ming Lei,
	linux-block, netdev, dri-devel, nvdimm, Shakeel Butt

On Sat, Jan 21, 2023 at 11:03:05AM -0400, Jason Gunthorpe wrote:
> I would like to have a session at LSF to talk about Matthew's
> physr discussion starter:
> 
>  https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/

I'm definitely interested in discussing phyrs (even if you'd rather
pronounce it "fizzers" than "fires" ;-)

> I've been working on an implementation and hope to have something
> draft to show on the lists in a few weeks. It is pretty clear there
> are several interesting decisions to make that I think will benefit
> from a live discussion.

Cool!  Here's my latest noodlings:
https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/phyr

Just the top two commits; the other stuff is unrelated.  Shakeel has
also been interested in this.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF proposal]: Physr discussion
  2023-01-23  4:36   ` Matthew Wilcox
@ 2023-01-23 13:44     ` Jason Gunthorpe
  -1 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2023-01-23 13:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-mm, iommu, linux-rdma, Christoph Hellwig,
	Joao Martins, John Hubbard, Logan Gunthorpe, Ming Lei,
	linux-block, netdev, dri-devel, nvdimm, Shakeel Butt

On Mon, Jan 23, 2023 at 04:36:25AM +0000, Matthew Wilcox wrote:

> > I've been working on an implementation and hope to have something
> > draft to show on the lists in a few weeks. It is pretty clear there
> > are several interesting decisions to make that I think will benefit
> > from a live discussion.
> 
> Cool!  Here's my latest noodlings:
> https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/phyr
> 
> Just the top two commits; the other stuff is unrelated.  Shakeel has
> also been interested in this.

I've gone from quite a different starting point - I've been working
DMA API upwards, so what does the dma_map_XX look like, what APIs do
we need to support the dma_map_ops implementations to iterate/etc, how
do we form and return the dma mapped list, how does P2P, with all the
checks, actually work, etc. These help inform what we want from the
"phyr" as an API.

The DMA API is the fundamental reason why everything has to use
scatterlist - it is the only way to efficiently DMA map anything more
than a few pages. If we can't solve that then everything else is
useless, IMHO.

If we have an agreement on DMA API then things like converting RDMA to
use it and adding it to DMABUF are comparatively straightforward.

There are 24 implementations of dma_map_ops, so my approach is to try
to build a non-leaky 'phyr' API that doesn't actually care how the
physical ranges are stored, separates CPU and DMA and then use that to
get all 24 implementations.
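
To make that concrete - a sketch only, with invented names rather than a
real interface - the map_ops side would only ever see an opaque iterator
that yields (physical address, length) pairs, however the caller stores
them:

struct phyr_iter;	/* opaque; backed by a page list, bio_vec array, ... */

phys_addr_t phyr_iter_phys(const struct phyr_iter *iter);
size_t phyr_iter_len(const struct phyr_iter *iter);
bool phyr_iter_next(struct phyr_iter *iter);	/* false when exhausted */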

With a good API we can fiddle with the exact nature of the phyr as we
like.

I've also been exploring the idea that with a non-leaking API we don't
actually need to settle on one phyr to rule them all. bio_vec can stay
as is, but become directly dma mappable, rdma/drm can use something
better suited to the page list use cases (eg 8 bytes/entry not 16),
and a non-leaking API can multiplex these different memory layouts and
allow one dma_map_ops implementation to work on both.
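
For scale: today's bio_vec is a struct page pointer plus a 32-bit length
and a 32-bit offset, so 16 bytes per entry on 64-bit, while a packed
page-list entry could be a single 64-bit word (hypothetical layout, just
to illustrate the point):

struct phyr_pfn {
	u64	pfn : 52;	/* page frame number */
	u64	nr_pages : 12;	/* contiguous pages starting at pfn */
};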

Thanks,
Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion
  2023-01-21 15:03 ` Jason Gunthorpe
@ 2023-01-23 19:36   ` Dan Williams
  -1 siblings, 0 replies; 27+ messages in thread
From: Dan Williams @ 2023-01-23 19:36 UTC (permalink / raw)
  To: Jason Gunthorpe via Lsf-pc, lsf-pc, linux-mm, iommu, linux-rdma
  Cc: nvdimm, linux-rdma, John Hubbard, Matthew Wilcox, Ming Lei,
	linux-block, linux-mm, dri-devel, netdev, Joao Martins,
	Logan Gunthorpe, Christoph Hellwig

Jason Gunthorpe via Lsf-pc wrote:
> I would like to have a session at LSF to talk about Matthew's
> physr discussion starter:
> 
>  https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
> 
> I have become interested in this with some immediacy because of
> IOMMUFD and this other discussion with Christoph:
> 
>  https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/

I think this is a worthwhile discussion. My main hangup with 'struct
page' elimination in general is that if anything needs to be allocated
to describe a physical address for other parts of the kernel to operate
on it, why not a 'struct page'? There are of course several difficulties
allocating a 'struct page' array, but I look at subsection support and
the tail page space optimization work as evidence that some of the pain
can be mitigated; what more needs to be done? I also think this is
somewhat of a separate consideration from replacing a bio_vec with phyr
where that has value independent of the mechanism used to manage
phys_addr_t => dma_addr_t.

> Which results in, more or less, we have no way to do P2P DMA
> operations without struct page - and from the RDMA side solving this
> well at the DMA API means advancing at least some part of the physr
> idea.
> 
> So - my objective is to enable the DMA API to "DMA map" something that
> is not a scatterlist, may or may not contain struct pages, but can
> still contain P2P DMA data. From there I would move RDMA MR's to use
> this new API, modify DMABUF to export it, complete the above VFIO
> series, and finally, use all of this to add back P2P support to VFIO
> when working with IOMMUFD by allowing IOMMUFD to obtain a safe
> reference to the VFIO memory using DMABUF. From there we'd want to see
> pin_user_pages optimized, and that also will need some discussion how
> best to structure it.
> 
> I also have several ideas on how something like physr can optimize the
> iommu driver ops when working with dma-iommu.c and IOMMUFD.
> 
> I've been working on an implementation and hope to have something
> draft to show on the lists in a few weeks. It is pretty clear there
> are several interesting decisions to make that I think will benefit
> from a live discussion.
> 
> Providing a kernel-wide alternative to scatterlist is something that
> has general interest across all the driver subsystems. I've started to
> view the general problem rather like xarray where the main focus is to
> create the appropriate abstraction and then go about transforming
> users to take advantage of the cleaner abstraction. scatterlist
> suffers here because it has an incredibly leaky API, a huge number of
> (often sketchy driver) users, and has historically been very difficult
> to improve.

When I read "general interest across all the driver subsystems" it is
hard not to ask "have all possible avenues to enable 'struct page' been
exhausted?"

> The session would quickly go over the current state of whatever the
> mailing list discussion evolves into and an open discussion around the
> different ideas.

Sounds good to me.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF proposal]: Physr discussion
  2023-01-23 13:44     ` Jason Gunthorpe
@ 2023-01-23 19:47       ` Bart Van Assche
  -1 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2023-01-23 19:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Wilcox
  Cc: lsf-pc, linux-mm, iommu, linux-rdma, Christoph Hellwig,
	Joao Martins, John Hubbard, Logan Gunthorpe, Ming Lei,
	linux-block, netdev, dri-devel, nvdimm, Shakeel Butt

On 1/23/23 05:44, Jason Gunthorpe wrote:
> I've gone from quite a different starting point - I've been working
> DMA API upwards, so what does the dma_map_XX look like, what APIs do
> we need to support the dma_map_ops implementations to iterate/etc, how
> do we form and return the dma mapped list, how does P2P, with all the
> checks, actually work, etc. These help inform what we want from the
> "phyr" as an API.

I'm interested in this topic. I'm wondering whether eliminating 
scatterlists could help to make the block layer faster.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion
  2023-01-23 19:36   ` Dan Williams
@ 2023-01-23 20:11     ` Matthew Wilcox
  -1 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2023-01-23 20:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: nvdimm, lsf-pc, linux-rdma, John Hubbard, dri-devel, Ming Lei,
	linux-block, linux-mm, iommu, netdev, Joao Martins,
	Jason Gunthorpe via Lsf-pc, Logan Gunthorpe, Christoph Hellwig

On Mon, Jan 23, 2023 at 11:36:51AM -0800, Dan Williams wrote:
> Jason Gunthorpe via Lsf-pc wrote:
> > I would like to have a session at LSF to talk about Matthew's
> > physr discussion starter:
> > 
> >  https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
> > 
> > I have become interested in this with some immediacy because of
> > IOMMUFD and this other discussion with Christoph:
> > 
> >  https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
> 
> I think this is a worthwhile discussion. My main hangup with 'struct
> page' elimination in general is that if anything needs to be allocated

You're the first one to bring up struct page elimination.  Neither Jason
nor I have that as our motivation.  But there are reasons why struct page
is a bad data structure, and Xen proves that you don't need to have such
a data structure in order to do I/O.

> When I read "general interest across all the driver subsystems" it is
> hard not to ask "have all possible avenues to enable 'struct page' been
> exhausted?"

Yes, we should definitely expend yet more resources chasing a poor
implementation.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion
  2023-01-23 20:11     ` Matthew Wilcox
@ 2023-01-23 20:50       ` Dan Williams
  -1 siblings, 0 replies; 27+ messages in thread
From: Dan Williams @ 2023-01-23 20:50 UTC (permalink / raw)
  To: Matthew Wilcox, Dan Williams
  Cc: nvdimm, lsf-pc, linux-rdma, John Hubbard, dri-devel, Ming Lei,
	linux-block, linux-mm, iommu, netdev, Joao Martins,
	Jason Gunthorpe via Lsf-pc, Logan Gunthorpe, Christoph Hellwig

Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 11:36:51AM -0800, Dan Williams wrote:
> > Jason Gunthorpe via Lsf-pc wrote:
> > > I would like to have a session at LSF to talk about Matthew's
> > > physr discussion starter:
> > > 
> > >  https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
> > > 
> > > I have become interested in this with some immediacy because of
> > > IOMMUFD and this other discussion with Christoph:
> > > 
> > >  https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
> > 
> > I think this is a worthwhile discussion. My main hangup with 'struct
> > page' elimination in general is that if anything needs to be allocated
> 
> You're the first one to bring up struct page elimination.  Neither Jason
> nor I have that as our motivation.

Oh, ok, then maybe I misread the concern in the vfio discussion. I
thought the summary there is debating the ongoing requirement for
'struct page' for P2PDMA?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion
  2023-01-23 20:50       ` Dan Williams
  (?)
@ 2023-01-23 22:46       ` Matthew Wilcox
  -1 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2023-01-23 22:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: nvdimm, lsf-pc, linux-rdma, John Hubbard, dri-devel, Ming Lei,
	linux-block, linux-mm, iommu, netdev, Joao Martins,
	Jason Gunthorpe via Lsf-pc, Logan Gunthorpe, Christoph Hellwig

On Mon, Jan 23, 2023 at 12:50:52PM -0800, Dan Williams wrote:
> Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 11:36:51AM -0800, Dan Williams wrote:
> > > Jason Gunthorpe via Lsf-pc wrote:
> > > > I would like to have a session at LSF to talk about Matthew's
> > > > physr discussion starter:
> > > > 
> > > >  https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
> > > > 
> > > > I have become interested in this with some immediacy because of
> > > > IOMMUFD and this other discussion with Christoph:
> > > > 
> > > >  https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
> > > 
> > > I think this is a worthwhile discussion. My main hangup with 'struct
> > > page' elimination in general is that if anything needs to be allocated
> > 
> > You're the first one to bring up struct page elimination.  Neither Jason
> > nor I have that as our motivation.
> 
> Oh, ok, then maybe I misread the concern in the vfio discussion. I
> thought the summary there is debating the ongoing requirement for
> 'struct page' for P2PDMA?

My reading of that thread is that while it started out that way, it
became more about "So what would a good interface be for doing this".  And
Jason's right, he and I are approaching this from different directions.
My concern is from the GUP side where we start out by getting a folio
(which we know is physically contiguous) and decomposing it into pages.
Then we aggregate all those pages together which are physically contiguous
and stuff them into a bio_vec.  After that, I lose interest; I was
planning on having DMA mapping interfaces which took in an array of
phyr and spat out scatterlists.  Then we could shrink the scatterlist
by removing page_link and offset, leaving us with only dma_address,
length and maybe flags.
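
Roughly - a sketch with made-up names, not something I've implemented -
the output side would then shrink to:

/* what's left of a scatterlist entry once page_link/offset are gone */
struct dma_range {
	dma_addr_t	dma_address;
	unsigned int	length;
	unsigned int	flags;		/* maybe */
};

/* hypothetical mapping call: array of phyr in, dma_ranges out */
int dma_map_phyr(struct device *dev, const struct phyr *ranges,
		 unsigned int nr_ranges, struct dma_range *out,
		 enum dma_data_direction dir);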

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF proposal]: Physr discussion
  2023-01-23 19:47       ` Bart Van Assche
@ 2023-01-24  6:15         ` Chaitanya Kulkarni
  -1 siblings, 0 replies; 27+ messages in thread
From: Chaitanya Kulkarni @ 2023-01-24  6:15 UTC (permalink / raw)
  To: lsf-pc, Bart Van Assche, Jason Gunthorpe, Matthew Wilcox
  Cc: linux-mm, iommu, linux-rdma, Christoph Hellwig, Joao Martins,
	John Hubbard, Logan Gunthorpe, Ming Lei, linux-block, netdev,
	dri-devel, nvdimm, Shakeel Butt

On 1/23/23 11:47, Bart Van Assche wrote:
> On 1/23/23 05:44, Jason Gunthorpe wrote:
>> I've gone from quite a different starting point - I've been working
>> DMA API upwards, so what does the dma_map_XX look like, what APIs do
>> we need to support the dma_map_ops implementations to iterate/etc, how
>> do we form and return the dma mapped list, how does P2P, with all the
>> checks, actually work, etc. These help inform what we want from the
>> "phyr" as an API.
> 
> I'm interested in this topic. I'm wondering whether eliminating 
> scatterlists could help to make the block layer faster.
> 
> Thanks,
> 
> Bart.
> 

I think it will be very interesting to discuss this in great detail
and come up with a plan.

+1 from me.

-ck


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF proposal]: Physr discussion
  2023-01-21 15:03 ` Jason Gunthorpe
@ 2023-01-26  1:45   ` Zhu Yanjun
  -1 siblings, 0 replies; 27+ messages in thread
From: Zhu Yanjun @ 2023-01-26  1:45 UTC (permalink / raw)
  To: Jason Gunthorpe, lsf-pc, linux-mm, iommu, linux-rdma
  Cc: Matthew Wilcox, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, dri-devel,
	nvdimm

On 2023/1/21 23:03, Jason Gunthorpe wrote:
> I would like to have a session at LSF to talk about Matthew's
> physr discussion starter:
> 
>   https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
> 
> I have become interested in this with some immediacy because of
> IOMMUFD and this other discussion with Christoph:
> 
>   https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/

I read through the above patches. I am interested in the dma-buf.

Zhu Yanjun

>      
> Which results in, more or less, we have no way to do P2P DMA
> operations without struct page - and from the RDMA side solving this
> well at the DMA API means advancing at least some part of the physr
> idea.
> 
> So - my objective is to enable the DMA API to "DMA map" something that
> is not a scatterlist, may or may not contain struct pages, but can
> still contain P2P DMA data. From there I would move RDMA MR's to use
> this new API, modify DMABUF to export it, complete the above VFIO
> series, and finally, use all of this to add back P2P support to VFIO
> when working with IOMMUFD by allowing IOMMUFD to obtain a safe
> reference to the VFIO memory using DMABUF. From there we'd want to see
> pin_user_pages optimized, and that also will need some discussion how
> best to structure it.
> 
> I also have several ideas on how something like physr can optimize the
> iommu driver ops when working with dma-iommu.c and IOMMUFD.
> 
> I've been working on an implementation and hope to have something
> draft to show on the lists in a few weeks. It is pretty clear there
> are several interesting decisions to make that I think will benefit
> from a live discussion.
> 
> Providing a kernel-wide alternative to scatterlist is something that
> has general interest across all the driver subsystems. I've started to
> view the general problem rather like xarray where the main focus is to
> create the appropriate abstraction and then go about transforming
> users to take advantage of the cleaner abstraction. scatterlist
> suffers here because it has an incredibly leaky API, a huge number of
> (often sketchy driver) users, and has historically been very difficult
> to improve.
> 
> The session would quickly go over the current state of whatever the
> mailing list discussion evolves into and an open discussion around the
> different ideas.
> 
> Thanks,
> Jason
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF proposal]: Physr discussion
  2023-01-23  4:36   ` Matthew Wilcox
@ 2023-01-26  9:39     ` Mike Rapoport
  -1 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2023-01-26  9:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, lsf-pc, linux-mm, iommu, linux-rdma,
	Christoph Hellwig, Joao Martins, John Hubbard, Logan Gunthorpe,
	Ming Lei, linux-block, netdev, dri-devel, nvdimm, Shakeel Butt

On Mon, Jan 23, 2023 at 04:36:25AM +0000, Matthew Wilcox wrote:
> On Sat, Jan 21, 2023 at 11:03:05AM -0400, Jason Gunthorpe wrote:
> > I would like to have a session at LSF to talk about Matthew's
> > physr discussion starter:
> > 
> >  https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
> 
> I'm definitely interested in discussing phyrs (even if you'd rather
> pronounce it "fizzers" than "fires" ;-)

I'm also interested in this discussion. With my accent it will be фыр,
though ;-)
 
> > I've been working on an implementation and hope to have something
> > draft to show on the lists in a few weeks. It is pretty clear there
> > are several interesting decisions to make that I think will benefit
> > from a live discussion.
> 
> Cool!  Here's my latest noodlings:
> https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/phyr
> 
> Just the top two commits; the other stuff is unrelated.  Shakeel has
> also been interested in this.
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF proposal]: Physr discussion
  2023-01-23 20:50       ` Dan Williams
@ 2023-01-26 19:38         ` Jason Gunthorpe
  -1 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2023-01-26 19:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: Matthew Wilcox, nvdimm, lsf-pc, linux-rdma, John Hubbard,
	dri-devel, Ming Lei, linux-block, linux-mm, iommu, netdev,
	Joao Martins, Jason Gunthorpe via Lsf-pc, Logan Gunthorpe,
	Christoph Hellwig

On Mon, Jan 23, 2023 at 12:50:52PM -0800, Dan Williams wrote:
> Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 11:36:51AM -0800, Dan Williams wrote:
> > > Jason Gunthorpe via Lsf-pc wrote:
> > > > I would like to have a session at LSF to talk about Matthew's
> > > > physr discussion starter:
> > > > 
> > > >  https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
> > > > 
> > > > I have become interested in this with some immediacy because of
> > > > IOMMUFD and this other discussion with Christoph:
> > > > 
> > > >  https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
> > > 
> > > I think this is a worthwhile discussion. My main hangup with 'struct
> > > page' elimination in general is that if anything needs to be allocated
> > 
> > You're the first one to bring up struct page elimination.  Neither Jason
> > nor I have that as our motivation.
> 
> Oh, ok, then maybe I misread the concern in the vfio discussion. I
> thought the summary there is debating the ongoing requirement for
> 'struct page' for P2PDMA?

The VFIO problem is that we need a unique pgmap at 4k granules (or maybe
smaller, technically), tightly packed, because VFIO exposes PCI BAR
space that can be sized in such small amounts.

So, using struct page means some kind of adventure in the memory
hotplug code to allow tightly packed 4k pgmaps.

And that is assuming that every architecture that wants to support
VFIO supports pgmap and memory hot plug. I was just told that s390
doesn't, that is kind of important..

If there is a straightforward way to get a pgmap into VFIO then I'd do
that and give up this quest :)

I've never been looking at this from the angle of eliminating struct
page, but from the perspective of allowing the DMA API to correctly do
scatter/gather IO to non-struct page P2P memory because I *can't* get
a struct page for it. Ie make dma_map_resource() better. Make P2P
DMABUF work properly.
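
For reference, the current interface for non-struct-page memory is a
single contiguous range (roughly as declared in dma-mapping.h):

dma_addr_t dma_map_resource(struct device *dev, phys_addr_t phys_addr,
			    size_t size, enum dma_data_direction dir,
			    unsigned long attrs);

It knows nothing about the device providing the address, so it cannot do
the P2P distance/bus-offset checks, and it takes no list of ranges -
that is what "better" would have to mean.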

This has to come along with a different way to store address ranges
because the basic datum that needs to cross all the functional
boundaries we have is an address range list.

My general current sketch is we'd allocate some 'DMA P2P provider'
structure analogous to the MEMORY_DEVICE_PCI_P2PDMA pgmap and a single
provider would cover the entire MMIO aperture - eg the providing
device's MMIO BAR. This is enough information for the DMA API to do
its job.
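
Sketching that (the exact fields are a guess, only the shape matters):

/*
 * One provider per MMIO aperture, eg per exposed BAR. Enough for the
 * DMA API to do the P2P distance and bus-offset work without any
 * pgmap/struct page behind the addresses.
 */
struct p2pdma_provider {
	struct device	*owner;		/* device exposing the MMIO */
	u64		bus_offset;	/* CPU phys -> PCI bus address */
};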

We get this back either by searching an interval treey thing on the
physical address or by storing it directly in the address range list.

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF proposal]: Physr discussion
  2023-01-21 15:03 ` Jason Gunthorpe
@ 2023-02-28 20:59   ` T.J. Mercier
  -1 siblings, 0 replies; 27+ messages in thread
From: T.J. Mercier @ 2023-02-28 20:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: lsf-pc, linux-mm, iommu, linux-rdma, Matthew Wilcox,
	Christoph Hellwig, Joao Martins, John Hubbard, Logan Gunthorpe,
	Ming Lei, linux-block, netdev, dri-devel, nvdimm

On Sat, Jan 21, 2023 at 7:03 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> I would like to have a session at LSF to talk about Matthew's
> physr discussion starter:
>
>  https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
>
> I have become interested in this with some immediacy because of
> IOMMUFD and this other discussion with Christoph:
>
>  https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
>
> Which results in, more or less, we have no way to do P2P DMA
> operations without struct page - and from the RDMA side solving this
> well at the DMA API means advancing at least some part of the physr
> idea.
>
> So - my objective is to enable the DMA API to "DMA map" something that
> is not a scatterlist, may or may not contain struct pages, but can
> still contain P2P DMA data. From there I would move RDMA MR's to use
> this new API, modify DMABUF to export it, complete the above VFIO
> series, and finally, use all of this to add back P2P support to VFIO
> when working with IOMMUFD by allowing IOMMUFD to obtain a safe
> reference to the VFIO memory using DMABUF. From there we'd want to see
> pin_user_pages optimized, and that also will need some discussion how
> best to structure it.
>
> I also have several ideas on how something like physr can optimize the
> iommu driver ops when working with dma-iommu.c and IOMMUFD.
>
> I've been working on an implementation and hope to have something
> draft to show on the lists in a few weeks. It is pretty clear there
> are several interesting decisions to make that I think will benefit
> from a live discussion.
>
> Providing a kernel-wide alternative to scatterlist is something that
> has general interest across all the driver subsystems. I've started to
> view the general problem rather like xarray where the main focus is to
> create the appropriate abstraction and then go about transforming
> users to take advantage of the cleaner abstraction. scatterlist
> suffers here because it has an incredibly leaky API, a huge number of
> (often sketchy driver) users, and has historically been very difficult
> to improve.
>
> The session would quickly go over the current state of whatever the
> mailing list discussion evolves into and an open discussion around the
> different ideas.
>
> Thanks,
> Jason
>

Hi, I'm interested in participating in this discussion!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF proposal]: Physr discussion
  2023-02-28 20:59   ` T.J. Mercier
@ 2023-04-17 19:59     ` Jason Gunthorpe
  -1 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2023-04-17 19:59 UTC (permalink / raw)
  To: linux-mm
  Cc: lsf-pc, linux-mm, iommu, linux-rdma, Matthew Wilcox,
	Christoph Hellwig, Joao Martins, John Hubbard, Logan Gunthorpe,
	Ming Lei, linux-block, netdev, dri-devel, nvdimm, T.J. Mercier,
	Zhu Yanjun, Dan Williams, Mike Rapoport, Bart Van Assche,
	Chaitanya Kulkarni

On Tue, Feb 28, 2023 at 12:59:41PM -0800, T.J. Mercier wrote:
> On Sat, Jan 21, 2023 at 7:03 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > I would like to have a session at LSF to talk about Matthew's
> > physr discussion starter:
> >
> >  https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
> >
> > I have become interested in this with some immediacy because of
> > IOMMUFD and this other discussion with Christoph:
> >
> >  https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
> >
> > Which results in, more or less, we have no way to do P2P DMA
> > operations without struct page - and from the RDMA side solving this
> > well at the DMA API means advancing at least some part of the physr
> > idea.

[..]

I got fairly far along with this and had to put it aside for some other
tasks, but here is what I came up with so far:

https://github.com/jgunthorpe/linux/commits/rlist

      PCI/P2PDMA: Do not store bus_off in the pci_p2pdma_map_state
      PCI/P2PDMA: Split out the information about the providing device from pgmap
      PCI/P2PDMA: Move the DMA API helpers to p2pdma_provider
      lib/rlist: Introduce range list
      lib/rlist: Introduce rlist cpu range iterator
      PCI/P2PDMA: Store the p2pdma_provider structs in an xarray
      lib/rlist: Introduce rlist_dma
      dma: Add DMA direct support for rlist mapping
      dma: Generic rlist dma_map_ops
      dma: Add DMA API support for mapping a rlist_cpu to a rlist_dma
      iommu/dma: Implement native rlist dma_map_ops
      dma: Use generic_dma.*_rlist in simple dma_map_ops implementations
      dma: Use generic_dma.*_rlist when map_sg just does map_page for n=1
      dma: Use generic_dma.*_rlist when iommu_area_alloc() is used
      dma/dummy: Add rlist
      s390/dma: Use generic_dma.*_rlist
      mm/gup: Create a wrapper for pin_user_pages to return a rlist
      dmabuf: WIP DMABUF exports the backing memory through rcpu
      RDMA/mlx5: Use rdma_umem_for_each_dma_block()
      RMDA/mlx: Use rdma_umem_for_each_dma_block() instead of sg_dma_address
      RDMA/mlx5: Use the length of the MR not the umem
      RDMA/umem: Add ib_umem_length() instead of open coding
      RDMA: Add IB DMA API wrappers for rlist
      RDMA: Switch ib_umem to rlist
      cover-letter: RFC Create an alternative to scatterlist in the DMA API

It is huge and scary. It is not quite nice enough to post but should
be an interesting starting point for LSF/MM. At least it broadly shows
all the touching required and why this is such a nasty problem.

The draft cover letter explaining what the series does:

    cover-letter: RFC Create an alternative to scatterlist in the DMA API
    
    This was kicked off by Matthew with his phyr idea from this thread:
    
    https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
    
    However, I have become interested in this with some immediacy because of
    IOMMUFD and this other discussion with Christoph:
    
    https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
    
    Which results in, more or less, we have no way to do P2P DMA operations
    without struct page. This becomes complicated when we touch RDMA, which
    relies heavily on scatterlist in its internal implementation, so being
    unable to use scatterlist to store only dma_addr_t's means RDMA needs a
    complete scatterlist replacement that can do so.
    
    So - my objective is to enable the DMA API to "DMA map" something that is
    not a scatterlist, may or may not contain struct pages, but can still
    contain P2P DMA physical addresses. With this tool, transform the RDMA
    subsystem to use the new DMA API and then go into DMABUF and stop creating
    scatterlists that have no CPU pages behind them. From that point we could
    implement DMABUF in VFIO (as above) and use the DMABUF to feed the MMIO
    pages into IOMMUFD to restore the PCI P2P support in VMs without creating
    the follow_pte security problem that VFIO has.
    
    After going through the thread again, and making some sketches, I've come
    up with this suggestion as a path forward, explored very roughly in this
    RFC:
    
    1) Create something I've called a 'range list CPU iterator'. This is an
       API that abstractly iterates over CPU physical memory ranges. It
       has useful helpers to walk the ranges as 'struct page/folio *' or as
       physical addresses, to copy to/from them, and so on.
    
       It has the necessary extra bits beyond the physr sketch to support P2P
       in the DMA API, based on what was done for the pgmap-based support, i.e.
       we need to know the provider of the non-struct page memory to get the
       struct device to compute the p2p distance and the pci_offset.
    
       The immediate idea is that this is an iterator, not a data structure,
       so it can iterate over different kinds of storage. This frees us from
       having to immediately consolidate all the different storage schemes in
       the kernel and lets that work happen over time.
    
       I imagine we would want to have this work with struct page * (for GUP)
       and bio_vec (for storage/net) and something else for the "kitchen sink"
       with DMABUF/etc. We may also want to allow it to wrap scatterlist to
       provide for a more gradual code migration.
    
       Things are organized so sometime in the future this could collapse down
       into something that is not a multi-storage iterator, but perhaps just
       a single storage type that everyone is happy with.
    
       In the meantime we can use the API to make progress on all the other
       related infrastructure.
    
       Fundamentally this tries to avoid the scatterlist mistake of leaking
       too much of the storage implementation detail to the user. (A rough
       sketch of the iterator's shape follows after this list.)
    
    2) Create a general storage called the "range list". This is intended to
       be a general catch-all like scatterlist is, and it is optimized
       towards page list users, so it is quite good for what RDMA wants.
    
    3) Create a "range list DMA iterator", which is the dma_addr_t version of
       #1. This needs to have all the goodies to break up the ranges into
       things HW would like, such as page lists or restricted scatter/gather
       records (a usage sketch also follows after this list).
    
       I've been able to draft several optimizations in the DMA mapping path
       that should help offset some of the CPU cost of the more abstracted
       iterators:
    
           - DMA direct can directly re-use the CPU list with no iteration or
             memory allocation.
    
           - The IOMMU path can do only one iteration by pre-recording if the
             CPU list was all page-aligned when it was created.
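    
    To make #1 concrete, here is a minimal sketch of the shape such an
    iterator could take - the names and fields are illustrative assumptions
    for discussion, not the actual definitions in the branch above:
    
        /* Sketch only - not the real rlist_cpu API */
        struct rlist_cpu_iter {
                phys_addr_t phys;       /* start of the current range */
                size_t len;             /* length of the current range */
                /* NULL for ordinary host memory; otherwise identifies the
                 * device providing the MMIO so the p2p distance and
                 * pci_offset can be computed at map time */
                struct p2pdma_provider *provider;
                /* private cursor into whatever storage backs the iterator:
                 * a struct page array, bio_vec, wrapped scatterlist, ... */
        };
    
        #define rlist_cpu_for_each(rc, it)                              \
                for (rlist_cpu_iter_first(rc, it);                      \
                     !rlist_cpu_iter_done(it);                          \
                     rlist_cpu_iter_next(it))
    
        /* Helper valid only for ranges backed by struct page memory */
        static inline struct page *rlist_cpu_iter_page(struct rlist_cpu_iter *it)
        {
                return pfn_to_page(PHYS_PFN(it->phys));
        }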
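    
    And a rough usage sketch for #2 and #3 - again, every identifier below
    is a placeholder for illustration, not the API as it exists in the
    branch - a driver building a HW page list might end up with something
    like:
    
        /* Sketch only: rlist_pin_user_pages(), dma_map_rlist() and
         * rlist_dma_for_each_block() are stand-in names */
        static int map_user_buffer(struct device *dev, unsigned long uaddr,
                                   size_t length, __be64 *hw_page_list,
                                   size_t hw_page_size)
        {
                struct rlist *rl;
                struct rlist_dma_iter it;
                size_t i = 0;
                int rc;
    
                /* pin_user_pages() wrapper that records the pinned memory
                 * as CPU physical ranges instead of a scatterlist */
                rl = rlist_pin_user_pages(uaddr, length, FOLL_WRITE);
                if (IS_ERR(rl))
                        return PTR_ERR(rl);
    
                /* One DMA API call maps the whole list. DMA direct could
                 * reuse the CPU ranges in place; the IOMMU path could do a
                 * single pass because alignment was recorded at creation. */
                rc = dma_map_rlist(dev, rl, DMA_BIDIRECTIONAL, GFP_KERNEL);
                if (rc)
                        goto err_unpin;
    
                /* The DMA-side iterator splits the mapped ranges into the
                 * device's block size, regardless of how the CPU side was
                 * stored */
                rlist_dma_for_each_block(rl, &it, hw_page_size)
                        hw_page_list[i++] = cpu_to_be64(rlist_dma_block(&it));
    
                return 0;
    
        err_unpin:
                rlist_unpin_user_pages(rl);
                return rc;
        }
    
    (The point of this shape is that the driver never touches the storage
    format at all - CPU ranges go in and DMA blocks come out.)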
    
    The following patches go deeper into my thinking, present fairly complete
    drafts of what things could look like, and more broadly explore the whole
    idea.
    
    At the end of the series we have
     - rlist, rlist_cpu, rlist_dma implementations
     - An rlist implementation for every dma_map_ops
     - Good rlist implementations for DMA direct and dma-iommu.c
     - A pin_user_pages() wrapper
     - RDMA umem converted and compiling with some RDMA drivers
     - Compile tested only :)
    
    It is a huge amount of work; I'd like to get a sense of what people think
    before going more deeply into a final, tested implementation. I know
    this is not quite what Matthew and Christoph have talked about.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-04-17 19:59 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-21 15:03 [LSF/MM/BPF proposal]: Physr discussion Jason Gunthorpe
2023-01-21 15:03 ` Jason Gunthorpe
2023-01-23  4:36 ` Matthew Wilcox
2023-01-23  4:36   ` Matthew Wilcox
2023-01-23 13:44   ` Jason Gunthorpe
2023-01-23 13:44     ` Jason Gunthorpe
2023-01-23 19:47     ` Bart Van Assche
2023-01-23 19:47       ` Bart Van Assche
2023-01-24  6:15       ` Chaitanya Kulkarni
2023-01-24  6:15         ` Chaitanya Kulkarni
2023-01-26  9:39   ` Mike Rapoport
2023-01-26  9:39     ` Mike Rapoport
2023-01-23 19:36 ` [Lsf-pc] " Dan Williams
2023-01-23 19:36   ` Dan Williams
2023-01-23 20:11   ` Matthew Wilcox
2023-01-23 20:11     ` Matthew Wilcox
2023-01-23 20:50     ` Dan Williams
2023-01-23 20:50       ` Dan Williams
2023-01-23 22:46       ` Matthew Wilcox
2023-01-26 19:38       ` Jason Gunthorpe
2023-01-26 19:38         ` Jason Gunthorpe
2023-01-26  1:45 ` Zhu Yanjun
2023-01-26  1:45   ` Zhu Yanjun
2023-02-28 20:59 ` T.J. Mercier
2023-02-28 20:59   ` T.J. Mercier
2023-04-17 19:59   ` Jason Gunthorpe
2023-04-17 19:59     ` Jason Gunthorpe
