* Phyr Starter
From: Matthew Wilcox @ 2022-01-10 19:34 UTC
  To: linux-kernel
  Cc: Christoph Hellwig, Jason Gunthorpe, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

TLDR: I want to introduce a new data type:

struct phyr {
        phys_addr_t addr;
        size_t len;
};

and use it to replace bio_vec, as well as the array of struct pages
used by get_user_pages() and friends.

---

There are two distinct problems I want to address: doing I/O to memory
which does not have a struct page and efficiently doing I/O to large
blobs of physically contiguous memory, regardless of whether it has a
struct page.  There are some other improvements which I regard as minor.

There are many types of memory that one might want to do I/O to that do
not have a struct page, some examples:
 - Memory on a graphics card (or other PCI card, but gfx seems to be
   the primary provider of DRAM on the PCI bus today)
 - DAX, or other pmem (there are some fake pages today, but this is
   mostly a workaround for the IO problem)
 - Guest memory being accessed from the hypervisor (KVM needs to
   create struct pages to make this happen.  Xen doesn't ...)
All of these kinds of memories can be addressed by the CPU and so also
by a bus master.  That is, there is a physical address that the CPU
can use which will address this memory, and there is a way to convert
that to a DMA address which can be programmed into another device.
There's no intent here to support memory which can be accessed by a
complex scheme like writing an address to a control register and then
accessing the memory through a FIFO; this is for memory which can be
accessed by DMA and CPU loads and stores.

For get_user_pages() and friends, we currently fill an array of struct
pages, each one representing PAGE_SIZE bytes.  For an application that
is using 1GB hugepages, writing 2^18 entries is a significant overhead.
It also makes drivers hard to write, as they have to recoalesce the
struct pages even though the VM can tell them whether those 2^18 pages
are contiguous.
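
As an illustration, here is an entire 1GB hugepage as a single phyr
(SZ_1G is the generic size constant; 'base' is an assumed physical
address):

struct phyr r = {
        .addr = base,   /* physical base of the 1GB hugepage */
        .len  = SZ_1G,  /* one entry instead of 2^18 struct pages */
};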

On the minor side, struct phyr can represent any mappable chunk of memory.
A bio_vec is limited to 2^32 bytes, while on 64-bit machines a phyr
can represent ranges larger than 4GB.  A phyr is the same size as a bio_vec
on 64-bit (16 bytes), and the same size for 32-bit with PAE (12 bytes).
It is smaller for 32-bit machines without PAE (8 bytes instead of 12).
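
For comparison, the bio_vec being replaced looks like this today (from
include/linux/bvec.h, minus comments):

struct bio_vec {
        struct page     *bv_page;
        unsigned int    bv_len;
        unsigned int    bv_offset;
};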

Finally, it may be possible to stop using scatterlist to describe the
input to the DMA-mapping operation.  We may be able to get struct
scatterlist down to just dma_address and dma_length, with chaining
handled through an enclosing struct.

I would like to see phyr replace bio_vec everywhere it's currently used.
I don't have time to do that work now because I'm busy with folios.
If someone else wants to take that on, I shall cheer from the sidelines.
What I do intend to do is:

 - Add an interface to gup.c to pin/unpin N phyrs
 - Add a sg_map_phyrs()
   This will take an array of phyrs and allocate an sg for them
 - Whatever else I need to do to make one RDMA driver happy with
   this scheme

At that point, I intend to stop and let others more familiar with this
area of the kernel continue the conversion of drivers.
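
To make the intent concrete, those interfaces might look something like
this (a sketch only; every name and signature here is an assumption,
not settled API):

/* Pin the user address range, filling in up to nr_phyrs ranges. */
long pin_user_phyrs(unsigned long start, size_t nr_bytes,
                    unsigned int gup_flags, struct phyr *phyrs,
                    unsigned int nr_phyrs);

/* Drop the pins on an entire set of ranges. */
void unpin_user_phyrs(struct phyr *phyrs, unsigned int nr_phyrs);

/* Allocate an sg table covering the given ranges. */
struct sg_table *sg_map_phyrs(struct phyr *phyrs, unsigned int nr_phyrs,
                              gfp_t gfp);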

P.S. If you've had the Prodigy song running through your head the whole
time you've been reading this email ... I'm sorry / You're welcome.
If people insist, we can rename this to phys_range or something boring,
but I quite like the spelling of phyr with the pronunciation of "fire".

* Re: Phyr Starter
From: Jason Gunthorpe @ 2022-01-11  0:41 UTC
  To: Matthew Wilcox
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:

> Finally, it may be possible to stop using scatterlist to describe the
> input to the DMA-mapping operation.  We may be able to get struct
> scatterlist down to just dma_address and dma_length, with chaining
> handled through an enclosing struct.

Can you talk about this some more? IMHO one of the key properties of
the scatterlist is that it can hold huge amounts of pages without
having to do any kind of special allocation due to the chaining.

The same will be true of the phyr idea right?

> I would like to see phyr replace bio_vec everywhere it's currently used.
> I don't have time to do that work now because I'm busy with folios.
> If someone else wants to take that on, I shall cheer from the sidelines.
> What I do intend to do is:

I wonder if we've mixed things up though...

IMHO there is a lot of optimization to be had by having a
datastructure that is expressly 'the physical pages underlying a
contiguous chunk of va'

If you limit it to that scenario then we can be more optimal, because
things like byte-granular offsets and sizes in the interior pages don't
need to exist.  Every interior chunk is always aligned to its order and
we only need to record the order.

An overall starting offset and total length allow computing the slice
of the first/last entry.

If the physical address is always aligned then we get 12 free bits
from the min 4k alignment and also only need to store the order, not an
arbitrary byte-granular length.

The win is, I think, that we can meaningfully cover most common cases using
only 8 bytes per physical chunk. The 12 bits can be used to encode the
common orders (4k, 2M, 1G, etc) and some smart mechanism to get
another 16 bits to cover 'everything'.

IMHO storage density here is quite important, we end up having to keep
this stuff around for a long time.

I say this here because I've always thought bio_vec/etc are more
general than we actually need, being byte-granular at every chunk.
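
A sketch of that packing (illustrative only; the helper names, and
storing the order in the low bits, are my reading of the above):

/* A 4k-aligned physical address leaves the low 12 bits free; store
 * the order of the chunk there instead of a byte-granular length. */
typedef u64 packed_chunk;

static inline packed_chunk chunk_encode(phys_addr_t addr, unsigned int order)
{
        return (addr & PAGE_MASK) | order;
}

static inline phys_addr_t chunk_addr(packed_chunk c)
{
        return c & PAGE_MASK;
}

static inline unsigned int chunk_order(packed_chunk c)
{
        return c & ~PAGE_MASK;
}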

>  - Add an interface to gup.c to pin/unpin N phyrs
>  - Add a sg_map_phyrs()
>    This will take an array of phyrs and allocate an sg for them
>  - Whatever else I need to do to make one RDMA driver happy with
>    this scheme

I already spent a lot of time cleaning all the DMA code in RDMA - it is
now nicely uniform and ready for this sort of change.  I was
expecting it to be a bio_vec, but this is fine too.

What is needed is a full scatterlist replacement, including the IOMMU
part.

For the IOMMU I would expect the datastructure to be re-used: we start
with a list of physical pages, and then 'dma map' gives us a list of
IOVA pages in another allocation, but in exactly the same
datastructure.

This 'dma map' could return a pointer to the first datastructure if
there is no iommu, or allocate a single-entry list if the whole thing can
be linearly mapped with the iommu; other baroque cases (like pci
offset/etc) will need to allocate a full array.  I.e. good HW runs fast and
is memory-efficient.
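
One possible shape for that (hypothetical names; the point is that the
no-iommu case returns the input list unchanged):

/* Returns 'phys' itself when no translation is needed, otherwise a
 * freshly allocated list of mapped ranges in the same format. */
struct phyr *dma_map_phyrs(struct device *dev, struct phyr *phys,
                           unsigned int nr, unsigned int *nr_mapped);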

It would be nice to see a patch sketch showing what this
datastructure could look like.

VFIO would like this structure as well, as it currently uses a very
inefficient page-at-a-time loop when it iommu-maps things.

Jason

* Re: Phyr Starter
From: Matthew Wilcox @ 2022-01-11  4:32 UTC
  To: Jason Gunthorpe
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> 
> > Finally, it may be possible to stop using scatterlist to describe the
> > input to the DMA-mapping operation.  We may be able to get struct
> > scatterlist down to just dma_address and dma_length, with chaining
> > handled through an enclosing struct.
> 
> Can you talk about this some more? IMHO one of the key properties of
> the scatterlist is that it can hold huge amounts of pages without
> having to do any kind of special allocation due to the chaining.
> 
> The same will be true of the phyr idea right?

My thinking is that we'd pass a relatively small array of phyr (maybe 16
entries) to get_user_phyr().  If that turned out not to be big enough,
then we have two options; one is to map those 16 ranges with sg and use
the sg chaining functionality before throwing away the phyr and calling
get_user_phyr() again.  The other is to stash those 16 ranges somewhere
(eg a resizing array of some kind) and keep calling get_user_phyr()
to get the next batch of 16; once we've got the entire range, call
sg_map_phyr() passing all of the phyrs.
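
A rough sketch of that second option (get_user_phyr(), the stash and
its helpers are all hypothetical at this point; uaddr/len describe the
user range):

struct phyr batch[16];
long i, n;

while (len) {
        size_t covered = 0;

        n = get_user_phyr(uaddr, len, batch, ARRAY_SIZE(batch));
        if (n <= 0)
                break;
        for (i = 0; i < n; i++)
                covered += batch[i].len;
        stash_phyrs(&stash, batch, n);  /* resizing array of some kind */
        uaddr += covered;
        len -= covered;
}
sg = sg_map_phyrs(stash.phyrs, stash.nr, GFP_KERNEL);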

> > I would like to see phyr replace bio_vec everywhere it's currently used.
> > I don't have time to do that work now because I'm busy with folios.
> > If someone else wants to take that on, I shall cheer from the sidelines.
> > What I do intend to do is:
> 
> I wonder if we've mixed things up though...
> 
> IMHO there is a lot of optimization to be had by having a
> datastructure that is expressly 'the physical pages underlying a
> contiguous chunk of va'
> 
> If you limit it to that scenario then we can be more optimal, because
> things like byte-granular offsets and sizes in the interior pages don't
> need to exist.  Every interior chunk is always aligned to its order and
> we only need to record the order.
> 
> An overall starting offset and total length allow computing the slice
> of the first/last entry.
> 
> If the physical address is always aligned then we get 12 free bits
> from the min 4k alignment and also only need to store the order, not an
> arbitrary byte-granular length.
> 
> The win is, I think, that we can meaningfully cover most common cases using
> only 8 bytes per physical chunk. The 12 bits can be used to encode the
> common orders (4k, 2M, 1G, etc) and some smart mechanism to get
> another 16 bits to cover 'everything'.
> 
> IMHO storage density here is quite important, we end up having to keep
> this stuff around for a long time.
> 
> I say this here because I've always thought bio_vec/etc are more
> general than we actually need, being byte-granular at every chunk.

Oh, I can do you one better on the bit-packing scheme.  There's a
representation of every power-of-two that is naturally aligned, using
just one extra bit.  Let's say we have 3 bits of address space and
4 bits to represent any power of two allocation within that address
space:

0000 index-0, order-0
0010 index-1, order-0
...
1110 index-7, order-0
0001 index-0, order-1
0101 index-2, order-1
1001 index-4, order-1
1101 index-6, order-1
0011 index-0, order-2
1011 index-4, order-2
0111 index-0, order-3

1111 has no meaning and can be used to represent an invalid range, if
that's useful.  The lowest clear bit decodes to the order, and
(x & (x+1))/2 gets you the index.

That leaves you with another 11 bits to represent something smart about
partial pages.
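
In C (the encode direction is my inference from the table above, not
from the mail):

/* x = (index << 1) | ((1 << order) - 1); index must be aligned to order */
static inline unsigned long pr_encode(unsigned long index, unsigned int order)
{
        return (index << 1) | ((1UL << order) - 1);
}

static inline unsigned int pr_order(unsigned long x)
{
        return __builtin_ctzl(~x);      /* lowest clear bit; all-ones is invalid */
}

static inline unsigned long pr_index(unsigned long x)
{
        return (x & (x + 1)) / 2;
}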

The question is whether this is the right kind of optimisation to be
doing.  I hear you that we want a dense format, but it's questionable
whether the kind of thing you're suggesting is actually denser than this
scheme.  For example, if we have 1GB pages and userspace happens to have
allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
as a single phyr.  A power-of-two scheme would have us use four entries
(3, 4-7, 8-9, 10).

Using a (dma_addr, size_t) tuple makes coalescing adjacent pages very
cheap.  If I have to walk PTEs looking for pages which can be combined
together, I end up with interesting behaviour where the length of the
list shrinks and expands.  Using the example above, as I walk successive
PUDs, the data struct looks like this:

(3)
(3, 4)
(3, 4-5)
(3, 4-5, 6)
(3, 4-7)
(3, 4-7, 8)
(3, 4-7, 8-9)
(3, 4-7, 8-9, 10)

We could end up with a situation where we stop because the array is
full, even though if we kept going, it'd shrink back down below the
length of the array (in this example, an array of length 2 would stop
when it saw page 6, even though page 7 shrinks it back down again).
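
The append operation itself is just one compare against the previous
entry, something like (hypothetical helper):

/* Merge the new range into the previous entry when adjacent.
 * Returns the new entry count, or -1 if the array is full. */
static int phyr_append(struct phyr *p, int nr, int max,
                       phys_addr_t addr, size_t len)
{
        if (nr && p[nr - 1].addr + p[nr - 1].len == addr) {
                p[nr - 1].len += len;
                return nr;
        }
        if (nr == max)
                return -1;
        p[nr].addr = addr;
        p[nr].len = len;
        return nr + 1;
}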

> What is needed is a full scatterlist replacement, including the IOMMU
> part.
> 
> For the IOMMU I would expect the datastructure to be re-used: we start
> with a list of physical pages, and then 'dma map' gives us a list of
> IOVA pages in another allocation, but in exactly the same
> datastructure.
> 
> This 'dma map' could return a pointer to the first datastructure if
> there is no iommu, or allocate a single-entry list if the whole thing can
> be linearly mapped with the iommu; other baroque cases (like pci
> offset/etc) will need to allocate a full array.  I.e. good HW runs fast and
> is memory-efficient.
> 
> It would be nice to see a patch sketch showing what this
> datastructure could look like.
> 
> VFIO would like this structure as well, as it currently uses a very
> inefficient page-at-a-time loop when it iommu-maps things.

I agree that you need these things.  I think I'll run into trouble
if I try to do them for you ... so I'm going to stop after doing the
top end (pinning pages and getting them into an sg list) and let
people who know that area better than I do work on that.

* Re: Phyr Starter
From: John Hubbard @ 2022-01-11  8:17 UTC
  To: Matthew Wilcox, linux-kernel
  Cc: Christoph Hellwig, Jason Gunthorpe, Joao Martins,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On 1/10/22 11:34, Matthew Wilcox wrote:
> TLDR: I want to introduce a new data type:
> 
> struct phyr {
>          phys_addr_t addr;
>          size_t len;
> };
> 
> and use it to replace bio_vec, as well as the array of struct pages
> used by get_user_pages() and friends.
> 
> ---

This would certainly solve quite a few problems at once. Very compelling.

Zooming in on the pinning aspect for a moment: last time I attempted to
convert O_DIRECT callers from gup to pup, I recall wanting very much to
record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
or some non-FOLL_PIN method. Because at the end of the IO, it is not
easy to disentangle which pages require put_page() and which require
unpin_user_page*().

And changing the bio_vec for *that* purpose was not really acceptable.

But now that you're looking to change it in a big way (and with some
spare bits available... oohh!), maybe I can go that direction after all.

Or, are you looking at a design in which any phyr is implicitly FOLL_PIN'd
if it exists at all?

Or any other thoughts in this area are very welcome.

> 
> There are two distinct problems I want to address: doing I/O to memory
> which does not have a struct page and efficiently doing I/O to large
> blobs of physically contiguous memory, regardless of whether it has a
> struct page.  There are some other improvements which I regard as minor.
> 
> There are many types of memory that one might want to do I/O to that do
> not have a struct page, some examples:
>   - Memory on a graphics card (or other PCI card, but gfx seems to be
>     the primary provider of DRAM on the PCI bus today)
>   - DAX, or other pmem (there are some fake pages today, but this is
>     mostly a workaround for the IO problem)
>   - Guest memory being accessed from the hypervisor (KVM needs to
>     create struct pages to make this happen.  Xen doesn't ...)
> All of these kinds of memories can be addressed by the CPU and so also
> by a bus master.  That is, there is a physical address that the CPU
> can use which will address this memory, and there is a way to convert
> that to a DMA address which can be programmed into another device.
> There's no intent here to support memory which can be accessed by a
> complex scheme like writing an address to a control register and then
> accessing the memory through a FIFO; this is for memory which can be
> accessed by DMA and CPU loads and stores.
> 
> For get_user_pages() and friends, we currently fill an array of struct
> pages, each one representing PAGE_SIZE bytes.  For an application that
> is using 1GB hugepages, writing 2^18 entries is a significant overhead.
> It also makes drivers hard to write, as they have to recoalesce the
> struct pages even though the VM can tell them whether those 2^18 pages
> are contiguous.
> 
> On the minor side, struct phyr can represent any mappable chunk of memory.
> A bio_vec is limited to 2^32 bytes, while on 64-bit machines a phyr
> can represent ranges larger than 4GB.  A phyr is the same size as a bio_vec
> on 64-bit (16 bytes), and the same size for 32-bit with PAE (12 bytes).
> It is smaller for 32-bit machines without PAE (8 bytes instead of 12).
> 
> Finally, it may be possible to stop using scatterlist to describe the
> input to the DMA-mapping operation.  We may be able to get struct
> scatterlist down to just dma_address and dma_length, with chaining
> handled through an enclosing struct.
> 
> I would like to see phyr replace bio_vec everywhere it's currently used.
> I don't have time to do that work now because I'm busy with folios.
> If someone else wants to take that on, I shall cheer from the sidelines.

I'm starting to wonder if I should jump in here, in order to get this
as a way to make the O_DIRECT conversion much cleaner. But let's see.

> What I do intend to do is:
> 
>   - Add an interface to gup.c to pin/unpin N phyrs
>   - Add a sg_map_phyrs()
>     This will take an array of phyrs and allocate an sg for them
>   - Whatever else I need to do to make one RDMA driver happy with
>     this scheme
> 
> At that point, I intend to stop and let others more familiar with this
> area of the kernel continue the conversion of drivers.
> 
> P.S. If you've had the Prodigy song running through your head the whole
> time you've been reading this email ... I'm sorry / You're welcome.
> If people insist, we can rename this to phys_range or something boring,
> but I quite like the spelling of phyr with the pronunciation of "fire".

A more conservative or traditional name might look like:

     phys_vec (maintains some resemblance to what it's replacing)
     phys_range
     phys_addr_range

phyr is rather cool, but it is also awfully close to "phys" for
reading comfort. And there is a lot to be said for self-descriptive names,
which phyr is not.

And of course, you're signing up for another huge naming debate with Linus
if you go with the "cool" name here. :)


thanks,
-- 
John Hubbard
NVIDIA

* Re: Phyr Starter
From: Daniel Vetter @ 2022-01-11  9:05 UTC
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, nvdimm, linux-rdma, John Hubbard, linux-kernel,
	dri-devel, Ming Lei, linux-block, linux-mm, netdev, Joao Martins,
	Logan Gunthorpe, Christoph Hellwig

Dropping some thoughts from the gpu driver perspective, feel free to
tell me it's nonsense from the mm/block view :-)

Generally I think we really, really need something like this that's
across all subsystems and consistent.

On Tue, Jan 11, 2022 at 1:41 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> > Finally, it may be possible to stop using scatterlist to describe the
> > input to the DMA-mapping operation.  We may be able to get struct
> > scatterlist down to just dma_address and dma_length, with chaining
> > handled through an enclosing struct.
>
> Can you talk about this some more? IMHO one of the key properties of
> the scatterlist is that it can hold huge amounts of pages without
> having to do any kind of special allocation due to the chaining.
>
> The same will be true of the phyr idea right?
>
> > I would like to see phyr replace bio_vec everywhere it's currently used.
> > I don't have time to do that work now because I'm busy with folios.
> > If someone else wants to take that on, I shall cheer from the sidelines.
> > What I do intend to do is:
>
> I wonder if we've mixed things up though...
>
> IMHO there is a lot of optimization to be had by having a
> datastructure that is expressly 'the physical pages underlying a
> contiguous chunk of va'
>
> If you limit it to that scenario then we can be more optimal, because
> things like byte-granular offsets and sizes in the interior pages don't
> need to exist.  Every interior chunk is always aligned to its order and
> we only need to record the order.

At least from the gfx side of things, only allowing page-sized chunks
makes a lot of sense.  sg chains kinda feel wrong because they allow
byte chunks (but really that's just not allowed), so there's quite a
mismatch.

If we go with a page size, I think hardcoding PHYS_PAGE_SIZE to KB(4)
would make sense, because thanks to x86 that's pretty much the lowest
common denominator that all hw (at least all that I know of) supports.
Not having to fiddle with "which page size do we have" in driver code
would be neat; it currently makes writing portable gup code in drivers
just needlessly silly.
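
I.e. roughly (my spelling of Daniel's suggestion):

#define PHYS_PAGE_SHIFT 12
#define PHYS_PAGE_SIZE  (1UL << PHYS_PAGE_SHIFT)  /* 4KiB, independent of PAGE_SIZE */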

> An overall starting offset and total length allow computing the slice
> of the first/last entry.
>
> If the physical address is always aligned then we get 12 free bits
> from the min 4k alignment and also only need to store the order, not an
> arbitrary byte-granular length.
>
> The win is, I think, that we can meaningfully cover most common cases using
> only 8 bytes per physical chunk. The 12 bits can be used to encode the
> common orders (4k, 2M, 1G, etc) and some smart mechanism to get
> another 16 bits to cover 'everything'.
>
> IMHO storage density here is quite important, we end up having to keep
> this stuff around for a long time.
>
> I say this here because I've always thought bio_vec/etc are more
> general than we actually need, being byte-granular at every chunk.
>
> >  - Add an interface to gup.c to pin/unpin N phyrs
> >  - Add a sg_map_phyrs()
> >    This will take an array of phyrs and allocate an sg for them
> >  - Whatever else I need to do to make one RDMA driver happy with
> >    this scheme
>
> I already spent a lot of time cleaning all the DMA code in RDMA - it is
> now nicely uniform and ready for this sort of change.  I was
> expecting it to be a bio_vec, but this is fine too.
>
> What is needed is a full scatterlist replacement, including the IOMMU
> part.
>
> For the IOMMU I would expect the datastructure to be re-used: we start
> with a list of physical pages, and then 'dma map' gives us a list of
> IOVA pages in another allocation, but in exactly the same
> datastructure.
>
> This 'dma map' could return a pointer to the first datastructure if
> there is no iommu, or allocate a single-entry list if the whole thing can
> be linearly mapped with the iommu; other baroque cases (like pci
> offset/etc) will need to allocate a full array.  I.e. good HW runs fast and
> is memory-efficient.
>
> It would be nice to see a patch sketch showing what this
> datastructure could look like.
>
> VFIO would like this structure as well, as it currently uses a very
> inefficient page-at-a-time loop when it iommu-maps things.

I also think it would be nice if we could have this as a consistent
set of datastructures, both for dma_addr_t and phys_addr_t. Roughly my
wishlist:
- chained by default, because of the allocation headaches, so I'm with
Jason here
- comes compressed out of gup, I think we all agree on this, I don't
really care how it's compressed as long as I don't get 512 entries for
2M pages :-)
- ideally the dma_ and the phys_ part are split, since for dma-buf and
some gpu-driver-internal use cases there are some where really only the
dma addr is valid, and nothing else. But we can also continue the
current hack of pretending the other side isn't there (atm we scramble
those pointers to make sure dma-buf users don't cheat)
- I think at minimum we need an sg-list form of dma-mapped stuff which
does not have a struct page; iirc last time we discussed this we agreed
that it really needs to be part of such a rework or it's not really
improving things much
- a few per-entry driver bits would be nice in both the phys/dma
chains, if we can have them. gpus have funny gpu interconnects, this
would allow us to put all the gpu addresses into dma_addr_t if we can
have some bits indicating whether it's on the pci bus, gpu local
memory or the gpu<->gpu interconnect.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: Phyr Starter
From: Thomas Zimmermann @ 2022-01-11 11:40 UTC
  To: Matthew Wilcox, linux-kernel
  Cc: nvdimm, linux-rdma, John Hubbard, dri-devel, Ming Lei,
	linux-block, linux-mm, Jason Gunthorpe, netdev, Joao Martins,
	Logan Gunthorpe, Christoph Hellwig


Hi

On 10.01.22 at 20:34, Matthew Wilcox wrote:
> TLDR: I want to introduce a new data type:
> 
> struct phyr {
>          phys_addr_t addr;
>          size_t len;
> };

Did you look at struct dma_buf_map? [1]

For graphics framebuffers, we have the problem that these buffers can be 
in I/O or system memory (and possibly move between them). Linux' 
traditional interfaces (memcpy_toio(), etc) don't deal with the 
differences well.

So we added struct dma_buf_map as an abstraction to the buffer address. 
There are interfaces for accessing and copying the data. I also have a 
patchset somewhere that adds caching information to the structure. 
struct dma_buf_map is for graphics, but really just another memory API.

When we introduced struct dma_buf_map, we thought about additional use
cases but couldn't really find any at the time.  Maybe what you're
describing is that use case, and struct dma_buf_map could be extended for
this purpose.
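
For reference, the structure is essentially a tagged address (this is
my recollection of v5.16's include/linux/dma-buf-map.h, so treat the
exact layout as approximate):

struct dma_buf_map {
        union {
                void __iomem *vaddr_iomem;
                void *vaddr;
        };
        bool is_iomem;
};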

Best regards
Thomas

[1] https://elixir.bootlin.com/linux/v5.16/source/include/linux/dma-buf-map.h#L115

> 
> and use it to replace bio_vec, as well as the array of struct pages
> used by get_user_pages() and friends.
> 
> ---
> 
> There are two distinct problems I want to address: doing I/O to memory
> which does not have a struct page and efficiently doing I/O to large
> blobs of physically contiguous memory, regardless of whether it has a
> struct page.  There are some other improvements which I regard as minor.
> 
> There are many types of memory that one might want to do I/O to that do
> not have a struct page, some examples:
>   - Memory on a graphics card (or other PCI card, but gfx seems to be
>     the primary provider of DRAM on the PCI bus today)
>   - DAX, or other pmem (there are some fake pages today, but this is
>     mostly a workaround for the IO problem)
>   - Guest memory being accessed from the hypervisor (KVM needs to
>     create struct pages to make this happen.  Xen doesn't ...)
> All of these kinds of memories can be addressed by the CPU and so also
> by a bus master.  That is, there is a physical address that the CPU
> can use which will address this memory, and there is a way to convert
> that to a DMA address which can be programmed into another device.
> There's no intent here to support memory which can be accessed by a
> complex scheme like writing an address to a control register and then
> accessing the memory through a FIFO; this is for memory which can be
> accessed by DMA and CPU loads and stores.
> 
> For get_user_pages() and friends, we currently fill an array of struct
> pages, each one representing PAGE_SIZE bytes.  For an application that
> is using 1GB hugepages, writing 2^18 entries is a significant overhead.
> It also makes drivers hard to write, as they have to recoalesce the
> struct pages even though the VM can tell them whether those 2^18 pages
> are contiguous.
> 
> On the minor side, struct phyr can represent any mappable chunk of memory.
> A bio_vec is limited to 2^32 bytes, while on 64-bit machines a phyr
> can represent ranges larger than 4GB.  A phyr is the same size as a bio_vec
> on 64-bit (16 bytes), and the same size for 32-bit with PAE (12 bytes).
> It is smaller for 32-bit machines without PAE (8 bytes instead of 12).
> 
> Finally, it may be possible to stop using scatterlist to describe the
> input to the DMA-mapping operation.  We may be able to get struct
> scatterlist down to just dma_address and dma_length, with chaining
> handled through an enclosing struct.
> 
> I would like to see phyr replace bio_vec everywhere it's currently used.
> I don't have time to do that work now because I'm busy with folios.
> If someone else wants to take that on, I shall cheer from the sidelines.
> What I do intend to do is:
> 
>   - Add an interface to gup.c to pin/unpin N phyrs
>   - Add a sg_map_phyrs()
>     This will take an array of phyrs and allocate an sg for them
>   - Whatever else I need to do to make one RDMA driver happy with
>     this scheme
> 
> At that point, I intend to stop and let others more familiar with this
> area of the kernel continue the conversion of drivers.
> 
> P.S. If you've had the Prodigy song running through your head the whole
> time you've been reading this email ... I'm sorry / You're welcome.
> If people insist, we can rename this to phys_range or something boring,
> but I quite like the spelling of phyr with the pronunciation of "fire".

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev

* Re: Phyr Starter
From: Matthew Wilcox @ 2022-01-11 13:56 UTC
  To: Thomas Zimmermann
  Cc: linux-kernel, nvdimm, linux-rdma, John Hubbard, dri-devel,
	Ming Lei, linux-block, linux-mm, Jason Gunthorpe, netdev,
	Joao Martins, Logan Gunthorpe, Christoph Hellwig

On Tue, Jan 11, 2022 at 12:40:10PM +0100, Thomas Zimmermann wrote:
> Hi
> 
> On 10.01.22 at 20:34, Matthew Wilcox wrote:
> > TLDR: I want to introduce a new data type:
> > 
> > struct phyr {
> >          phys_addr_t addr;
> >          size_t len;
> > };
> 
> Did you look at struct dma_buf_map? [1]

Thanks.  I wasn't aware of that.  It doesn't seem to actually solve the
problem, in that it doesn't carry any length information.  Did you mean
to point me at a different structure?


* Re: Phyr Starter
  2022-01-11  8:17   ` John Hubbard
@ 2022-01-11 14:01     ` Matthew Wilcox
  -1 siblings, 0 replies; 62+ messages in thread
From: Matthew Wilcox @ 2022-01-11 14:01 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-kernel, Christoph Hellwig, Jason Gunthorpe, Joao Martins,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 12:17:18AM -0800, John Hubbard wrote:
> Zooming in on the pinning aspect for a moment: last time I attempted to
> convert O_DIRECT callers from gup to pup, I recall wanting very much to
> record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
> or some non-FOLL_PIN method. Because at the end of the IO, it is not
> easy to disentangle which pages require put_page() and which require
> unpin_user_page*().
> 
> And changing the bio_vec for *that* purpose was not really acceptable.
> 
> But now that you're looking to change it in a big way (and with some
> spare bits available...oohh!), maybe I can go that direction after all.
> 
> Or, are you looking at a design in which any phyr is implicitly FOLL_PIN'd
> if it exists at all?

That.  I think there are still good reasons to keep a single-page (or
maybe dual-page) GUP around, but no reason to mix it with ranges.

> Or any other thoughts in this area are very welcome.

One other thought: there's no support for unpinning part of a range.
You pin it, do the IO, unpin it.  That simplifies the accounting.



* Re: Phyr Starter
  2022-01-11 13:56     ` Matthew Wilcox
  (?)
@ 2022-01-11 14:10     ` Thomas Zimmermann
  -1 siblings, 0 replies; 62+ messages in thread
From: Thomas Zimmermann @ 2022-01-11 14:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: nvdimm, linux-rdma, John Hubbard, linux-kernel, dri-devel,
	Ming Lei, linux-block, linux-mm, Jason Gunthorpe, netdev,
	Joao Martins, Logan Gunthorpe, Christoph Hellwig



Hi

Am 11.01.22 um 14:56 schrieb Matthew Wilcox:
> On Tue, Jan 11, 2022 at 12:40:10PM +0100, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 10.01.22 um 20:34 schrieb Matthew Wilcox:
>>> TLDR: I want to introduce a new data type:
>>>
>>> struct phyr {
>>>           phys_addr_t addr;
>>>           size_t len;
>>> };
>>
>> Did you look at struct dma_buf_map? [1]
> 
> Thanks.  I wasn't aware of that.  It doesn't seem to actually solve the
> problem, in that it doesn't carry any length information.  Did you mean
> to point me at a different structure?
> 

It's the structure I meant. It refers to a buffer, so the length could 
be added. For something more sophisticated, dma_buf_map could be changed 
to distinguish between the buffer and an iterator pointing into the buffer.

But if it's really different, then so be it.

Best regards
Thomas

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev


* Re: Phyr Starter
  2022-01-11  4:32     ` Matthew Wilcox
@ 2022-01-11 15:01       ` Jason Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2022-01-11 15:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 04:32:56AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> > 
> > > Finally, it may be possible to stop using scatterlist to describe the
> > > input to the DMA-mapping operation.  We may be able to get struct
> > > scatterlist down to just dma_address and dma_length, with chaining
> > > handled through an enclosing struct.
> > 
> > Can you talk about this some more? IMHO one of the key properties of
> > the scatterlist is that it can hold huge amounts of pages without
> > having to do any kind of special allocation due to the chaining.
> > 
> > The same will be true of the phyr idea right?
> 
> My thinking is that we'd pass a relatively small array of phyr (maybe 16
> entries) to get_user_phyr().  If that turned out not to be big enough,
> then we have two options; one is to map those 16 ranges with sg and use
> the sg chaining functionality before throwing away the phyr and calling
> get_user_phyr() again. 

Then why are we using get_user_phyr() at all if we are just storing it
in an sg?

Also 16 entries is way too small, it should be at least a whole PMD
worth so we don't have to relock the PMD level each iteration.

I would like to see a flow more like:

  cpu_phyr_list = get_user_phyr(uptr, 1G);
  dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
  [..]
  dma_unmap_phyr(device, dma_phyr_list);
  unpin_dirty_free(cpu_phyr_list);

Where dma_map_phyr() can build a temporary SGL for old iommu driver
compatibility. iommu drivers would want to implement natively, of
course.

ie no loops in drivers.

> The question is whether this is the right kind of optimisation to be
> doing.  I hear you that we want a dense format, but it's questionable
> whether the kind of thing you're suggesting is actually denser than this
> scheme.  For example, if we have 1GB pages and userspace happens to have
> allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
> as a single phyr.  A power-of-two scheme would have us use four entries
> (3, 4-7, 8-9, 10).

That is not quite what I had in mind..

struct phyr_list {
   unsigned int first_page_offset_bytes;
   size_t total_length_bytes;
   phys_addr_t min_alignment;
   struct packed_phyr *list_of_pages;
};

Where each 'packed_phyr' is an aligned page of some kind. The packing
has to be able to represent any number of pfns, so we have four major
cases:
 - 4k pfns (use 8 bytes)
 - Natural order pfn (use 8 bytes)
 - 4k aligned pfns, arbitrary number (use 12 bytes)
 - <4k aligned, arbitrary length (use 16 bytes?)

In all cases the interior pages are fully used, only the first and
last page is sliced based on the two parameters in the phyr_list.

The first_page_offset_bytes/total_length_bytes mean we don't need to
use the inefficient coding for many common cases, just stick to the 4k
coding and slice the first/last page down.
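
A worked example (numbers made up): pinning 6000 bytes starting at
physical address 0x10800 spans two 4k pages, and would be described as:

	list_of_pages           = { pfn 0x10, pfn 0x11 }  /* two 4k entries */
	min_alignment           = 4096
	first_page_offset_bytes = 0x800
	total_length_bytes      = 6000

Both pages are stored as plain 4k entries; the byte slice at each end
comes only from the two header fields.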

The last case is, perhaps, a possible route to completely replace
scatterlist. Few places need true byte granularity for interior pages,
so we can invent some coding to say 'this is 8 byte aligned, and n
bytes long' that only fits < 4k or something. Exceptional cases can
then still work. I'm not sure what block needs here - is it just 512?

Basically, think of list_of_pages as showing a contiguous list of pages
aligned to at least min_alignment, with first_page_offset_bytes/total_length_bytes
taking a byte granular slice out of that logical range.

From a HW perspective I see two basic modalities:

 - Streaming HW, which reads/writes in a single pass (think
   NVMe/storage/network). Usually takes a list of dma_addr_t and
   length that HW just walks over. Rarely cares about things like page
   boundaries. Optimization goal is to minimize list length. In this
   case we map each packed_phyr into a HW SGL

 - Random Access HW, which is randomly touching memory (think RDMA,
   VFIO, DRM, IOMMU). Usually stores either a linear list of same-size
   dma_addr_t pages, or a radix tree page table of dma_addr_t.
   Needs to have a minimum alignment of each chunk (usually 4k) to
   represent it. Optimization goal is to have maximum page size. In
   this case we use min_alignment to size the HW array and decode the
   packed_phyrs into individual pages.
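
In that second case the decode step would be something like this
(helper names hypothetical):

	/* expand each packed_phyr into min_alignment-sized HW pages */
	for_each_packed_phyr(p, list) {
		for (off = 0; off < packed_phyr_len(p); off += list->min_alignment)
			hw_pages[i++] = packed_phyr_addr(p) + off;
	}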

> Using a (dma_addr, size_t) tuple makes coalescing adjacent pages very
> cheap.

With the above this still works, the very last entry in list_of_pages
would be the 12 byte pfn type and when we start a new page the logic
would then optimize it down to 8 bytes, if possible. At that point we
know we are not going to change it:
 
 - An interior page that ends up perfectly aligned is represented as a
   natural order
 - A starting page that ends on perfect alignment is widened to
   natural order and first_page_offset_bytes is corrected
 - An ending page that starts on perfect alignment is widened to
   natural order and total_length_bytes is set
   (though no harm in keeping the 12 byte representation I suppose)

The main detail is to make the extra 4 bytes needed to store the
arbitrary pfn counts optional so when we don't need it, it isn't there.

> > VFIO would like this structure as well as it currently is a very
> > inefficient page at a time loop when it iommu maps things.
> 
> I agree that you need these things.  I think I'll run into trouble
> if I try to do them for you ... so I'm going to stop after doing the
> top end (pinning pages and getting them into an sg list) and let
> people who know that area better than I do work on that.

I agree, that is why I was asking for the datastructure 'phyr_list',
with chaining and so on.

I would imagine a few steps to this process:
 1) 'phyr_list' datastructure, with chaining, pre-allocation, etc
 2) Wrapper around existing gup to get a phyr_list for user VA
 3) Compat 'dma_map_phyr()' that converts a phyr_list to a sgl and back
    (However, with full performance for iommu passthrough)
 4) Patches changing RDMA/VFIO/DRM to this API
 5) Patches optimizing get_user_phyr()
 6) Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver

Obviously not all done by you.. I'm happy to help and take a swing at
the RDMA and VFIO parts.

I feel like we can go ahead with RDMA so long as the passthrough IOMMU
case is 100% efficient. VFIO is already maximally inefficient here,
so no worry there. If VFIO and RDMA consume these APIs then
the server IOMMU driver owners should have a strong motivation to
implement.

My feeling is that at the core this project is about making a better
datastructure than scatterlist that can mostly replace it, then going
around the kernel and converting scatterlist users.

Given all the usage considerations I think it is an interesting
datastructure.

Jason


* Re: Phyr Starter
  2022-01-11 14:01     ` Matthew Wilcox
@ 2022-01-11 15:02       ` Jason Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2022-01-11 15:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: John Hubbard, linux-kernel, Christoph Hellwig, Joao Martins,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 02:01:17PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 11, 2022 at 12:17:18AM -0800, John Hubbard wrote:
> > Zooming in on the pinning aspect for a moment: last time I attempted to
> > convert O_DIRECT callers from gup to pup, I recall wanting very much to
> > record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
> > or some non-FOLL_PIN method. Because at the end of the IO, it is not
> > easy to disentangle which pages require put_page() and which require
> > unpin_user_page*().
> > 
> > And changing the bio_vec for *that* purpose was not really acceptable.
> > 
> > But now that you're looking to change it in a big way (and with some
> spare bits available...oohh!), maybe I can go that direction after all.
> > 
> > Or, are you looking at a design in which any phyr is implicitly FOLL_PIN'd
> > if it exists at all?
> 
> That.  I think there are still good reasons to keep a single-page (or
> maybe dual-page) GUP around, but no reason to mix it with ranges.
> 
> > Or any other thoughts in this area are very welcome.
> 
> One other thought: there's no support for unpinning part of a range.
> You pin it, do the IO, unpin it.  That simplifies the accounting.

VFIO wouldn't like this :(

Jason
 


* Re: Phyr Starter
  2022-01-11  8:17   ` John Hubbard
@ 2022-01-11 17:31     ` Logan Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Logan Gunthorpe @ 2022-01-11 17:31 UTC (permalink / raw)
  To: John Hubbard, Matthew Wilcox, linux-kernel
  Cc: Christoph Hellwig, Jason Gunthorpe, Joao Martins, Ming Lei,
	linux-block, netdev, linux-mm, linux-rdma, dri-devel, nvdimm



On 2022-01-11 1:17 a.m., John Hubbard wrote:
> On 1/10/22 11:34, Matthew Wilcox wrote:
>> TLDR: I want to introduce a new data type:
>>
>> struct phyr {
>>          phys_addr_t addr;
>>          size_t len;
>> };
>>
>> and use it to replace bio_vec as well as using it to replace the array
>> of struct pages used by get_user_pages() and friends.
>>
>> ---
> 
> This would certainly solve quite a few problems at once. Very compelling.

I agree.

> Zooming in on the pinning aspect for a moment: last time I attempted to
> convert O_DIRECT callers from gup to pup, I recall wanting very much to
> record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
> or some non-FOLL_PIN method. Because at the end of the IO, it is not
> easy to disentangle which pages require put_page() and which require
> unpin_user_page*().
> 
> And changing the bio_vec for *that* purpose was not really acceptable.
> 
> But now that you're looking to change it in a big way (and with some
> spare bits available...oohh!), maybe I can go that direction after all.
> 
> Or, are you looking at a design in which any phyr is implicitly FOLL_PIN'd
> if it exists at all?

I'd also second being able to store a handful of flags in each phyr. My
userspace P2PDMA patchset needs to add a flag to each sgl to indicate
whether it was mapped as a bus address or not (which would be necessary
on the DMA-mapped side of dma_map_phyr()).

Though, it's not immediately obvious where to put the flags without
increasing the size of the structure :(
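
The usual dodge (the same one scatterlist plays with the low bits of
page_link) would be to steal low bits of addr, which only works if we
guarantee a minimum alignment.  A sketch, assuming ranges are at least
4-byte aligned, which may be too strong an assumption for byte-granular
phyrs:

	#define PHYR_FLAG_BUS_ADDR	0x1UL	/* hypothetical flag bit */
	#define PHYR_FLAG_MASK		0x3UL

	static inline phys_addr_t phyr_addr(const struct phyr *p)
	{
		return p->addr & ~(phys_addr_t)PHYR_FLAG_MASK;
	}

	static inline unsigned long phyr_flags(const struct phyr *p)
	{
		return p->addr & PHYR_FLAG_MASK;
	}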

Logan


* Re: Phyr Starter
  2022-01-11 15:01       ` Jason Gunthorpe
@ 2022-01-11 18:33         ` Matthew Wilcox
  -1 siblings, 0 replies; 62+ messages in thread
From: Matthew Wilcox @ 2022-01-11 18:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 11:01:42AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 11, 2022 at 04:32:56AM +0000, Matthew Wilcox wrote:
> > On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> > > 
> > > > Finally, it may be possible to stop using scatterlist to describe the
> > > > input to the DMA-mapping operation.  We may be able to get struct
> > > > scatterlist down to just dma_address and dma_length, with chaining
> > > > handled through an enclosing struct.
> > > 
> > > Can you talk about this some more? IMHO one of the key properties of
> > > the scatterlist is that it can hold huge amounts of pages without
> > > having to do any kind of special allocation due to the chaining.
> > > 
> > > The same will be true of the phyr idea right?
> > 
> > My thinking is that we'd pass a relatively small array of phyr (maybe 16
> > entries) to get_user_phyr().  If that turned out not to be big enough,
> > then we have two options; one is to map those 16 ranges with sg and use
> > the sg chaining functionality before throwing away the phyr and calling
> > get_user_phyr() again. 
> 
> Then why are we using get_user_phyr() at all if we are just storing it
> in an sg?

I did consider just implementing get_user_sg() (actually 4 years ago),
but that cements the use of sg as both an input and output data structure
for DMA mapping, which I am under the impression we're trying to get
away from.

> Also 16 entries is way too small, it should be at least a whole PMD
> worth so we don't have to relock the PMD level each iteration.
> 
> I would like to see a flow more like:
> 
>   cpu_phyr_list = get_user_phyr(uptr, 1G);
>   dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
>   [..]
>   dma_unmap_phyr(device, dma_phyr_list);
>   unpin_dirty_free(cpu_phyr_list);
> 
> Where dma_map_phyr() can build a temporary SGL for old iommu driver
> compatibility. iommu drivers would want to implement natively, of
> course.
> 
> ie no loops in drivers.

Let me just rewrite that for you ...

	umem->phyrs = get_user_phyrs(addr, size, &umem->phyr_len);
	umem->sgt = dma_map_phyrs(device, umem->phyrs, umem->phyr_len,
			DMA_BIDIRECTIONAL, dma_attr);
	...
	dma_unmap_phyrs(device, umem->phyrs, umem->phyr_len, umem->sgt->sgl,
			umem->sgt->nents, DMA_BIDIRECTIONAL, dma_attr);
	sg_free_table(umem->sgt);
	free_user_phyrs(umem->phyrs, umem->phyr_len);

> > The question is whether this is the right kind of optimisation to be
> > doing.  I hear you that we want a dense format, but it's questionable
> > whether the kind of thing you're suggesting is actually denser than this
> > scheme.  For example, if we have 1GB pages and userspace happens to have
> > allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
> > as a single phyr.  A power-of-two scheme would have us use four entries
> > (3, 4-7, 8-9, 10).
> 
> That is not quite what I had in mind..
> 
> struct phyr_list {
>    unsigned int first_page_offset_bytes;
>    size_t total_length_bytes;
>    phys_addr_t min_alignment;
>    struct packed_phyr *list_of_pages;
> };
> 
> Where each 'packed_phyr' is an aligned page of some kind. The packing
> has to be able to represent any number of pfns, so we have four major
> cases:
>  - 4k pfns (use 8 bytes)
>  - Natural order pfn (use 8 bytes)
>  - 4k aligned pfns, arbitrary number (use 12 bytes)
>  - <4k aligned, arbitrary length (use 16 bytes?)
> 
> In all cases the interior pages are fully used, only the first and
> last page is sliced based on the two parameters in the phyr_list.

This kind of representation works for a virtually contiguous range.
Unfortunately, that's not sufficient for some bio users (as I discovered
after getting a representation like this enshrined in the NVMe spec as
the PRP List).

> The last case is, perhaps, a possible route to completely replace
> scatterlist. Few places need true byte granularity for interior pages,
> so we can invent some coding to say 'this is 8 byte aligned, and n
> bytes long' that only fits < 4k or something. Exceptional cases can
> then still work. I'm not sure what block needs here - is it just 512?

Replacing scatterlist is not my goal.  That seems like a lot more work
for little gain.  I just want to delete page_link, offset and length
from struct scatterlist.  Given the above sequence of calls, we're going
to get sg lists that aren't chained.  They may have to be vmalloced,
but they should be contiguous.
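
That is, the end state would be roughly:

	struct scatterlist {
		dma_addr_t	dma_address;
		unsigned int	dma_length;
	};

with the chaining moved out into whatever struct wraps the array.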

> I would imagine a few steps to this process:
>  1) 'phyr_list' datastructure, with chaining, pre-allocation, etc
>  2) Wrapper around existing gup to get a phyr_list for user VA
>  3) Compat 'dma_map_phyr()' that converts a phyr_list to a sgl and back
>     (However, with full performance for iommu passthrough)
>  4) Patches changing RDMA/VFIO/DRM to this API
>  5) Patches optimizing get_user_phyr()
>  6) Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver

I was thinking ...

1. get_user_phyrs() & free_user_phyrs()
2. dma_map_phyrs() and dma_unmap_phyrs() wrappers that create a
   scatterlist from phyrs and call dma_map_sg() / dma_unmap_sg() to work
   with current IOMMU drivers
3. Convert umem to work with it
4-n. Hand it off to people who can implement dma_map_phyrs() properly
   and do the hard work of converting all the drivers.



* Re: Phyr Starter
  2022-01-11 18:33         ` Matthew Wilcox
@ 2022-01-11 20:21           ` Jason Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2022-01-11 20:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 06:33:57PM +0000, Matthew Wilcox wrote:

> > Then why are we using get_user_phyr() at all if we are just storing it
> > in an sg?
> 
> I did consider just implementing get_user_sg() (actually 4 years ago),
> but that cements the use of sg as both an input and output data structure
> for DMA mapping, which I am under the impression we're trying to get
> away from.

I know that every time I've talked about a get_user_sg() Christoph has
been against it and said we need to stop using scatterlist...

> > Also 16 entries is way too small, it should be at least a whole PMD
> > worth so we don't have to relock the PMD level each iteration.
> > 
> > I would like to see a flow more like:
> > 
> >   cpu_phyr_list = get_user_phyr(uptr, 1G);
> >   dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
> >   [..]
> >   dma_unmap_phyr(device, dma_phyr_list);
> >   unpin_dirty_free(cpu_phyr_list);
> > 
> > Where dma_map_phyr() can build a temporary SGL for old iommu driver
> > compatibility. iommu drivers would want to implement natively, of
> > course.
> > 
> > ie no loops in drivers.
> 
> Let me just rewrite that for you ...
> 
> 	umem->phyrs = get_user_phyrs(addr, size, &umem->phyr_len);
> 	umem->sgt = dma_map_phyrs(device, umem->phyrs, umem->phyr_len,
> 			DMA_BIDIRECTIONAL, dma_attr);
> 	...
> 	dma_unmap_phyrs(device, umem->phyrs, umem->phyr_len, umem->sgt->sgl,
> 			umem->sgt->nents, DMA_BIDIRECTIONAL, dma_attr);
> 	sg_free_table(umem->sgt);
> 	free_user_phyrs(umem->phyrs, umem->phyr_len);

Why? As above we want to get rid of the sgl, so you are telling me that,
to adopt phyrs, I need to increase the memory consumption by a hefty
amount to store the phyrs and still keep the sgt? Why?

I don't need the sgt at all. I just need another list of physical
addresses for DMA. I see no issue with a phyr_list storing either CPU
Physical Address or DMA Physical Addresses, same data structure.

In the fairly important passthrough DMA case the CPU list and DMA list
are identical, so we don't even need to do anything.

In the typical iommu case my dma map's phyrs is only one entry.

Other cases require a larger allocation. This is the advantage against
today's scatterlist - it forces 24 bytes/page for *everyone* to
support niche architectures even if 8 bytes would have been fine for a
server platform.

> > > The question is whether this is the right kind of optimisation to be
> > > doing.  I hear you that we want a dense format, but it's questionable
> > > whether the kind of thing you're suggesting is actually denser than this
> > > scheme.  For example, if we have 1GB pages and userspace happens to have
> > > allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
> > > as a single phyr.  A power-of-two scheme would have us use four entries
> > > (3, 4-7, 8-9, 10).
> > 
> > That is not quite what I had in mind..
> > 
> > struct phyr_list {
> >    unsigned int first_page_offset_bytes;
> >    size_t total_length_bytes;
> >    phys_addr_t min_alignment;
> >    struct packed_phyr *list_of_pages;
> > };
> > 
> > Where each 'packed_phyr' is an aligned page of some kind. The packing
> > has to be able to represent any number of pfns, so we have four major
> > cases:
> >  - 4k pfns (use 8 bytes)
> >  - Natural order pfn (use 8 bytes)
> >  - 4k aligned pfns, arbitrary number (use 12 bytes)
> >  - <4k aligned, arbitrary length (use 16 bytes?)
> > 
> > In all cases the interior pages are fully used, only the first and
> > last page is sliced based on the two parameters in the phyr_list.
> 
> This kind of representation works for a virtually contiguous range.
> Unfortunately, that's not sufficient for some bio users (as I discovered
> after getting a representation like this enshrined in the NVMe spec as
> the PRP List).

This is what I was trying to convey with the 4th bullet; I'm not
suggesting a PRP list.

As an example coding - Use the first 8 bytes to encode this:

 51:0 - Physical address / 4k (ie pfn)
 56:52 - Order (simple, your order encoding can do better)
 61:57 - Unused
 63:62 - Mode, one of:
         00 = natural order pfn (8 bytes)
         01 = order aligned with length (12 bytes)
         10 = arbitrary (12 bytes)

Then the optional 4 bytes are used as:

Mode 01 (Up to 2^48 bytes of memory on a 4k alignment)
  31:0 - # of order pages

Mode 10 (Up to 2^25 bytes of memory on a 1 byte alignment)
  11:0 - starting byte offset in the 4k
  31:12 - 20 bits, plus the 5 bit order from the first 8 bytes:
          length in bytes

I think this covers everything? I assume BIO cannot be doing
non-aligned contiguous transfers beyond 2M? The above can represent
33M of arbitrary contiguous memory at 12 bytes/page.

If BIO really needs > 33M then we can use the extra mode to define a
16 byte entry that will cover everything.
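
To make the layout concrete, decoding the leading 8 bytes would look
roughly like this (helper names made up):

	static inline unsigned long pphyr_pfn(u64 w)
	{
		return w & ((1ULL << 52) - 1);		/* bits 51:0 */
	}

	static inline unsigned int pphyr_order(u64 w)
	{
		return (w >> 52) & 0x1f;		/* bits 56:52 */
	}

	static inline unsigned int pphyr_mode(u64 w)
	{
		return w >> 62;				/* bits 63:62 */
	}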

> > The last case is, perhaps, a possible route to completely replace
> > scatterlist. Few places need true byte granularity for interior pages,
> > so we can invent some coding to say 'this is 8 byte aligned, and n
> > bytes long' that only fits < 4k or something. Exceptional cases can
> > then still work. I'm not sure what block needs here - is it just 512?
> 
> Replacing scatterlist is not my goal.  That seems like a lot more work
> for little gain.  

Well, I'm not comfortable with the idea above where RDMA would have to
take a memory penalty to use the new interface. To avoid that memory
penalty we need to get rid of scatterlist entirely.

If we do the 16 byte struct from the first email then a umem for MRs
will increase in memory consumption to roughly 160% of today's 24
bytes/page. I think the HPC workloads will veto this.

This is exactly why everything has been stuck here for so long. Nobody
wants to build on scatterlist and we don't have anything that can
feasibly replace it.

IMHO scatterlist has the wrong tradeoffs, the way it uses the 24 bytes
per page isn't a win in today's world.

And on the flip side, I don't see the iommu driver people being
enthused to implement something that isn't sufficiently general.

> I just want to delete page_link, offset and length from struct
> scatterlist.  Given the above sequence of calls, we're going to get
> sg lists that aren't chained.  They may have to be vmalloced, but
> they should be contiguous.

I don't understand that? Why would the SGL out of the iommu suddenly
not be chained?

From what I've heard I'm also not keen on a physr list using vmalloc
either, that is said to be quite slow?

> > I would imagine a few steps to this process:
> >  1) 'phyr_list' datastructure, with chaining, pre-allocation, etc
> >  2) Wrapper around existing gup to get a phyr_list for user VA
> >  3) Compat 'dma_map_phyr()' that converts a phyr_list to a sgl and back
> >     (However, with full performance for iommu passthrough)
> >  4) Patches changing RDMA/VFIO/DRM to this API
> >  5) Patches optimizing get_user_phyr()
> >  6) Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver
> 
> I was thinking ...
> 
> 1. get_user_phyrs() & free_user_phyrs()
> 2. dma_map_phyrs() and dma_unmap_phyrs() wrappers that create a
>    scatterlist from phyrs and call dma_map_sg() / dma_unmap_sg() to work
>    with current IOMMU drivers

IMHO, the scatterlist has to go away. The interface should be physr
list in, physr list out.

In the two cases I care most about in RDMA, not using a scatterlist
means a lot less memory than today.

For iommu passthrough the DMA address and CPU address are the same,
so we can re-use the original physr list. So 8 bytes/page.

For the iommu case where the iommu linearizes the whole map I end up
with 1 entry for the DMA list. Also 8 bytes/page for enough pages.

Even the degenerate case where I need to have unique DMA addresses for
each page (eg bus offset, no iommu) is still only 16 bytes per page
instead of 24. (I don't have a use case for RDMA)

The rarer case of non-page aligned transfer becomes 24 bytes and is
just as bad as SGL. (RDMA doesn't ever do this)

So in typical cases for RDMA HPC workloads I go from 24 -> 8 bytes of
long term storage. This is a huge win, IMHO.
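
Structurally that suggests a map call shaped like this (entirely
hypothetical, just to show where the re-use happens):

	struct phyr_list *dma_map_phyr(struct device *dev,
				       struct phyr_list *cpu_list)
	{
		/* passthrough: DMA address == CPU physical address, so
		 * hand back the pinned list itself, allocating nothing */
		if (dma_map_is_identity(dev))		/* hypothetical */
			return cpu_list;

		/* otherwise build a (usually much smaller) DMA list */
		return iommu_map_phyrs(dev, cpu_list);	/* hypothetical */
	}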

I can live with the temporary compat performance overhead in IOMMU
mode of building a temporary SGL and then copying it to a DMA physr
list while we agitate to get the server IOMMU drivers updated, so long
as special passthrough mode is optimized to re-use the CPU list out of
the gate.

I don't realistically expect that replacing every scatterlist with
physr will ever happen. It just needs to be a sufficiently general API
that it could be done to be worth all the work to implement it. IMHO.

Jason


* Re: Phyr Starter
  2022-01-11  9:05     ` Daniel Vetter
@ 2022-01-11 20:26       ` Jason Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2022-01-11 20:26 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Matthew Wilcox, nvdimm, linux-rdma, John Hubbard, linux-kernel,
	dri-devel, Ming Lei, linux-block, linux-mm, netdev, Joao Martins,
	Logan Gunthorpe, Christoph Hellwig

On Tue, Jan 11, 2022 at 10:05:40AM +0100, Daniel Vetter wrote:

> If we go with page size I think hardcoding a PHYS_PAGE_SIZE KB(4)
> would make sense, because thanks to x86 that's pretty much the lowest
> common denominator that all hw (I know of at least) supports. Not
> having to fiddle with "which page size do we have" in driver code
> would be neat. It makes writing portable gup code in drivers just
> needlessly silly.

What I did in RDMA was make an iterator rdma_umem_for_each_dma_block()

The driver passes in the page size it wants and the iterator breaks up
the SGL into that size.

So, eg on a 16k page size system the SGL would be full of 16K stuff,
but the driver only supports 4k, and so the iterator hands out 4 pages
for each SGL entry.

All the drivers use this to build their DMA lists and tables, it works
really well.

The other part is that most RDMA drivers support many page sizes, so
there is another API to inspect the SGL and take in the device's page
size support and compute what page size the driver should use.
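
For anyone not familiar with it, the usage in the existing drivers
looks roughly like this (rdma_umem_for_each_dma_block(),
rdma_block_iter_dma_address() and ib_umem_find_best_pgsz() are the
real helpers; the surrounding code is condensed and the bitmap is a
stand-in for the device's supported page sizes):

	struct ib_block_iter biter;
	unsigned long pgsz;

	/* pick the best page size the device can use for this umem */
	pgsz = ib_umem_find_best_pgsz(umem, dev_supported_pgsz_bitmap, iova);

	/* walk the SGL in pgsz-sized blocks, whatever the entry sizes */
	rdma_umem_for_each_dma_block(umem, &biter, pgsz)
		*pas++ = rdma_block_iter_dma_address(&biter);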

> - I think minimally an sg list form of dma-mapped stuff which does not
> have a struct page, iirc last time we discussed that we agreed that
> this really needs to be part of such a rework or it's not really
> improving things much

Yes, this seems important..

> - a few per-entry driver bits would be nice in both the phys/dma
> chains, if we can have them. gpus have funny gpu interconnects, this
> would allow us to put all the gpu addresses into dma_addr_t if we can
> have some bits indicating whether it's on the pci bus, gpu local
> memory or the gpu<->gpu interconnect.

It seems useful, see my other email for a suggested coding..

Jason 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 20:21           ` Jason Gunthorpe
@ 2022-01-11 21:25             ` Matthew Wilcox
  -1 siblings, 0 replies; 62+ messages in thread
From: Matthew Wilcox @ 2022-01-11 21:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 04:21:59PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 11, 2022 at 06:33:57PM +0000, Matthew Wilcox wrote:
> 
> > > Then why are we using get_user_phyr() at all if we are just storing it
> > > in a sg?
> > 
> > I did consider just implementing get_user_sg() (actually 4 years ago),
> > but that cements the use of sg as both an input and output data structure
> > for DMA mapping, which I am under the impression we're trying to get
> > away from.
> 
> I know that every time I've talked about a get_user_sg() Christoph has
> been against it and said we need to stop using scatterlist...
> 
> > > Also 16 entries is way too small; it should be at least a whole PMD
> > > worth so we don't have to relock the PMD level each iteration.
> > > 
> > > I would like to see a flow more like:
> > > 
> > >   cpu_phyr_list = get_user_phyr(uptr, 1G);
> > >   dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
> > >   [..]
> > >   dma_unmap_phyr(device, dma_phyr_list);
> > >   unpin_dirty_free(cpu_phyr_list);
> > > 
> > > Where dma_map_phyr() can build a temporary SGL for old iommu driver
> > > compatibility. iommu drivers would want to implement natively, of
> > > course.
> > > 
> > > ie no loops in drivers.
> > 
> > Let me just rewrite that for you ...
> > 
> > 	umem->phyrs = get_user_phyrs(addr, size, &umem->phyr_len);
> > 	umem->sgt = dma_map_phyrs(device, umem->phyrs, umem->phyr_len,
> > 			DMA_BIDIRECTIONAL, dma_attr);
> > 	...
> > 	dma_unmap_phyrs(device, umem->phyrs, umem->phyr_len, umem->sgt->sgl,
> > 			umem->sgt->nents, DMA_BIDIRECTIONAL, dma_attr);
> > 	sg_free_table(umem->sgt);
> > 	free_user_phyrs(umem->phyrs, umem->phyr_len);
> 
> Why? As above we want to get rid of the sgl, so you are telling me that
> to adopt phyrs I need to increase the memory consumption by a hefty
> amount to store the phyrs and still keep the sgt now? Why?
> 
> I don't need the sgt at all. I just need another list of physical
> addresses for DMA. I see no issue with a phyr_list storing either CPU
> physical addresses or DMA addresses, same data structure.

There's a difference between a phys_addr_t and a dma_addr_t.  They
can even be different sizes; some architectures use a 32-bit dma_addr_t
and a 64-bit phys_addr_t or vice-versa.  phyr cannot store DMA addresses.

> In the fairly important passthrough DMA case the CPU list and DMA list
> are identical, so we don't even need to do anything.
> 
> In the typical iommu case my dma map's phyrs is only one entry.

That becomes a very simple sg table then.

> As an example coding - Use the first 8 bytes to encode this:
> 
>  51:0 - Physical address / 4k (ie pfn)
>  56:52 - Order (simple, your order encoding can do better)
>  61:57 - Unused
>  63:62 - Mode, one of:
>          00 = natural order pfn (8 bytes)
>          01 = order aligned with length (12 bytes)
>          10 = arbitrary (12 bytes)
> 
> Then the optional 4 bytes are used as:
> 
> Mode 01 (Up to 2^48 bytes of memory on a 4k alignment)
>   31:0 - # of order pages
> 
> Mode 10 (Up to 2^25 bytes of memory on a 1 byte alignment)
>   11:0 - starting byte offset in the 4k
>   31:12 - 20 bits, plus the 5 bit order from the first 8 bytes:
>           length in bytes

Honestly, this looks awful to operate on.  Mandatory 8-bytes per entry
with an optional 4 byte extension?
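
(To make that concrete, here is a minimal sketch of a decoder for the
proposed layout; the field positions follow the quoted description and
every name is invented:)

	struct phyr_enc {
		u64 first;	/* mandatory 8 bytes */
		u32 ext;	/* optional 4-byte extension */
	};

	static u64 penc_pfn(const struct phyr_enc *e)
	{
		return e->first & GENMASK_ULL(51, 0);
	}

	static unsigned int penc_order(const struct phyr_enc *e)
	{
		return (e->first >> 52) & 0x1f;
	}

	static size_t penc_len(const struct phyr_enc *e)
	{
		switch (e->first >> 62) {	/* mode bits */
		case 0:	/* natural order pfn */
			return (size_t)SZ_4K << penc_order(e);
		case 1:	/* ext = count of order-sized pages */
			return (size_t)e->ext << (12 + penc_order(e));
		case 2:	/* ext[31:12] plus the 5-bit order: byte length */
			return (e->ext >> 12) | ((size_t)penc_order(e) << 20);
		}
		return 0;	/* mode 3 unused */
	}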

> > > The last case is, perhaps, a possible route to completely replace
> > > scatterlist. Few places need true byte granularity for interior pages,
> > > so we can invent some coding to say 'this is 8 byte aligned, and n
> > > bytes long' that only fits < 4k or something. Exceptional cases can
> > > then still work. I'm not sure what block needs here - is it just 512?
> > 
> > Replacing scatterlist is not my goal.  That seems like a lot more work
> > for little gain.  
> 
> Well, I'm not comfortable with the idea above where RDMA would have to
> take a memory penalty to use the new interface. To avoid that memory
> penalty we need to get rid of scatterlist entirely.
> 
> If we do the 16 byte struct from the first email then a umem for MRs
> will increase in memory consumption by 160% compared to today's 24
> bytes/page. I think the HPC workloads will veto this.

Huh?  We do 16 bytes per physically contiguous range.  Then, if your HPC
workloads use an IOMMU that can map a virtually contiguous range
into a single sg entry, it uses 24 bytes for the entire mapping.
It should shrink.

> > I just want to delete page_link, offset and length from struct
> > scatterlist.  Given the above sequence of calls, we're going to get
> > sg lists that aren't chained.  They may have to be vmalloced, but
> > they should be contiguous.
> 
> I don't understand that? Why would the SGL out of the iommu suddenly
> not be chained?

Because it's being given a single set of ranges to map, instead of
being given 512 pages at a time.

> From what I've heard I'm also not keen on a phyr list using vmalloc
> either; that is said to be quite slow?

It would only be slow for degenerate cases where the pinned memory
is fragmented and not contiguous.

> > > I would imagine a few steps to this process:
> > >  1) 'phyr_list' datastructure, with chaining, pre-allocation, etc
> > >  2) Wrapper around existing gup to get a phyr_list for user VA
> > >  3) Compat 'dma_map_phyr()' that converts a phyr_list to a sgl and back
> > >     (However, with full performance for iommu passthrough)
> > >  4) Patches changing RDMA/VFIO/DRM to this API
> > >  5) Patches optimizing get_user_phyr()
> > >  6) Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver
> > 
> > I was thinking ...
> > 
> > 1. get_user_phyrs() & free_user_phyrs()
> > 2. dma_map_phyrs() and dma_unmap_phyrs() wrappers that create a
> >    scatterlist from phyrs and call dma_map_sg() / dma_unmap_sg() to work
> >    with current IOMMU drivers
> 
> IMHO, the scatterlist has to go away. The interface should be physr
> list in, physr list out.

That's reproducing the bad decision of the scatterlist, only with
a different encoding.  You end up with something like:

struct neoscat {
	dma_addr_t dma_addr;
	phys_addr_t phys_addr;
	size_t dma_len;
	size_t phys_len;
};

and the dma_addr and dma_len are unused by all-but-the-first entry when
you have a competent IOMMU.  We want a different data structure in and
out, and we may as well keep using the scatterlist for the dma-map-out.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 21:25             ` Matthew Wilcox
@ 2022-01-11 22:09               ` Logan Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Logan Gunthorpe @ 2022-01-11 22:09 UTC (permalink / raw)
  To: Matthew Wilcox, Jason Gunthorpe
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Ming Lei, linux-block, netdev, linux-mm, linux-rdma, dri-devel,
	nvdimm



On 2022-01-11 2:25 p.m., Matthew Wilcox wrote:
> That's reproducing the bad decision of the scatterlist, only with
> a different encoding.  You end up with something like:
> 
> struct neoscat {
> 	dma_addr_t dma_addr;
> 	phys_addr_t phys_addr;
> 	size_t dma_len;
> 	size_t phys_len;
> };
> 
> and the dma_addr and dma_len are unused by all-but-the-first entry when
> you have a competent IOMMU.  We want a different data structure in and
> out, and we may as well keep using the scatterlist for the dma-map-out.

With my P2PDMA patchset, even with a competent IOMMU, we need to support
multiple dma_addr/dma_len pairs (plus the flag bit). This is required to
program IOVAs and multiple bus addresses into a single DMA transaction.

I think using the scatterlist for the DMA-out side is not ideal, seeing
as we don't need the page pointers or multiple length fields, and we
won't be able to change the sgl substantially given the massive number
of existing use cases that won't go away overnight.

My hope would be along these lines:

struct phy_range {
    phys_addr_t phyr_addr;
    u32 phyr_len;
    u32 phyr_flags;
};

struct dma_range {
    dma_addr_t dmar_addr;
    u32 dmar_len;
    u32 dmar_flags;
};

A new GUP helper would somehow return a list of phy_range structs and
the new dma_map function would take that list and return a list of
dma_range structs. Each element in the list could represent a segment of
up to 4GB, so any range longer than that would need multiple items in
the list. (Alternatively, perhaps the length could be a 64-bit value and
we could steal some of the top bits for flags or some such.) The flags
would not only be needed by some of the use cases mentioned (FOLL_PIN or
DMA_BUS_ADDRESS) but could also support chaining these lists like SGLs,
so large contiguous vmallocs would not be necessary for longer lists.

If there's a [phy|dma]_range_list struct (or some such) which contains
these range structs (per some details of Jason's suggestions), that'd be
fine by me too and would just depend on implementation details.
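
A rough sketch of the entry points that implies (all of the signatures
below are invented for illustration; a real version would still need
the chaining details worked out):

	struct phy_range_list {
		unsigned int nr;
		struct phy_range ranges[];
	};

	struct dma_range_list {
		unsigned int nr;
		struct dma_range ranges[];
	};

	/* pin user memory, returning it as physical ranges */
	struct phy_range_list *get_user_phy_ranges(unsigned long start,
						   unsigned long nr_bytes,
						   unsigned int gup_flags);

	/* map the physical ranges for DMA; returns a separately
	 * allocated DMA range list */
	struct dma_range_list *dma_map_phy_ranges(struct device *dev,
						  const struct phy_range_list *phys,
						  enum dma_data_direction dir,
						  unsigned long attrs);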

However, the main problem I see is a chicken-and-egg problem. The new
dma_map function would need to be implemented by every dma_map provider,
or any driver trying to use it would need a messy fallback. Either that,
or we need a wrapper that allocates an appropriately sized SGL to pass
to any dma_map implementation that doesn't support the new structures.

Logan

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 21:25             ` Matthew Wilcox
@ 2022-01-11 22:53               ` Jason Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2022-01-11 22:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 09:25:40PM +0000, Matthew Wilcox wrote:
> > I don't need the sgt at all. I just need another list of physical
> > addresses for DMA. I see no issue with a phyr_list storing either CPU
> > physical addresses or DMA addresses, same data structure.
> 
> There's a difference between a phys_addr_t and a dma_addr_t.  They
> can even be different sizes; some architectures use a 32-bit dma_addr_t
> and a 64-bit phys_addr_t or vice-versa.  phyr cannot store DMA addresses.

I know, but I'm not sure optimizing for 32 bit phys_addr_t is
worthwhile. So I imagine phyrs forced to be 64 bits so it can always
hold a dma_addr_t and we can re-use all the machinery that supports it
for the DMA list as well.

Even on 32 bit physaddr platforms scatterlist is still 24 bytes, so
forcing 8 bytes for the phyr CPU list is still a net space win.

> > Mode 01 (Up to 2^48 bytes of memory on a 4k alignment)
> >   31:0 - # of order pages
> > 
> > Mode 10 (Up to 2^25 bytes of memory on a 1 byte alignment)
> >   11:0 - starting byte offset in the 4k
> >   31:12 - 20 bits, plus the 5 bit order from the first 8 bytes:
> >           length in bytes
> 
> Honestly, this looks awful to operate on.  Mandatory 8-bytes per entry
> with an optional 4 byte extension?

I expect it is; if we don't value memory efficiency then make it
simpler. A fixed 12 bytes means that the worst case is still only 24
bytes, so it isn't a degradation from scatterlist.

Unfortunately 16 bytes is a degradation.

My point is that the structure can hold what scatterlist holds, and we
can trade some CPU power to achieve memory compression. I don't know
what the right balance is, but it suggests to me that the idea of a
general flexible array to hold 64 bit addr/length intervals is a useful
generic data structure for this problem.

> > Well, I'm not comfortable with the idea above where RDMA would have to
> > take a memory penalty to use the new interface. To avoid that memory
> > penalty we need to get rid of scatterlist entirely.
> > 
> > If we do the 16 byte struct from the first email then a umem for MRs
> > will increase in memory consumption by 160% compared today's 24
> > bytes/page. I think the HPC workloads will veto this.
> 
> Huh?  We do 16 bytes per physically contiguous range.  Then, if your
> HPC workloads use an IOMMU that can map a virtually contiguous range
> into a single sg entry, it uses 24 bytes for the entire mapping.  It
> should shrink.

IOMMU is not common in those cases; it is slow.

So you end up with 16 bytes per entry, then another 24 bytes in the
entirely redundant scatter list. That is now 40 bytes/page for the
typical HPC case, and I can't see that being OK.

> > > I just want to delete page_link, offset and length from struct
> > > scatterlist.  Given the above sequence of calls, we're going to get
> > > sg lists that aren't chained.  They may have to be vmalloced, but
> > > they should be contiguous.
> > 
> > I don't understand that? Why would the SGL out of the iommu suddenly
> > not be chained?
> 
> Because it's being given a single set of ranges to map, instead of
> being given 512 pages at a time.

I still don't understand what you are describing here. I don't know of
any case where a struct scatterlist will be vmalloc'd rather than page
chained - we don't even support that??

> It would only be slow for degenerate cases where the pinned memory
> is fragmented and not contiguous.

Degenerate? This is the normal case today, isn't it? I think it is for
RDMA, the last time I looked. Even small allocations like < 64k were
fragmented...

> > IMHO, the scatterlist has to go away. The interface should be phyr
> > list in, phyr list out.
> 
> That's reproducing the bad decision of the scatterlist, only with
> a different encoding.  You end up with something like:
> 
> struct neoscat {
> 	dma_addr_t dma_addr;
> 	phys_addr_t phys_addr;
> 	size_t dma_len;
> 	size_t phys_len;
> };

This isn't what I mean at all!

I imagine a generic data structure that can hold an array of 64 bit
intervals.

The DMA map operation takes in this array holding CPU addresses,
allocates a new array, fills it with DMA addresses, and returns
that. The caller ends up with two arrays in two memory allocations.

No scatterlist required.

It undoes the bad design of scatterlist by keeping the CPU and DMA
lists in separate memory.
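
A minimal sketch of that shape (names invented - one shared interval
array plus thin typed wrappers, as also floated later in the thread):

	struct ival64 {
		u64 addr;
		u64 len;
	};

	struct ival64_array {
		unsigned int nr;
		struct ival64 *ivs;	/* one allocation, no chaining */
	};

	/* thin typed wrappers so the CPU and DMA lists share the same
	 * machinery without mixing up the address spaces */
	struct phyr_list { struct ival64_array a; };	/* phys_addr_t */
	struct dmar_list { struct ival64_array a; };	/* dma_addr_t */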

I just want to share the whole API that will have to exist to
reasonably support this flexible array of intervals data structure..

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 22:53               ` Jason Gunthorpe
@ 2022-01-11 22:57                 ` Logan Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Logan Gunthorpe @ 2022-01-11 22:57 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Wilcox
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Ming Lei, linux-block, netdev, linux-mm, linux-rdma, dri-devel,
	nvdimm



On 2022-01-11 3:53 p.m., Jason Gunthorpe wrote:
> I just want to share the whole API that will have to exist to
> reasonably support this flexible array of intervals data structure..

Is that really worth it? I feel like type safety justifies replicating a
bit of iteration and allocation infrastructure. Then there are no silly
mistakes of thinking one array is one thing when it is not.

Logan

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 22:09               ` Logan Gunthorpe
@ 2022-01-11 22:57                 ` Jason Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2022-01-11 22:57 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Matthew Wilcox, linux-kernel, Christoph Hellwig, Joao Martins,
	John Hubbard, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 03:09:13PM -0700, Logan Gunthorpe wrote:

> Either that, or we need a wrapper that allocates an appropriately
> sized SGL to pass to any dma_map implementation that doesn't support
> the new structures.

This is what I think we should do. If we start with RDMA then we can
motivate the 4 main server IOMMU drivers to get updated ASAP, then it
can acceptably start to spread to other users.

The passthrough path would have to be optimized from the start to
avoid the SGL.

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 22:57                 ` Logan Gunthorpe
@ 2022-01-11 23:02                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2022-01-11 23:02 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Matthew Wilcox, linux-kernel, Christoph Hellwig, Joao Martins,
	John Hubbard, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 03:57:07PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2022-01-11 3:53 p.m., Jason Gunthorpe wrote:
> > I just want to share the whole API that will have to exist to
> > reasonably support this flexible array of intervals data structure..
> 
> Is that really worth it? I feel like type safety justifies replicating a
> bit of iteration and allocation infrastructure. Then there's no silly
> mistakes of thinking one array is one thing when it is not.

If it is 'a bit' then sure, but I suspect doing a good job here will
take a lot of code.

Look at how big scatterlist is, for instance.

Maybe we could have a generic 64 bit interval array and then two type
wrappers that do dma and physaddr casting? IDK.

Not sure type safety of DMA vs CPU address is critical?

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 22:57                 ` Jason Gunthorpe
@ 2022-01-11 23:02                   ` Logan Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Logan Gunthorpe @ 2022-01-11 23:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, linux-kernel, Christoph Hellwig, Joao Martins,
	John Hubbard, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm



On 2022-01-11 3:57 p.m., Jason Gunthorpe wrote:
> On Tue, Jan 11, 2022 at 03:09:13PM -0700, Logan Gunthorpe wrote:
> 
>> Either that, or we need a wrapper that allocates an appropriately
>> sized SGL to pass to any dma_map implementation that doesn't support
>> the new structures.
> 
> This is what I think we should do. If we start with RDMA then we can
> motivate the 4 main server IOMMU drivers to get updated ASAP, then it
> can acceptably start to spread to other users.

I suspect the preferred path forward is for the IOMMU drivers that don't
use dma-iommu to be converted to use it. Then anything we do to
dma-iommu will be applicable to those IOMMU drivers. Better than
expecting them to implement a bunch of new functionality themselves.

Logan

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 23:02                   ` Jason Gunthorpe
@ 2022-01-11 23:08                     ` Logan Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Logan Gunthorpe @ 2022-01-11 23:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, linux-kernel, Christoph Hellwig, Joao Martins,
	John Hubbard, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm



On 2022-01-11 4:02 p.m., Jason Gunthorpe wrote:
> On Tue, Jan 11, 2022 at 03:57:07PM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2022-01-11 3:53 p.m., Jason Gunthorpe wrote:
>>> I just want to share the whole API that will have to exist to
>>> reasonably support this flexible array of intervals data structure..
>>
>> Is that really worth it? I feel like type safety justifies replicating a
>> bit of iteration and allocation infrastructure. Then there's no silly
>> mistakes of thinking one array is one thing when it is not.
> 
> If it is a 'a bit' then sure, but I suspect doing a good job here will
> be a lot of code here.
> 
> Look at how big scatterlist is, for instance.

Yeah, but scatterlist has a ton of cruft; numerous ways to allocate,
multiple iterators, developers using it in different ways, etc, etc.
It's a big mess. bvec.h is much smaller (though includes stuff that
wouldn't necessarily be appropriate here).

Also, some things apply to one but not the other. E.g. a memcpy to/from
function might make sense for a phy_range but makes no sense for a
dma_range.

> Maybe we could have a generic 64 bit interval arry and then two type
> wrappers that do dma and physaddr casting? IDK.
> 
> Not sure type safety of DMA vs CPU address is critical?

I would argue it is. A DMA address is not a CPU address and should not
be treated the same.

Logan

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 22:53               ` Jason Gunthorpe
@ 2022-01-12 18:37                 ` Matthew Wilcox
  -1 siblings, 0 replies; 62+ messages in thread
From: Matthew Wilcox @ 2022-01-12 18:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 06:53:06PM -0400, Jason Gunthorpe wrote:
> IOMMU is not common in those cases, it is slow.
> 
> So you end up with 16 bytes per entry then another 24 bytes in the
> entirely redundant scatter list. That is now 40 bytes/page for typical
> HPC case, and I can't see that being OK.

Ah, I didn't realise what case you wanted to optimise for.

So, how about this ...

Since you want to get to the same destination as I do (a
16-byte-per-entry dma_addr+dma_len struct), but need to get there sooner
than "make all sg users stop using it wrongly", let's introduce a
(hopefully temporary) "struct dma_range".

But let's go further than that (which only brings us to 32 bytes per
range).  For the systems you care about which use an identity mapping,
and have sizeof(dma_addr_t) == sizeof(phys_addr_t), we can simply
point the dma_range pointer to the same memory as the phyr.  We just
have to not free it too early.  That gets us down to 16 bytes per range,
a saving of 33%.
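
Something like this, say (entirely hypothetical names; the identity-map
test is a stand-in for whatever predicate the DMA layer would export):

	/* alias the two lists on identity mapped systems instead of
	 * allocating and filling a separate DMA list */
	if (dev_identity_mapped(dev) &&		/* stand-in predicate */
	    sizeof(dma_addr_t) == sizeof(phys_addr_t)) {
		umem->dma = (struct dma_range *)umem->phyrs;
		umem->dma_is_alias = true;	/* free only once */
	} else {
		umem->dma = dma_map_phyrs(dev, umem->phyrs, umem->nr);
		umem->dma_is_alias = false;
	}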


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-12 18:37                 ` Matthew Wilcox
@ 2022-01-12 19:08                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2022-01-12 19:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Christoph Hellwig, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Wed, Jan 12, 2022 at 06:37:03PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 11, 2022 at 06:53:06PM -0400, Jason Gunthorpe wrote:
> > IOMMU is not common in those cases, it is slow.
> > 
> > So you end up with 16 bytes per entry then another 24 bytes in the
> > entirely redundant scatter list. That is now 40 bytes/page for typical
> > HPC case, and I can't see that being OK.
> 
> Ah, I didn't realise what case you wanted to optimise for.

It is pretty common, even systems with the iommu turned on will run
the kernel drivers with an identity map due to the performance delta..

> Since you want to get to the same destination as I do (a
> 16-byte-per-entry dma_addr+dma_len struct), but need to get there sooner
> than "make all sg users stop using it wrongly", let's introduce a
> (hopefully temporary) "struct dma_range".
> 
> But let's go further than that (which only brings us to 32 bytes per
> range).  For the systems you care about which use an identity mapping,
> and have sizeof(dma_addr_t) == sizeof(phys_addr_t), we can simply
> point the dma_range pointer to the same memory as the phyr.  We just
> have to not free it too early.  That gets us down to 16 bytes per range,
> a saving of 33%.

Yes, that is more or less what I suggested.

I'm not sure I understand your "make all sg users stop using it
wrongly"

I suspect trying to change scatterlist is a tar pit.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-10 19:34 ` Matthew Wilcox
                   ` (3 preceding siblings ...)
  (?)
@ 2022-01-20 13:39 ` Christoph Hellwig
  -1 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2022-01-20 13:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Christoph Hellwig, Jason Gunthorpe, Joao Martins,
	John Hubbard, Logan Gunthorpe, Ming Lei, linux-block, netdev,
	linux-mm, linux-rdma, dri-devel, nvdimm

On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
> TLDR: I want to introduce a new data type:
> 
> struct phyr {
>         phys_addr_t addr;
>         size_t len;
> };
> 
> and use it to replace bio_vec as well as using it to replace the array
> of struct pages used by get_user_pages() and friends.

FYI, I've done a fair amount of work (some already mainline as in all
the helpers for biovec page access), some of it still waiting (switching
more users over to these helpers and cleaning up some other mess)
to move bio_vecs into a form like that.  The difference in my plan
is to have a u32 len, partly to allow for flags space on 64-bit, which
we might need to support things like P2P without dev_pagemap structures.

> Finally, it may be possible to stop using scatterlist to describe the
> input to the DMA-mapping operation.  We may be able to get struct
> scatterlist down to just dma_address and dma_length, with chaining
> handled through an enclosing struct.

Yes, I have some prototype code that takes a bio_vec as input and
returns an array of

struct dma_range {
	dma_addr_t	addr;
	u32		len;
}

I need to get back to it, and especially back to the question of
whether this needs the chaining support that the current scatterlist
has.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11  0:41   ` Jason Gunthorpe
                     ` (2 preceding siblings ...)
  (?)
@ 2022-01-20 13:56   ` Christoph Hellwig
  2022-01-20 15:27       ` Keith Busch
  -1 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2022-01-20 13:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, linux-kernel, Christoph Hellwig, Joao Martins,
	John Hubbard, Logan Gunthorpe, Ming Lei, linux-block, netdev,
	linux-mm, linux-rdma, dri-devel, nvdimm

On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
> > Finally, it may be possible to stop using scatterlist to describe the
> > input to the DMA-mapping operation.  We may be able to get struct
> > scatterlist down to just dma_address and dma_length, with chaining
> > handled through an enclosing struct.
> 
> Can you talk about this some more? IMHO one of the key properties of
> the scatterlist is that it can hold huge amounts of pages without
> having to do any kind of special allocation due to the chaining.
> 
> The same will be true of the phyr idea right?

No special allocations as in no vmalloc?  The chaining still has to
allocate memory using a mempool.

Anyway, to explain my idea which is very similar but not identical to
the one willy has:

 - on the input side to dma mapping the bio_vecs (or phyrs) are chained
   as bios or whatever the containing structure is.  These already exist
   and have infrastructure at least in the block layer.
 - on the output side I plan for two options:

	1) we have a sane IOMMU and everything will be coalesced into a
	   single dma_range.  This requires setting the block layer
	   merge boundary to match the IOMMU page size, but that is
	   a very good thing to do anyway.
	2) we have no IOMMU (or a weird one) and get one output dma_range
	   per input bio_vec.  We'd either have to support chaining or use
	   vmalloc or huge numbers of entries.

> If you limit to that scenario then we can be more optimal because
> things like byte granular offsets and size in the interior pages don't
> need to exist. Every interior chunk is always aligned to its order and
> we only need to record the order.

The block layer does need small offsets.  Direct I/O can often be
512 byte aligned, and some other passthrough commands can have even
smaller alignment, although I don't think we ever go below 4-byte
alignment anywhere in the block layer.

> IMHO storage density here is quite important, we end up having to keep
> this stuff around for a long time.

If we play these tricks it won't be general purpose enough to get rid
of the existing scatterlist usage.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 15:01       ` Jason Gunthorpe
  (?)
  (?)
@ 2022-01-20 14:00       ` Christoph Hellwig
  -1 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2022-01-20 14:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, linux-kernel, Christoph Hellwig, Joao Martins,
	John Hubbard, Logan Gunthorpe, Ming Lei, linux-block, netdev,
	linux-mm, linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 11:01:42AM -0400, Jason Gunthorpe wrote:
> Then why are we using get_user_phyr() at all if we are just storing it
> in a sg?

I think we need to stop calling the output of the phyr dma map
helper a sg.  Yes, a { dma_addr, len } tuple is scatter/gather I/O in its
purest form, but it will need a different name in Linux as the scatterlist
will have to stay around for a long time before all that mess is cleaned
up.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-12 18:37                 ` Matthew Wilcox
@ 2022-01-20 14:03                 ` Christoph Hellwig
  2022-01-20 17:17                     ` Jason Gunthorpe
  -1 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2022-01-20 14:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, linux-kernel, Christoph Hellwig, Joao Martins,
	John Hubbard, Logan Gunthorpe, Ming Lei, linux-block, netdev,
	linux-mm, linux-rdma, dri-devel, nvdimm

On Wed, Jan 12, 2022 at 06:37:03PM +0000, Matthew Wilcox wrote:
> But let's go further than that (which only brings us to 32 bytes per
> range).  For the systems you care about which use an identity mapping,
> and have sizeof(dma_addr_t) == sizeof(phys_addr_t), we can simply
> point the dma_range pointer to the same memory as the phyr.  We just
> have to not free it too early.  That gets us down to 16 bytes per range,
> a saving of 33%.

Even without an IOMMU the dma_addr_t can have offsets vs the actual
physical address.  Not on x86 except for a weirdo SOC, but just about
everywhere else.
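
(Illustration: even the dma-direct path funnels through phys_to_dma(),
which applies any per-device bus offset, so aliasing the phyr memory is
only safe when that translation happens to be the identity:)

#include <linux/dma-direct.h>

static bool dma_is_identity(struct device *dev, phys_addr_t phys)
{
        /* identity on typical x86; a constant offset on many SoCs */
        return phys_to_dma(dev, phys) == (dma_addr_t)phys;
}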

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11 20:26       ` Jason Gunthorpe
@ 2022-01-20 14:09       ` Christoph Hellwig
  -1 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2022-01-20 14:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, Matthew Wilcox, nvdimm, linux-rdma, John Hubbard,
	linux-kernel, dri-devel, Ming Lei, linux-block, linux-mm, netdev,
	Joao Martins, Logan Gunthorpe, Christoph Hellwig

On Tue, Jan 11, 2022 at 04:26:48PM -0400, Jason Gunthorpe wrote:
> What I did in RDMA was make an iterator rdma_umem_for_each_dma_block()
> 
> The driver passes in the page size it wants and the iterator breaks up
> the SGL into that size.
> 
> So, eg on a 16k page size system the SGL would be full of 16K stuff,
> but the driver only support 4k and so the iterator hands out 4 pages
> for each SGL entry.
> 
> All the drivers use this to build their DMA lists and tables, it works
> really well.

The block layer also has the equivalent functionality by setting the
virt_boundary value in the queue_limits.  This is needed for NVMe
PRPs and RDMA drivers.
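
For reference, this is roughly how drivers consume the RDMA iterator,
with SZ_4K standing in for whatever block size the device supports:

struct ib_block_iter biter;

rdma_umem_for_each_dma_block(umem, &biter, SZ_4K) {
        dma_addr_t addr = rdma_block_iter_dma_address(&biter);
        /* write addr into the device's DMA list / page table */
}

And the block layer analogue: NVMe for example sets
blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1) so that every
segment can be expressed as PRPs.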

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-11  8:17   ` John Hubbard
@ 2022-01-20 14:12   ` Christoph Hellwig
  2022-01-20 21:35       ` John Hubbard
  -1 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2022-01-20 14:12 UTC (permalink / raw)
  To: John Hubbard
  Cc: Matthew Wilcox, linux-kernel, Christoph Hellwig, Jason Gunthorpe,
	Joao Martins, Logan Gunthorpe, Ming Lei, linux-block, netdev,
	linux-mm, linux-rdma, dri-devel, nvdimm

On Tue, Jan 11, 2022 at 12:17:18AM -0800, John Hubbard wrote:
> Zooming in on the pinning aspect for a moment: last time I attempted to
> convert O_DIRECT callers from gup to pup, I recall wanting very much to
> record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
> or some non-FOLL_PIN method. Because at the end of the IO, it is not
> easy to disentangle which pages require put_page() and which require
> unpin_user_page*().

I don't think that is a problem.  Pinning only needs to happen for
ITER_IOVEC, and the only non-user page there is the ZERO_PAGE added
for padding, which can be special-cased.
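
(Sketch of that gating -- iov_iter_pin_pages() is a hypothetical
FOLL_PIN variant of iov_iter_get_pages(), which does exist, as does
iter_is_iovec():)

/* only user-backed iterators need pinning */
if (iter_is_iovec(iter))
        n = iov_iter_pin_pages(iter, pages, maxsize, maxpages, &off);
else
        n = iov_iter_get_pages(iter, pages, maxsize, maxpages, &off);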

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-20 13:56   ` Christoph Hellwig
@ 2022-01-20 15:27       ` Keith Busch
  0 siblings, 0 replies; 62+ messages in thread
From: Keith Busch @ 2022-01-20 15:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Matthew Wilcox, linux-kernel, Joao Martins,
	John Hubbard, Logan Gunthorpe, Ming Lei, linux-block, netdev,
	linux-mm, linux-rdma, dri-devel, nvdimm

On Thu, Jan 20, 2022 at 02:56:02PM +0100, Christoph Hellwig wrote:
>  - on the input side to dma mapping the bio_vecs (or phyrs) are chained
>    as bios or whatever the containing structure is.  These already exist
>    and have infrastructure at least in the block layer
>  - on the output side I plan for two options:
> 
> 	1) we have a sane IOMMU and everything will be coalesced into a
> 	   single dma_range.  This requires setting the block layer
> 	   merge boundary to match the IOMMU page size, but that is
> 	   a very good thing to do anyway.

It doesn't look like IOMMU page sizes are exported, or even necessarily
consistently sized on at least one arch (power).

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-20 15:27       ` Keith Busch
@ 2022-01-20 15:28       ` Christoph Hellwig
  -1 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2022-01-20 15:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jason Gunthorpe, Matthew Wilcox, linux-kernel,
	Joao Martins, John Hubbard, Logan Gunthorpe, Ming Lei,
	linux-block, netdev, linux-mm, linux-rdma, dri-devel, nvdimm

On Thu, Jan 20, 2022 at 07:27:36AM -0800, Keith Busch wrote:
> It doesn't look like IOMMU page sizes are exported, or even necessarily
> consistently sized on at least one arch (power).

At the DMA API layer dma_get_merge_boundary is the API for it.
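
Sketch of wiring it up, roughly what blk_queue_can_use_dma_map_merging()
already does for MMC:

unsigned long boundary = dma_get_merge_boundary(dev);

if (boundary)   /* 0 means the DMA layer cannot merge at all */
        blk_queue_virt_boundary(q, boundary);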

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-20 14:03                 ` Christoph Hellwig
@ 2022-01-20 17:17                     ` Jason Gunthorpe
  0 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2022-01-20 17:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-kernel, Joao Martins, John Hubbard,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On Thu, Jan 20, 2022 at 03:03:40PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 12, 2022 at 06:37:03PM +0000, Matthew Wilcox wrote:
> > But let's go further than that (which only brings us to 32 bytes per
> > range).  For the systems you care about which use an identity mapping,
> > and have sizeof(dma_addr_t) == sizeof(phys_addr_t), we can simply
> > point the dma_range pointer to the same memory as the phyr.  We just
> > have to not free it too early.  That gets us down to 16 bytes per range,
> > a saving of 33%.
> 
> Even without an IOMMU the dma_addr_t can have offsets vs the actual
> physical address.  Not on x86 except for a weirdo SOC, but just about
> everywhere else.

The point is that dma_map knows whether that is happening, and giving
dma_map the option to just return a pointer to the input memory for
re-use as the dma list does optimize important, widely used cases.

Yes, some weirdo SOC cannot do this optimization, but the weirdo SOC
will allocate new memory and return the adjusted dma_addr_t just
fine.

Ideally we should not pay a cost for weirdo SOC on sane systems.
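
Sketch of that contract, reusing the hypothetical dma_map_phyrs() from
earlier in the thread and assuming struct phyr and the dma range type
are layout compatible on such systems:

int nr_ranges;
struct dma_range *ranges = dma_map_phyrs(dev, phyrs, nr, &nr_ranges);

if (!ranges)
        return -ENOMEM;
/* ... program the device from ranges[0..nr_ranges) ... */
if (ranges != (struct dma_range *)phyrs)
        kfree(ranges);  /* only free what was actually allocated */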

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-20 15:27       ` Keith Busch
@ 2022-01-20 17:54       ` Robin Murphy
  -1 siblings, 0 replies; 62+ messages in thread
From: Robin Murphy @ 2022-01-20 17:54 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: nvdimm, linux-rdma, John Hubbard, linux-kernel, Matthew Wilcox,
	Ming Lei, linux-block, linux-mm, dri-devel, Jason Gunthorpe,
	netdev, Joao Martins, Logan Gunthorpe

On 2022-01-20 15:27, Keith Busch wrote:
> On Thu, Jan 20, 2022 at 02:56:02PM +0100, Christoph Hellwig wrote:
>>   - on the input side to dma mapping the bio_vecs (or phyrs) are chained
>>     as bios or whatever the containing structure is.  These already exist
>>     and have infrastructure at least in the block layer
>>   - on the output side I plan for two options:
>>
>> 	1) we have a sane IOMMU and everything will be coalesced into a
>> 	   single dma_range.  This requires setting the block layer
>> 	   merge boundary to match the IOMMU page size, but that is
>> 	   a very good thing to do anyway.
> 
> It doesn't look like IOMMU page sizes are exported, or even necessarily
> consistently sized on at least one arch (power).

FWIW POWER does its own thing separate from the IOMMU API. For all the 
regular IOMMU API players, page sizes are published in the iommu_domain 
that the common iommu-dma layer operates on. In fact it already uses 
them to pick chunk sizes for composing large buffer allocations.
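
Concretely (sketch, for a device attached to an IOMMU API domain):

struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

if (domain) {
        /* smallest page size the domain supports */
        size_t granule = 1UL << __ffs(domain->pgsize_bitmap);
        /* iommu-dma derives its merge boundary from exactly this */
}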

Robin.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Phyr Starter
  2022-01-20 14:12   ` Christoph Hellwig
@ 2022-01-20 21:35       ` John Hubbard
  0 siblings, 0 replies; 62+ messages in thread
From: John Hubbard @ 2022-01-20 21:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-kernel, Jason Gunthorpe, Joao Martins,
	Logan Gunthorpe, Ming Lei, linux-block, netdev, linux-mm,
	linux-rdma, dri-devel, nvdimm

On 1/20/22 6:12 AM, Christoph Hellwig wrote:
> On Tue, Jan 11, 2022 at 12:17:18AM -0800, John Hubbard wrote:
>> Zooming in on the pinning aspect for a moment: last time I attempted to
>> convert O_DIRECT callers from gup to pup, I recall wanting very much to
>> record, in each bio_vec, whether these pages were acquired via FOLL_PIN,
>> or some non-FOLL_PIN method. Because at the end of the IO, it is not
>> easy to disentangle which pages require put_page() and which require
>> unpin_user_page*().
> 
> I don't think that is a problem.  Pinning only needs to happen for
> ITER_IOVEC, and the only non-user page there is the ZERO_PAGE added
> for padding, which can be special-cased.

I am really glad to hear you say that. Because I just worked through it
again in detail yesterday (including your and others' old emails about
this), and tentatively reached the same conclusion from seeing the call 
paths. But I wanted to confirm with someone who actually knows this code 
well, and that's not me. :)

Things like dio_refill_pages() are mixing in the zero page, but like you 
say, that can be handled. I have a few ideas for that.
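
One of those ideas, sketched here with a made-up helper name:

static void dio_unpin_page(struct page *page)
{
        /* the padding page was never pinned; everything else was */
        if (page != ZERO_PAGE(0))
                unpin_user_page(page);
}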

Now that the goal is considerably narrower than in 2019
("convert DIO callers to pup" instead of "convert the world to pup",
ha), this looks quite feasible after all.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 62+ messages in thread
