netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [LSF/MM TOPIC] Generic page-pool recycle facility?
       [not found] <1460034425.20949.7.camel@HansenPartnership.com>
@ 2016-04-07 14:17 ` Jesper Dangaard Brouer
  2016-04-07 14:38   ` [Lsf-pc] " Christoph Hellwig
                     ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-07 14:17 UTC (permalink / raw)
  To: lsf, linux-mm
  Cc: brouer, James Bottomley, netdev, Tom Herbert, Alexei Starovoitov,
	Brenden Blanco, lsf-pc

(Topic proposal for MM-summit)

Network Interface Cards (NIC) drivers, and increasing speeds stress
the page-allocator (and DMA APIs).  A number of driver specific
open-coded approaches exists that work-around these bottlenecks in the
page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
allocating larger pages and handing-out page "fragments".

I'm proposing a generic page-pool recycle facility, that can cover the
driver use-cases, increase performance and open up for zero-copy RX.


The basic performance problem is that pages (containing packets at RX)
are cycled through the page allocator (freed at TX DMA completion
time).  While a system in a steady state, could avoid calling the page
allocator, when having a pool of pages equal to the size of the RX
ring plus the number of outstanding frames in the TX ring (waiting for
DMA completion).

The motivation for quick page recycling came primarily for performance
reasons.  But returning pages to the same pool also benefit other
use-cases.  If a NIC HW RX ring is strictly bound (e.g. to a process
or guest/KVM) then pages can be shared/mmap'ed (RX zero-copy) as
information leaking does not occur.  (Obviously for this use-case,
when adding pages into the pool these need to zero'ed out).


The motivation behind implemeting this (extremely fast page-pool) is
because we need it as a building block in the network stack, but
hopefully other areas could also benefit from this.


[Resources/Links]: It is specifically related to:

What Facebook calls XDP (eXpress Data Path)
 * https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
 * RFC patchset thread: http://thread.gmane.org/gmane.linux.network/406288

And what I call the "packet-page" level:
 * BoF on kernel network performance: http://lwn.net/Articles/676806/
 * http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html


See you soon at LFS/MM-summit :-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 14:17 ` [LSF/MM TOPIC] Generic page-pool recycle facility? Jesper Dangaard Brouer
@ 2016-04-07 14:38   ` Christoph Hellwig
  2016-04-07 15:11     ` [Lsf] " Bart Van Assche
  2016-04-07 15:48     ` Chuck Lever
  2016-04-07 15:18   ` Eric Dumazet
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 35+ messages in thread
From: Christoph Hellwig @ 2016-04-07 14:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf, linux-mm, netdev, Brenden Blanco, James Bottomley,
	Tom Herbert, lsf-pc, Alexei Starovoitov

This is also very interesting for storage targets, which face the same
issue.  SCST has a mode where it caches some fully constructed SGLs,
which is probably very similar to what NICs want to do.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 14:38   ` [Lsf-pc] " Christoph Hellwig
@ 2016-04-07 15:11     ` Bart Van Assche
  2016-04-10 18:45       ` Sagi Grimberg
  2016-04-07 15:48     ` Chuck Lever
  1 sibling, 1 reply; 35+ messages in thread
From: Bart Van Assche @ 2016-04-07 15:11 UTC (permalink / raw)
  To: Christoph Hellwig, Jesper Dangaard Brouer
  Cc: James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm,
	netdev, lsf-pc, Alexei Starovoitov

On 04/07/16 07:38, Christoph Hellwig wrote:
> This is also very interesting for storage targets, which face the same
> issue.  SCST has a mode where it caches some fully constructed SGLs,
> which is probably very similar to what NICs want to do.

I think a cached allocator for page sets + the scatterlists that 
describe these page sets would not only be useful for SCSI target 
implementations but also for the Linux SCSI initiator. Today the scsi-mq 
code reserves space in each scsi_cmnd for a scatterlist of 
SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page 
sets less memory would be needed per scsi_cmnd. See also 
scsi_mq_setup_tags() and scsi_alloc_sgtable().

Bart.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 14:17 ` [LSF/MM TOPIC] Generic page-pool recycle facility? Jesper Dangaard Brouer
  2016-04-07 14:38   ` [Lsf-pc] " Christoph Hellwig
@ 2016-04-07 15:18   ` Eric Dumazet
  2016-04-09  9:11     ` [Lsf] " Jesper Dangaard Brouer
  2016-04-07 19:48   ` Waskiewicz, PJ
  2016-04-11  8:58   ` [Lsf-pc] " Mel Gorman
  3 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2016-04-07 15:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf, linux-mm, James Bottomley, netdev, Tom Herbert,
	Alexei Starovoitov, Brenden Blanco, lsf-pc

On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> (Topic proposal for MM-summit)
> 
> Network Interface Cards (NIC) drivers, and increasing speeds stress
> the page-allocator (and DMA APIs).  A number of driver specific
> open-coded approaches exists that work-around these bottlenecks in the
> page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
> allocating larger pages and handing-out page "fragments".
> 
> I'm proposing a generic page-pool recycle facility, that can cover the
> driver use-cases, increase performance and open up for zero-copy RX.
> 
> 
> The basic performance problem is that pages (containing packets at RX)
> are cycled through the page allocator (freed at TX DMA completion
> time).  While a system in a steady state, could avoid calling the page
> allocator, when having a pool of pages equal to the size of the RX
> ring plus the number of outstanding frames in the TX ring (waiting for
> DMA completion).


We certainly used this at Google for quite a while.

The thing is : in steady state, the number of pages being 'in tx queues'
is lower than number of pages that were allocated for RX queues.

The page allocator is hardly hit, once you have big enough RX ring
buffers. (Nothing fancy, simply the default number of slots)

The 'hard coded´ code is quite small actually

if (page_count(page) != 1) {
    free the page and allocate another one, 
    since we are not the exclusive owner.
    Prefer __GFP_COLD pages btw.
}
page_ref_inc(page);

Problem of a 'pool' is that it matches a router workload, not host one.

With existing code, new pages are automatically allocated on demand, if
say previous pages are still used by skb stored in sockets receive
queues and consumers are slow to react to the presence of this data.

But in most cases (steady state), the refcount on the page is released
by the application reading the data before the driver cycled through the
RX ring buffer and drivers only increments the page count.

I also played with grouping pages into the same 2MB pages, but got mixed
results.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 14:38   ` [Lsf-pc] " Christoph Hellwig
  2016-04-07 15:11     ` [Lsf] " Bart Van Assche
@ 2016-04-07 15:48     ` Chuck Lever
  2016-04-07 16:14       ` [Lsf-pc] [Lsf] " Rik van Riel
  1 sibling, 1 reply; 35+ messages in thread
From: Chuck Lever @ 2016-04-07 15:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jesper Dangaard Brouer, James Bottomley, Tom Herbert,
	Brenden Blanco, lsf, linux-mm, netdev, lsf-pc,
	Alexei Starovoitov


> On Apr 7, 2016, at 7:38 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> This is also very interesting for storage targets, which face the same
> issue.  SCST has a mode where it caches some fully constructed SGLs,
> which is probably very similar to what NICs want to do.

+1 for NFS server.


--
Chuck Lever



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf-pc] [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 15:48     ` Chuck Lever
@ 2016-04-07 16:14       ` Rik van Riel
  2016-04-07 19:43         ` [Lsf] [Lsf-pc] " Jesper Dangaard Brouer
  0 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2016-04-07 16:14 UTC (permalink / raw)
  To: Chuck Lever, Christoph Hellwig
  Cc: lsf, Tom Herbert, Brenden Blanco, James Bottomley, linux-mm,
	netdev, Jesper Dangaard Brouer, lsf-pc, Alexei Starovoitov

[-- Attachment #1: Type: text/plain, Size: 655 bytes --]

On Thu, 2016-04-07 at 08:48 -0700, Chuck Lever wrote:
> > 
> > On Apr 7, 2016, at 7:38 AM, Christoph Hellwig <hch@infradead.org>
> > wrote:
> > 
> > This is also very interesting for storage targets, which face the
> > same
> > issue.  SCST has a mode where it caches some fully constructed
> > SGLs,
> > which is probably very similar to what NICs want to do.
> +1 for NFS server.

I have swapped around my slot (into the MM track)
with Jesper's slot (now a plenary session), since
there seems to be a fair amount of interest in
Jesper's proposal from IO and FS people, and my
topic is more MM specific.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 16:14       ` [Lsf-pc] [Lsf] " Rik van Riel
@ 2016-04-07 19:43         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-07 19:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Chuck Lever, Christoph Hellwig, James Bottomley, Tom Herbert,
	Brenden Blanco, lsf, linux-mm, netdev, lsf-pc,
	Alexei Starovoitov, brouer

[-- Attachment #1: Type: text/plain, Size: 1010 bytes --]


On Thu, 07 Apr 2016 12:14:00 -0400 Rik van Riel <riel@redhat.com> wrote:

> On Thu, 2016-04-07 at 08:48 -0700, Chuck Lever wrote:
> > > 
> > > On Apr 7, 2016, at 7:38 AM, Christoph Hellwig <hch@infradead.org>
> > > wrote:
> > > 
> > > This is also very interesting for storage targets, which face the
> > > same issue.  SCST has a mode where it caches some fully constructed
> > > SGLs, which is probably very similar to what NICs want to do.  
> >
> > +1 for NFS server.  
> 
> I have swapped around my slot (into the MM track)
> with Jesper's slot (now a plenary session), since
> there seems to be a fair amount of interest in
> Jesper's proposal from IO and FS people, and my
> topic is more MM specific.

Wow - I'm impressed. I didn't expect such a good slot!
Glad to see the interest!
Thanks!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 213 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 14:17 ` [LSF/MM TOPIC] Generic page-pool recycle facility? Jesper Dangaard Brouer
  2016-04-07 14:38   ` [Lsf-pc] " Christoph Hellwig
  2016-04-07 15:18   ` Eric Dumazet
@ 2016-04-07 19:48   ` Waskiewicz, PJ
  2016-04-07 20:38     ` Jesper Dangaard Brouer
  2016-04-11  8:58   ` [Lsf-pc] " Mel Gorman
  3 siblings, 1 reply; 35+ messages in thread
From: Waskiewicz, PJ @ 2016-04-07 19:48 UTC (permalink / raw)
  To: lsf, linux-mm, brouer
  Cc: netdev, bblanco, alexei.starovoitov, James.Bottomley, tom, lsf-pc

On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> (Topic proposal for MM-summit)
> 
> Network Interface Cards (NIC) drivers, and increasing speeds stress
> the page-allocator (and DMA APIs).  A number of driver specific
> open-coded approaches exists that work-around these bottlenecks in
> the
> page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
> allocating larger pages and handing-out page "fragments".
> 
> I'm proposing a generic page-pool recycle facility, that can cover
> the
> driver use-cases, increase performance and open up for zero-copy RX.

Is this based on the page recycle stuff from ixgbe that used to be in
the driver?  If so I'd really like to be part of the discussion.

-PJ


-- 
PJ Waskiewicz
Principal Engineer, NetApp
e: pj.waskiewicz@netapp.com
d: 503.961.3705

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 19:48   ` Waskiewicz, PJ
@ 2016-04-07 20:38     ` Jesper Dangaard Brouer
  2016-04-08 16:12       ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-07 20:38 UTC (permalink / raw)
  To: Waskiewicz, PJ
  Cc: lsf, linux-mm, netdev, bblanco, alexei.starovoitov,
	James.Bottomley, tom, lsf-pc, brouer

On Thu, 7 Apr 2016 19:48:50 +0000
"Waskiewicz, PJ" <PJ.Waskiewicz@netapp.com> wrote:

> On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> > (Topic proposal for MM-summit)
> > 
> > Network Interface Cards (NIC) drivers, and increasing speeds stress
> > the page-allocator (and DMA APIs).  A number of driver specific
> > open-coded approaches exists that work-around these bottlenecks in
> > the
> > page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
> > allocating larger pages and handing-out page "fragments".
> > 
> > I'm proposing a generic page-pool recycle facility, that can cover
> > the
> > driver use-cases, increase performance and open up for zero-copy RX.  
> 
> Is this based on the page recycle stuff from ixgbe that used to be in
> the driver?  If so I'd really like to be part of the discussion.

Okay, so it is not part of the driver any-longer?  I've studied the
current ixgbe driver (and other NIC drivers) closely.  Do you have some
code pointers, to this older code?

The likely-fastest recycle code I've see is in the bnx2x driver.  If
you are interested see: bnx2x_reuse_rx_data().  Again is it a bit
open-coded produce/consumer ring queue (which would be nice to also
cleanup).


To amortize the cost of allocating a single page, most other drivers
use the trick of allocating a larger (compound) page, and partition
this page into smaller "fragments".  Which also amortize the cost of
dma_map/unmap (important on non-x86).

This is actually problematic performance wise, because packet-data
(in these page fragments) only get DMA_sync'ed, and is thus considered
"read-only".  As netstack need to write packet headers, yet-another
(writable) memory area is allocated per packet (plus the SKB meta-data
struct).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 20:38     ` Jesper Dangaard Brouer
@ 2016-04-08 16:12       ` Alexander Duyck
  0 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2016-04-08 16:12 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Waskiewicz, PJ, lsf, linux-mm, netdev, bblanco,
	alexei.starovoitov, James.Bottomley@HansenPartnership.com, tom,
	lsf-pc

On Thu, Apr 7, 2016 at 1:38 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Thu, 7 Apr 2016 19:48:50 +0000
> "Waskiewicz, PJ" <PJ.Waskiewicz@netapp.com> wrote:
>
>> On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote:
>> > (Topic proposal for MM-summit)
>> >
>> > Network Interface Cards (NIC) drivers, and increasing speeds stress
>> > the page-allocator (and DMA APIs).  A number of driver specific
>> > open-coded approaches exists that work-around these bottlenecks in
>> > the
>> > page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
>> > allocating larger pages and handing-out page "fragments".
>> >
>> > I'm proposing a generic page-pool recycle facility, that can cover
>> > the
>> > driver use-cases, increase performance and open up for zero-copy RX.
>>
>> Is this based on the page recycle stuff from ixgbe that used to be in
>> the driver?  If so I'd really like to be part of the discussion.
>
> Okay, so it is not part of the driver any-longer?  I've studied the
> current ixgbe driver (and other NIC drivers) closely.  Do you have some
> code pointers, to this older code?

No, it is still in the driver.  I think when PJ said "used to" he was
referring to the fact that the code was present in the driver back
when he was working on it at Intel.

You have to realize that the page reuse code has been in the Intel
drivers for a long time.  I think I introduced it originally on igb in
July of 2008 as page recycling, commit bf36c1a0040c ("igb: add page
recycling support"), and it was copied over to ixgbe in September,
commit 762f4c571058 ("ixgbe: recycle pages in packet split mode").

> The likely-fastest recycle code I've see is in the bnx2x driver.  If
> you are interested see: bnx2x_reuse_rx_data().  Again is it a bit
> open-coded produce/consumer ring queue (which would be nice to also
> cleanup).

Yeah, that is essentially the same kind of code we have in
ixgbe_reuse_rx_page().  From what I can tell though the bnx2x doesn't
actually reuse the buffers in the common case.  That function is only
called in the copy-break and error cases to recycle the buffer so that
it doesn't have to be freed.

> To amortize the cost of allocating a single page, most other drivers
> use the trick of allocating a larger (compound) page, and partition
> this page into smaller "fragments".  Which also amortize the cost of
> dma_map/unmap (important on non-x86).

Right.  The only reason why I went the reuse route instead of the
compound page route is that I had speculated that you could still
bottleneck yourself since the issue I was trying to avoid was the
dma_map call hitting a global lock in IOMMU enabled systems.  With the
larger page route I could at best reduce the number of map calls to
1/16 or 1/32 of what it was.  By doing the page reuse I actually bring
it down to something approaching 0 as long as the buffers are being
freed in a reasonable timeframe.  This way the code would scale so I
wouldn't have to worry about how many rings were active at the same
time.

As PJ can attest we even saw bugs where the page reuse actually was
too effective in some cases leading to us carrying memory from one
node to another when the interrupt was migrated.  That was why we had
to add the code to force us to free the page if it came from another
node.

> This is actually problematic performance wise, because packet-data
> (in these page fragments) only get DMA_sync'ed, and is thus considered
> "read-only".  As netstack need to write packet headers, yet-another
> (writable) memory area is allocated per packet (plus the SKB meta-data
> struct).

Have you done any actual testing with build_skb recently that shows
how much of a gain there is to be had?  I'm just curious as I know I
saw a gain back in the day, but back when I ran that test we didn't
have things like napi_alloc_skb running around which should be a
pretty big win.  It might be useful to hack a driver such as ixgbe to
use build_skb and see if it is even worth the trouble to do it
properly.

Here is a patch I had generated back in 2013 to convert ixgbe over to
using build_skb, https://patchwork.ozlabs.org/patch/236044/.  You
might be able to updated to make it work against current ixgbe and
then could come back to us with data on what the actual gain is.  My
thought is the gain should have significantly decreased since back in
the day as we optimized napi_alloc_skb to the point where I think the
only real difference is probably the memcpy to pull the headers from
the page.

- Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 15:18   ` Eric Dumazet
@ 2016-04-09  9:11     ` Jesper Dangaard Brouer
  2016-04-09 12:34       ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-09  9:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm,
	netdev, lsf-pc, Alexei Starovoitov, brouer

Hi Eric,

On Thu, 07 Apr 2016 08:18:29 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote:
> > (Topic proposal for MM-summit)
> > 
> > Network Interface Cards (NIC) drivers, and increasing speeds stress
> > the page-allocator (and DMA APIs).  A number of driver specific
> > open-coded approaches exists that work-around these bottlenecks in the
> > page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
> > allocating larger pages and handing-out page "fragments".
> > 
> > I'm proposing a generic page-pool recycle facility, that can cover the
> > driver use-cases, increase performance and open up for zero-copy RX.
> > 
> > 
> > The basic performance problem is that pages (containing packets at RX)
> > are cycled through the page allocator (freed at TX DMA completion
> > time).  While a system in a steady state, could avoid calling the page
> > allocator, when having a pool of pages equal to the size of the RX
> > ring plus the number of outstanding frames in the TX ring (waiting for
> > DMA completion).  
> 
> 
> We certainly used this at Google for quite a while.
> 
> The thing is : in steady state, the number of pages being 'in tx queues'
> is lower than number of pages that were allocated for RX queues.

That was also my expectation, thanks for confirming my expectation.

> The page allocator is hardly hit, once you have big enough RX ring
> buffers. (Nothing fancy, simply the default number of slots)
> 
> The 'hard coded´ code is quite small actually
> 
> if (page_count(page) != 1) {
>     free the page and allocate another one, 
>     since we are not the exclusive owner.
>     Prefer __GFP_COLD pages btw.
> }
> page_ref_inc(page);

Above code is okay.  But do you think we also can get away with the same
trick we do with the SKB refcnf?  Where we avoid an atomic operation if
refcnt==1.

void kfree_skb(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return;
	if (likely(atomic_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!atomic_dec_and_test(&skb->users)))
		return;
	trace_kfree_skb(skb, __builtin_return_address(0));
	__kfree_skb(skb);
}
EXPORT_SYMBOL(kfree_skb);


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-09  9:11     ` [Lsf] " Jesper Dangaard Brouer
@ 2016-04-09 12:34       ` Eric Dumazet
  2016-04-11 20:23         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2016-04-09 12:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm,
	netdev, lsf-pc, Alexei Starovoitov

On Sat, 2016-04-09 at 11:11 +0200, Jesper Dangaard Brouer wrote:
> Hi Eric,


> Above code is okay.  But do you think we also can get away with the same
> trick we do with the SKB refcnf?  Where we avoid an atomic operation if
> refcnt==1.
> 
> void kfree_skb(struct sk_buff *skb)
> {
> 	if (unlikely(!skb))
> 		return;
> 	if (likely(atomic_read(&skb->users) == 1))
> 		smp_rmb();
> 	else if (likely(!atomic_dec_and_test(&skb->users)))
> 		return;
> 	trace_kfree_skb(skb, __builtin_return_address(0));
> 	__kfree_skb(skb);
> }
> EXPORT_SYMBOL(kfree_skb);

No we can not use this trick this for pages :

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ec91698360b3818ff426488a1529811f7a7ab87f






--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 15:11     ` [Lsf] " Bart Van Assche
@ 2016-04-10 18:45       ` Sagi Grimberg
  2016-04-11 21:41         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2016-04-10 18:45 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig, Jesper Dangaard Brouer
  Cc: lsf, Tom Herbert, Brenden Blanco, James Bottomley, linux-mm,
	netdev, lsf-pc, Alexei Starovoitov


>> This is also very interesting for storage targets, which face the same
>> issue.  SCST has a mode where it caches some fully constructed SGLs,
>> which is probably very similar to what NICs want to do.
>
> I think a cached allocator for page sets + the scatterlists that
> describe these page sets would not only be useful for SCSI target
> implementations but also for the Linux SCSI initiator. Today the scsi-mq
> code reserves space in each scsi_cmnd for a scatterlist of
> SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
> sets less memory would be needed per scsi_cmnd.

If we go down this road how about also attaching some driver opaques
to the page sets?

I know of some drivers that can make good use of those ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-07 14:17 ` [LSF/MM TOPIC] Generic page-pool recycle facility? Jesper Dangaard Brouer
                     ` (2 preceding siblings ...)
  2016-04-07 19:48   ` Waskiewicz, PJ
@ 2016-04-11  8:58   ` Mel Gorman
  2016-04-11 12:26     ` Jesper Dangaard Brouer
  3 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2016-04-11  8:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf, linux-mm, netdev, Brenden Blanco, James Bottomley,
	Tom Herbert, lsf-pc, Alexei Starovoitov

On Thu, Apr 07, 2016 at 04:17:15PM +0200, Jesper Dangaard Brouer wrote:
> (Topic proposal for MM-summit)
> 
> Network Interface Cards (NIC) drivers, and increasing speeds stress
> the page-allocator (and DMA APIs).  A number of driver specific
> open-coded approaches exists that work-around these bottlenecks in the
> page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
> allocating larger pages and handing-out page "fragments".
> 
> I'm proposing a generic page-pool recycle facility, that can cover the
> driver use-cases, increase performance and open up for zero-copy RX.
> 

Which bottleneck dominates -- the page allocator or the DMA API when
setting up coherent pages?

I'm wary of another page allocator API being introduced if it's for
performance reasons. In response to this thread, I spent two days on
a series that boosts performance of the allocator in the fast paths by
11-18% to illustrate that there was low-hanging fruit for optimising. If
the one-LRU-per-node series was applied on top, there would be a further
boost to performance on the allocation side. It could be further boosted
if debugging checks and statistic updates were conditionally disabled by
the caller.

The main reason another allocator concerns me is that those pages
are effectively pinned and cannot be reclaimed by the VM in low memory
situations. It ends up needing its own API for tuning the size and hoping
all the drivers get it right without causing OOM situations. It becomes
a slippery slope of introducing shrinkers, locking and complexity. Then
callers start getting concerned about NUMA locality and having to deal
with multiple lists to maintain performance. Ultimately, it ends up being
as slow as the page allocator and back to square 1 except now with more code.

If it's the DMA API that dominates then something may be required but it
should rely on the existing page allocator to alloc/free from. It would
also need something like drain_all_pages to force free everything in there
in low memory situations. Remember that multiple instances private to
drivers or tasks will require shrinker implementations and the complexity
may get unwieldly.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11  8:58   ` [Lsf-pc] " Mel Gorman
@ 2016-04-11 12:26     ` Jesper Dangaard Brouer
  2016-04-11 13:08       ` Mel Gorman
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-11 12:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: lsf, linux-mm, netdev, Brenden Blanco, James Bottomley,
	Tom Herbert, lsf-pc, Alexei Starovoitov, brouer

 
On Mon, 11 Apr 2016 09:58:19 +0100 Mel Gorman <mgorman@suse.de> wrote:

> On Thu, Apr 07, 2016 at 04:17:15PM +0200, Jesper Dangaard Brouer wrote:
> > (Topic proposal for MM-summit)
> > 
> > Network Interface Cards (NIC) drivers, and increasing speeds stress
> > the page-allocator (and DMA APIs).  A number of driver specific
> > open-coded approaches exists that work-around these bottlenecks in the
> > page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
> > allocating larger pages and handing-out page "fragments".
> > 
> > I'm proposing a generic page-pool recycle facility, that can cover the
> > driver use-cases, increase performance and open up for zero-copy RX.
> >   
> 
> Which bottleneck dominates -- the page allocator or the DMA API when
> setting up coherent pages?
>

It is actually both, but mostly DMA on non-x86 archs.  The need to
support multiple archs, then also cause a slowdown on x86, due to a
side-effect.

On arch's like PowerPC, the DMA API is the bottleneck.  To workaround
the cost of DMA calls, NIC driver alloc large order (compound) pages.
(dma_map compound page, handout page-fragments for RX ring, and later
dma_unmap when last RX page-fragments is seen).

The unfortunate side-effect is that these RX page-fragments (which
contain packet data) need to be considered 'read-only', because a
dma_unmap call can be destructive.  Network packets need to be
modified (minimum time-to-live).  Thus, netstack alloc new writable
memory, copy-over IP-headers, and adjust offset pointer into RX-page.
Avoiding the dma_unmap (AFAIK) will allow to make RX-pages writable.

Idea by page-pool is to recycling pages back to the originating
device, then we can avoid the need to call dma_unmap().  And only call
dma_map() when setting up pages.


> I'm wary of another page allocator API being introduced if it's for
> performance reasons. In response to this thread, I spent two days on
> a series that boosts performance of the allocator in the fast paths by
> 11-18% to illustrate that there was low-hanging fruit for optimising. If
> the one-LRU-per-node series was applied on top, there would be a further
> boost to performance on the allocation side. It could be further boosted
> if debugging checks and statistic updates were conditionally disabled by
> the caller.

It is always great if you can optimized the page allocator.  IMHO the
page allocator is too slow.  At least for my performance needs (67ns
per packet, approx 201 cycles at 3GHz).  I've measured[1]
alloc_pages(order=0) + __free_pages() to cost 277 cycles(tsc).

The trick described above, of allocating a higher order page and
handing out page-fragments, also workaround this page allocator
bottleneck (on x86).

I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.

I view this as a bulking trick... maybe the page allocator can just
give us a bulking API? ;-)


> The main reason another allocator concerns me is that those pages
> are effectively pinned and cannot be reclaimed by the VM in low memory
> situations. It ends up needing its own API for tuning the size and hoping
> all the drivers get it right without causing OOM situations. It becomes
> a slippery slope of introducing shrinkers, locking and complexity. Then
> callers start getting concerned about NUMA locality and having to deal
> with multiple lists to maintain performance. Ultimately, it ends up being
> as slow as the page allocator and back to square 1 except now with more code.

The pages assigned to the RX ring queue are pinned like today.  The
pages avail in the pool could easily be reclaimed.

I actually think we are better off providing a generic page pool
interface the drivers can use.  Instead of the situation where drivers
and subsystems invent their own, which does not cooperate in OOM
situations.

For the networking fast forwarding use-case (NOT localhost delivery),
then the page pool size would actually be limited at a fairly small
fixed size.  Packets will be hard dropped if exceeding this limit.
The idea is, you want to limit the maximum latency the system can
introduce then forwarding a packet, even in high overload situations.
There is a good argumentation in section 3.2. of Google's paper[2].
They limit the pool size to 3000 and calculate this can max introduce
300 micro-sec latency.


> If it's the DMA API that dominates then something may be required but it
> should rely on the existing page allocator to alloc/free from. It would
> also need something like drain_all_pages to force free everything in there
> in low memory situations. Remember that multiple instances private to
> drivers or tasks will require shrinker implementations and the complexity
> may get unwieldly.

I'll read up on the shrinker interface.


[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

[2] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44824.pdf

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 12:26     ` Jesper Dangaard Brouer
@ 2016-04-11 13:08       ` Mel Gorman
  2016-04-11 16:19         ` [Lsf] " Jesper Dangaard Brouer
  2016-04-11 16:20         ` Matthew Wilcox
  0 siblings, 2 replies; 35+ messages in thread
From: Mel Gorman @ 2016-04-11 13:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, lsf, linux-mm, netdev, Brenden Blanco,
	James Bottomley, Tom Herbert, lsf-pc, Alexei Starovoitov

On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > Which bottleneck dominates -- the page allocator or the DMA API when
> > setting up coherent pages?
> >
> 
> It is actually both, but mostly DMA on non-x86 archs.  The need to
> support multiple archs, then also cause a slowdown on x86, due to a
> side-effect.
> 
> On arch's like PowerPC, the DMA API is the bottleneck.  To workaround
> the cost of DMA calls, NIC driver alloc large order (compound) pages.
> (dma_map compound page, handout page-fragments for RX ring, and later
> dma_unmap when last RX page-fragments is seen).
> 

So, IMO only holding onto the DMA pages is all that is justified but not a
recycle of order-0 pages built on top of the core allocator. For DMA pages,
it would take a bit of legwork but the per-cpu allocator could be split
and converted to hold arbitrary sized pages with a constructer/destructor
to do the DMA coherency step when pages are taken from or handed back to
the core allocator. I'm not volunteering to do that unfortunately but I
estimate it'd be a few days work unless it needs to be per-CPU and NUMA
aware in which case the memory footprint will be high.

> > I'm wary of another page allocator API being introduced if it's for
> > performance reasons. In response to this thread, I spent two days on
> > a series that boosts performance of the allocator in the fast paths by
> > 11-18% to illustrate that there was low-hanging fruit for optimising. If
> > the one-LRU-per-node series was applied on top, there would be a further
> > boost to performance on the allocation side. It could be further boosted
> > if debugging checks and statistic updates were conditionally disabled by
> > the caller.
> 
> It is always great if you can optimized the page allocator.  IMHO the
> page allocator is too slow.

It's why I spent some time on it as any improvement in the allocator is
an unconditional win without requiring driver modifications.

> At least for my performance needs (67ns
> per packet, approx 201 cycles at 3GHz).  I've measured[1]
> alloc_pages(order=0) + __free_pages() to cost 277 cycles(tsc).
> 

It'd be worth retrying this with the branch

http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5

This is an unreleased series that contains both the page allocator
optimisations and the one-LRU-per-node series which in combination remove a
lot of code from the page allocator fast paths. I have no data on how the
combined series behaves but each series individually is known to improve
page allocator performance.

Once you have that, do a hackjob to remove the debugging checks from both the
alloc and free path and see what that leaves. They could be bypassed properly
with a __GFP_NOACCT flag used only by drivers that absolutely require pages
as quickly as possible and willing to be less safe to get that performance.

I expect then that the free path to be dominated by zone and pageblock
lookups which are much harder to remove. The zone lookup can be removed
if the caller knows exactly where the free pages need to go which is
unlikely. The pageblock lookup could be removed if it was coming from a
dedicated pool if the allocation side refills using pageblocks that are
always MIGRATE_UNMOVABLE.

> The trick described above, of allocating a higher order page and
> handing out page-fragments, also workaround this page allocator
> bottleneck (on x86).
> 

Be aware that compound order allocs like this are a double edged sword as
it'll be fast sometimes and other times require reclaim/compaction which
can stall for prolonged periods of time.

> I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
> cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
> page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
> cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
> the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.
> 
> I view this as a bulking trick... maybe the page allocator can just
> give us a bulking API? ;-)
> 

It could on the alloc side relatively easily using either a variation of
rmqueue_bulk exposed at a higher level populating a linked list (link via
page->lru) or an array supplied by the caller.  It's harder to bulk free
quickly as the pages being freed are not necessarily in the same pageblock
requiring lookups in the free path.

Tricky to get right, but preferable to a whole new allocator.

> > The main reason another allocator concerns me is that those pages
> > are effectively pinned and cannot be reclaimed by the VM in low memory
> > situations. It ends up needing its own API for tuning the size and hoping
> > all the drivers get it right without causing OOM situations. It becomes
> > a slippery slope of introducing shrinkers, locking and complexity. Then
> > callers start getting concerned about NUMA locality and having to deal
> > with multiple lists to maintain performance. Ultimately, it ends up being
> > as slow as the page allocator and back to square 1 except now with more code.
> 
> The pages assigned to the RX ring queue are pinned like today.  The
> pages avail in the pool could easily be reclaimed.
> 

How easy depends on how it's structured. If it's a global per-cpu list
then it's an IPI to all CPUs which is straight-forward to implement but
slow to execute. If it's per-driver then there needs to be a locked list
of all pools and locking on each individual pool which could offset some
of the performance benefit of using the pool in the first place.

> I actually think we are better off providing a generic page pool
> interface the drivers can use.  Instead of the situation where drivers
> and subsystems invent their own, which does not cooperate in OOM
> situations.
> 

If it's offsetting DMA setup/teardown then I'd be a bit happier. If it's
yet-another-page allocator to bypass the core allocator then I'm less happy.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 13:08       ` Mel Gorman
@ 2016-04-11 16:19         ` Jesper Dangaard Brouer
  2016-04-11 16:53           ` Eric Dumazet
  2016-04-11 18:07           ` Mel Gorman
  2016-04-11 16:20         ` Matthew Wilcox
  1 sibling, 2 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-11 16:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: James Bottomley, netdev, Brenden Blanco, lsf, linux-mm,
	Mel Gorman, Tom Herbert, lsf-pc, Alexei Starovoitov, brouer


On Mon, 11 Apr 2016 14:08:27 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
[...]
> > 
> > It is always great if you can optimized the page allocator.  IMHO the
> > page allocator is too slow.  
> 
> It's why I spent some time on it as any improvement in the allocator is
> an unconditional win without requiring driver modifications.
> 
> > At least for my performance needs (67ns
> > per packet, approx 201 cycles at 3GHz).  I've measured[1]
> > alloc_pages(order=0) + __free_pages() to cost 277 cycles(tsc).
> >   
> 
> It'd be worth retrying this with the branch
> 
> http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> 

The cost decreased to: 228 cycles(tsc), but there are some variations,
sometimes it increase to 238 cycles(tsc).

Nice, but there is still a looong way to my performance target, where I
can spend 201 cycles for the entire forwarding path....


> This is an unreleased series that contains both the page allocator
> optimisations and the one-LRU-per-node series which in combination remove a
> lot of code from the page allocator fast paths. I have no data on how the
> combined series behaves but each series individually is known to improve
> page allocator performance.
>
> Once you have that, do a hackjob to remove the debugging checks from both the
> alloc and free path and see what that leaves. They could be bypassed properly
> with a __GFP_NOACCT flag used only by drivers that absolutely require pages
> as quickly as possible and willing to be less safe to get that performance.

I would be interested in testing/benchmarking a patch where you remove
the debugging checks...

You are also welcome to try out my benchmarking modules yourself:
 https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst

This is really simple stuff (for rapid prototyping) I'm just doing:
 modprobe page_bench01; rmmod page_bench01 ; dmesg | tail -n40

[...]
> 
> Be aware that compound order allocs like this are a double edged sword as
> it'll be fast sometimes and other times require reclaim/compaction which
> can stall for prolonged periods of time.

Yes, I've notice that there can be a fairly high variation, when doing
compound order allocs, which is not so nice!  I really don't like these
variations....

Drivers also do tricks where they fallback to smaller order pages. E.g.
lookup function mlx4_alloc_pages().  I've tried to simulate that
function here:
https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69

It does not seem very optimal. I tried to mem pressure the system a bit
to cause the alloc_pages() to fail, and then the result were very bad,
something like 2500 cycles, and it usually got the next order pages.


> > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
> > cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
> > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
> > cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
> > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.

The order=3 cost were reduced to: 417 cycles(tsc), nice!  But I've also
seen it jump to 611 cycles.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 13:08       ` Mel Gorman
  2016-04-11 16:19         ` [Lsf] " Jesper Dangaard Brouer
@ 2016-04-11 16:20         ` Matthew Wilcox
  2016-04-11 17:46           ` Thadeu Lima de Souza Cascardo
  1 sibling, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2016-04-11 16:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jesper Dangaard Brouer, James Bottomley, netdev, Brenden Blanco,
	lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc,
	Alexei Starovoitov

On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote:
> On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > On arch's like PowerPC, the DMA API is the bottleneck.  To workaround
> > the cost of DMA calls, NIC driver alloc large order (compound) pages.
> > (dma_map compound page, handout page-fragments for RX ring, and later
> > dma_unmap when last RX page-fragments is seen).
> 
> So, IMO only holding onto the DMA pages is all that is justified but not a
> recycle of order-0 pages built on top of the core allocator. For DMA pages,
> it would take a bit of legwork but the per-cpu allocator could be split
> and converted to hold arbitrary sized pages with a constructer/destructor
> to do the DMA coherency step when pages are taken from or handed back to
> the core allocator. I'm not volunteering to do that unfortunately but I
> estimate it'd be a few days work unless it needs to be per-CPU and NUMA
> aware in which case the memory footprint will be high.

Have "we" tried to accelerate the DMA calls in PowerPC?  For example, it
could hold onto a cache of recently used mappings and recycle them if that
still works.  It trades off a bit of security (a device can continue to DMA
after the memory should no longer be accessible to it) for speed, but then
so does the per-driver hack of keeping pages around still mapped.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 16:19         ` [Lsf] " Jesper Dangaard Brouer
@ 2016-04-11 16:53           ` Eric Dumazet
  2016-04-11 19:47             ` Jesper Dangaard Brouer
  2016-04-11 18:07           ` Mel Gorman
  1 sibling, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2016-04-11 16:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, James Bottomley, netdev, Brenden Blanco, lsf,
	linux-mm, Mel Gorman, Tom Herbert, lsf-pc, Alexei Starovoitov

On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:

> Drivers also do tricks where they fallback to smaller order pages. E.g.
> lookup function mlx4_alloc_pages().  I've tried to simulate that
> function here:
> https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69

We use order-0 pages on mlx4 at Google, as order-3 pages are very
dangerous for some kind of attacks...

An out of order TCP packet can hold an order-3 pages, while claiming to
use 1.5 KBvia skb->truesize.

order-0 only pages allow the page recycle trick used by Intel driver,
and we hardly see any page allocations in typical workloads.

While order-3 pages are 'nice' for friendly datacenter kind of traffic,
they also are a higher risk on hosts connected to the wild Internet.

Maybe I should upstream this patch ;)




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 16:20         ` Matthew Wilcox
@ 2016-04-11 17:46           ` Thadeu Lima de Souza Cascardo
  2016-04-11 18:37             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 35+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2016-04-11 17:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mel Gorman, Jesper Dangaard Brouer, James Bottomley, netdev,
	Brenden Blanco, lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc,
	Alexei Starovoitov

On Mon, Apr 11, 2016 at 12:20:47PM -0400, Matthew Wilcox wrote:
> On Mon, Apr 11, 2016 at 02:08:27PM +0100, Mel Gorman wrote:
> > On Mon, Apr 11, 2016 at 02:26:39PM +0200, Jesper Dangaard Brouer wrote:
> > > On arch's like PowerPC, the DMA API is the bottleneck.  To workaround
> > > the cost of DMA calls, NIC driver alloc large order (compound) pages.
> > > (dma_map compound page, handout page-fragments for RX ring, and later
> > > dma_unmap when last RX page-fragments is seen).
> > 
> > So, IMO only holding onto the DMA pages is all that is justified but not a
> > recycle of order-0 pages built on top of the core allocator. For DMA pages,
> > it would take a bit of legwork but the per-cpu allocator could be split
> > and converted to hold arbitrary sized pages with a constructer/destructor
> > to do the DMA coherency step when pages are taken from or handed back to
> > the core allocator. I'm not volunteering to do that unfortunately but I
> > estimate it'd be a few days work unless it needs to be per-CPU and NUMA
> > aware in which case the memory footprint will be high.
> 
> Have "we" tried to accelerate the DMA calls in PowerPC?  For example, it
> could hold onto a cache of recently used mappings and recycle them if that
> still works.  It trades off a bit of security (a device can continue to DMA
> after the memory should no longer be accessible to it) for speed, but then
> so does the per-driver hack of keeping pages around still mapped.
> 

There are two problems on the DMA calls on Power servers. One is scalability. A
new allocation method for the address space would be necessary to fix it.

The other one is the latency or the cost of updating the TCE tables. The only
number I have is that I could push around 1M updates per second. So, we could
guess 1us per operation, which is pretty much a no-no for Jesper use case.

Your solution could address both. But I am concerned about the security problem.
Here is why I think this problem should be ignored if we go this way. IOMMU can
be used for three problems: virtualization, paranoia security and debuggability.

For virtualization, there is a solution already, and it's in place for Power and
x86. Power servers have the ability to enlarge the DMA window, allowing the
entire VM memory to be mapped during PCI driver probe time. After that, dma_map
is a simple sum and dma_unmap is a nop. x86 KVM maps the entire VM memory even
before booting the guest. Unless we want to fix this for old Power servers, I
see no point in fixing it.

Now, if you are using IOMMU on the host with no passthrough or linear system
memory mapping, you are paranoid. It's not just a matter of security, in fact.
It's also a matter of stability. Hardware, firmware and drivers can be buggy,
and they are. When I worked with drivers on Power servers, I found and fixed a
lot of driver bugs that caused the device to write to memory it was not supposed
to. Good thing is that IOMMU prevented that memory write to happen and the
driver would be reset by EEH. If we can make this scenario faster, and if we
want it to be the default we need to, then your solution might not be desired.
Otherwise, just turn your IOMMU off or put it into passthrough.

Now, the driver keeps pages mapped, but those pages belong to the driver. They
are not pages we decide to give to a userspace process because it's no longer in
use by the driver. So, I don't quite agree this would be a good tradeoff.
Certainly not if we can do it in a way that does not require this.

So, Jesper, please take into consideration that this pool design would rather be
per device. Otherwise, we allow some device to write into another's
device/driver memory.

Cascardo.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 16:19         ` [Lsf] " Jesper Dangaard Brouer
  2016-04-11 16:53           ` Eric Dumazet
@ 2016-04-11 18:07           ` Mel Gorman
  2016-04-11 19:26             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2016-04-11 18:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, James Bottomley, netdev, Brenden Blanco, lsf,
	linux-mm, Tom Herbert, lsf-pc, Alexei Starovoitov

On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote:
> > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> > 
> 
> The cost decreased to: 228 cycles(tsc), but there are some variations,
> sometimes it increase to 238 cycles(tsc).
> 

In the free path, a bulk pcp free adds to the cycles. In the alloc path,
a refill of the pcp lists costs quite a bit. Either option introduces
variances. The bulk free path can be optimised a little so I chucked
some additional patches at it that are not released yet but I suspect the
benefit will be marginal. The real heavy costs there are splitting/merging
buddies. Fixing that is much more fundamental but even fronting the allocator
with a new recycle allocator would not offset that as the refill of the
page-recycling thing would incur high costs.

> Nice, but there is still a looong way to my performance target, where I
> can spend 201 cycles for the entire forwarding path....
> 

While I accept the cost is still too high, I think the effort should still
be spent on improving the allocator in general than trying to bypass it.

> 
> > This is an unreleased series that contains both the page allocator
> > optimisations and the one-LRU-per-node series which in combination remove a
> > lot of code from the page allocator fast paths. I have no data on how the
> > combined series behaves but each series individually is known to improve
> > page allocator performance.
> >
> > Once you have that, do a hackjob to remove the debugging checks from both the
> > alloc and free path and see what that leaves. They could be bypassed properly
> > with a __GFP_NOACCT flag used only by drivers that absolutely require pages
> > as quickly as possible and willing to be less safe to get that performance.
> 
> I would be interested in testing/benchmarking a patch where you remove
> the debugging checks...
> 

Right now, I'm not proposing to remove the debugging checks despite their
cost. They catch really difficult problems in the field unfortunately
including corruption from buggy hardware. A GFP flag that disables them
for a very specific case would be ok but I expect it to be resisted by
others if it's done for the general case. Even a static branch for runtime
debugging checks may be resisted.

Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on
the grounds it is of questionable value. Applying that would free a flag
for __GFP_NOACCT that bypasses debugging checks and statistic updates.
That would work for the allocation side at least but doing the same for
the free side would be hard (potentially impossible) to do transparently
for drivers.

> You are also welcome to try out my benchmarking modules yourself:
>  https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst
> 

I took a quick look and functionally it's similar to the systemtap-based
microbenchmark I'm using in mmtests so I don't think we have a problem
with reproduction at the moment.

> > Be aware that compound order allocs like this are a double edged sword as
> > it'll be fast sometimes and other times require reclaim/compaction which
> > can stall for prolonged periods of time.
> 
> Yes, I've notice that there can be a fairly high variation, when doing
> compound order allocs, which is not so nice!  I really don't like these
> variations....
> 

They can cripple you which is why I'm very wary of performance patches that
require compound pages. It tends to look great only on benchmarks and then
the corner cases hit in the real world and the bug reports are unpleasant.

> Drivers also do tricks where they fallback to smaller order pages. E.g.
> lookup function mlx4_alloc_pages().  I've tried to simulate that
> function here:
> https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69
> 
> It does not seem very optimal. I tried to mem pressure the system a bit
> to cause the alloc_pages() to fail, and then the result were very bad,
> something like 2500 cycles, and it usually got the next order pages.

The options for fallback tend to have one hazard after the next. It's
partially why the last series focused on order-0 pages only.

> > > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
> > > cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
> > > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
> > > cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
> > > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.
> 
> The order=3 cost were reduced to: 417 cycles(tsc), nice!  But I've also
> seen it jump to 611 cycles.
> 

The corner cases can be minimised to some extent -- lazy buddy merging for
example but it unfortunately has other consequences for users that require
high-order pages for functional reasons. I tried something like that once
(http://thread.gmane.org/gmane.linux.kernel/807683) but didn't pursue it
to the end as it was a small part of the problem I was dealing with at the
time. It shouldn't be ruled out but it should be considered a last resort.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 17:46           ` Thadeu Lima de Souza Cascardo
@ 2016-04-11 18:37             ` Jesper Dangaard Brouer
  2016-04-11 18:53               ` Bart Van Assche
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-11 18:37 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: Matthew Wilcox, Mel Gorman, James Bottomley, netdev,
	Brenden Blanco, lsf, linux-mm, Mel Gorman, Tom Herbert, lsf-pc,
	Alexei Starovoitov, brouer

On Mon, 11 Apr 2016 14:46:25 -0300
Thadeu Lima de Souza Cascardo <cascardo@redhat.com> wrote:

> So, Jesper, please take into consideration that this pool design
> would rather be per device. Otherwise, we allow some device to write
> into another's device/driver memory.

Yes, that was my intended use.  I want to have a page-pool per device.
I actually, want to go as far as a page-pool per NIC HW RX-ring queue.

Because the other use-case for the page-pool is zero-copy RX.

The NIC HW trick is that we today can create a HW filter in the NIC
(via ethtool) and place that traffic into a separate RX queue in the
NIC.  Lets say matching NFS traffic or guest traffic. Then we can allow
RX zero-copy of these pages, into the application/guest, somehow
binding it to RX queue, e.g. introducing a "cross-domain-id" in the
page-pool page that need to match.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 18:37             ` Jesper Dangaard Brouer
@ 2016-04-11 18:53               ` Bart Van Assche
  0 siblings, 0 replies; 35+ messages in thread
From: Bart Van Assche @ 2016-04-11 18:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Thadeu Lima de Souza Cascardo
  Cc: lsf, netdev, Brenden Blanco, James Bottomley, linux-mm,
	Mel Gorman, Tom Herbert, Matthew Wilcox, lsf-pc, Mel Gorman,
	Alexei Starovoitov

On 04/11/2016 11:37 AM, Jesper Dangaard Brouer wrote:
> On Mon, 11 Apr 2016 14:46:25 -0300
> Thadeu Lima de Souza Cascardo <cascardo@redhat.com> wrote:
>
>> So, Jesper, please take into consideration that this pool design
>> would rather be per device. Otherwise, we allow some device to write
>> into another's device/driver memory.
>
> Yes, that was my intended use.  I want to have a page-pool per device.
> I actually, want to go as far as a page-pool per NIC HW RX-ring queue.
>
> Because the other use-case for the page-pool is zero-copy RX.
>
> The NIC HW trick is that we today can create a HW filter in the NIC
> (via ethtool) and place that traffic into a separate RX queue in the
> NIC.  Lets say matching NFS traffic or guest traffic. Then we can allow
> RX zero-copy of these pages, into the application/guest, somehow
> binding it to RX queue, e.g. introducing a "cross-domain-id" in the
> page-pool page that need to match.

I think it is important to keep in mind that using a page pool for 
zero-copy RX is specific to protocols that are based on TCP/IP. 
Protocols like FC, SRP and iSER have been designed such that the side 
that allocates the buffers also initiates the data transfer (the target 
side). With TCP/IP however transferring data and allocating receive 
buffers happens by opposite sides of the connection.

Bart.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 18:07           ` Mel Gorman
@ 2016-04-11 19:26             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-11 19:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mel Gorman, James Bottomley, netdev, Brenden Blanco, lsf,
	linux-mm, Tom Herbert, lsf-pc, Alexei Starovoitov, brouer

On Mon, 11 Apr 2016 19:07:03 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Mon, Apr 11, 2016 at 06:19:07PM +0200, Jesper Dangaard Brouer wrote:
> > > http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git/log/?h=mm-vmscan-node-lru-v4r5
> > >   
> > 
> > The cost decreased to: 228 cycles(tsc), but there are some variations,
> > sometimes it increase to 238 cycles(tsc).
> >   
> 
> In the free path, a bulk pcp free adds to the cycles. In the alloc path,
> a refill of the pcp lists costs quite a bit. Either option introduces
> variances. The bulk free path can be optimised a little so I chucked
> some additional patches at it that are not released yet but I suspect the
> benefit will be marginal. The real heavy costs there are splitting/merging
> buddies. Fixing that is much more fundamental but even fronting the allocator
> with a new recycle allocator would not offset that as the refill of the
> page-recycling thing would incur high costs.
>

Yes, re-filling page-pool (in the non-steady state) could be
problematic for performance.  That is why I'm very motivated in helping
out with a bulk alloc/free scheme for the page allocator.

 
> > Nice, but there is still a looong way to my performance target, where I
> > can spend 201 cycles for the entire forwarding path....
> >   
> 
> While I accept the cost is still too high, I think the effort should still
> be spent on improving the allocator in general than trying to bypass it.
> 

I do think improving the page allocator is very important work.
I just don't see how we can ever reach my performance target, without a
page-pool recycle facility.

I work in the area, where I think the cost of a single atomic operation
is too high.  I work on amortizing the individual atomic operations.
That is what I did for the SLUB allocator, with the bulk API. see:

Commit d0ecd894e3d5 ("slub: optimize bulk slowpath free by detached freelist")
 https://git.kernel.org/torvalds/c/d0ecd894e3d5

Commit fbd02630c6e3 ("slub: initial bulk free implementation")
 https://git.kernel.org/torvalds/c/fbd02630c6e3
 
This is now also used in the network stack:
 Commit 3134b9f019f2 ("Merge branch 'net-mitigate-kmem_free-slowpath'")
 Commit a3a8749d34d8 ("ixgbe: bulk free SKBs during TX completion cleanup cycle")


> > > This is an unreleased series that contains both the page allocator
> > > optimisations and the one-LRU-per-node series which in combination remove a
> > > lot of code from the page allocator fast paths. I have no data on how the
> > > combined series behaves but each series individually is known to improve
> > > page allocator performance.
> > >
> > > Once you have that, do a hackjob to remove the debugging checks from both the
> > > alloc and free path and see what that leaves. They could be bypassed properly
> > > with a __GFP_NOACCT flag used only by drivers that absolutely require pages
> > > as quickly as possible and willing to be less safe to get that performance.  
> > 
> > I would be interested in testing/benchmarking a patch where you remove
> > the debugging checks...
> >   
> 
> Right now, I'm not proposing to remove the debugging checks despite their
> cost. They catch really difficult problems in the field unfortunately
> including corruption from buggy hardware. A GFP flag that disables them
> for a very specific case would be ok but I expect it to be resisted by
> others if it's done for the general case. Even a static branch for runtime
> debugging checks may be resisted.
> 
> Even if GFP flags are tight, I have a patch that deletes __GFP_COLD on
> the grounds it is of questionable value. Applying that would free a flag
> for __GFP_NOACCT that bypasses debugging checks and statistic updates.
> That would work for the allocation side at least but doing the same for
> the free side would be hard (potentially impossible) to do transparently
> for drivers.

Before spending too much work on something, I usually try to determine
what the maximum benefit of something would be.  Thus, I propose you
create a patch that hack remove all the debug checks that you think
could be beneficial to remove.  And then benchmark it yourself or send
it to me for benchmarking... that is the quickest way to determine if
this is worth spending time on.


 
> > You are also welcome to try out my benchmarking modules yourself:
> >  https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst
> >   
> 
> I took a quick look and functionally it's similar to the systemtap-based
> microbenchmark I'm using in mmtests so I don't think we have a problem
> with reproduction at the moment.
> 
> > > Be aware that compound order allocs like this are a double edged sword as
> > > it'll be fast sometimes and other times require reclaim/compaction which
> > > can stall for prolonged periods of time.  
> > 
> > Yes, I've notice that there can be a fairly high variation, when doing
> > compound order allocs, which is not so nice!  I really don't like these
> > variations....
> >   
> 
> They can cripple you which is why I'm very wary of performance patches that
> require compound pages. It tends to look great only on benchmarks and then
> the corner cases hit in the real world and the bug reports are unpleasant.

That confirms Eric's experience at Google, where they disabled this
compound order page feature in the driver...


> > Drivers also do tricks where they fallback to smaller order pages. E.g.
> > lookup function mlx4_alloc_pages().  I've tried to simulate that
> > function here:
> > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69
> > 
> > It does not seem very optimal. I tried to mem pressure the system a bit
> > to cause the alloc_pages() to fail, and then the result were very bad,
> > something like 2500 cycles, and it usually got the next order pages.  
> 
> The options for fallback tend to have one hazard after the next. It's
> partially why the last series focused on order-0 pages only.

Other places in the network stack, this falling down through the
order's got removed, and replaced with a single "falldown" to order-0
pages. (due to people reporting bad experiences of latency spikes)

 
> > > > I've measured order 3 (32KB) alloc_pages(order=3) + __free_pages() to
> > > > cost approx 500 cycles(tsc).  That was more expensive, BUT an order=3
> > > > page 32Kb correspond to 8 pages (32768/4096), thus 500/8 = 62.5
> > > > cycles.  Usually a network RX-frame only need to be 2048 bytes, thus
> > > > the "bulk" effect speed up is x16 (32768/2048), thus 31.25 cycles.  
> > 
> > The order=3 cost were reduced to: 417 cycles(tsc), nice!  But I've also
> > seen it jump to 611 cycles.
> >   
> 
> The corner cases can be minimised to some extent -- lazy buddy merging for
> example but it unfortunately has other consequences for users that require
> high-order pages for functional reasons. I tried something like that once
> (http://thread.gmane.org/gmane.linux.kernel/807683) but didn't pursue it
> to the end as it was a small part of the problem I was dealing with at the
> time. It shouldn't be ruled out but it should be considered a last resort.
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 16:53           ` Eric Dumazet
@ 2016-04-11 19:47             ` Jesper Dangaard Brouer
  2016-04-11 21:14               ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-11 19:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: lsf, netdev, Brenden Blanco, James Bottomley, linux-mm,
	Mel Gorman, Tom Herbert, lsf-pc, Mel Gorman, Alexei Starovoitov,
	brouer, Alexander Duyck, Waskiewicz, PJ

On Mon, 11 Apr 2016 09:53:54 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> 
> > Drivers also do tricks where they fallback to smaller order pages. E.g.
> > lookup function mlx4_alloc_pages().  I've tried to simulate that
> > function here:
> > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69  
> 
> We use order-0 pages on mlx4 at Google, as order-3 pages are very
> dangerous for some kind of attacks...

Interesting!

> An out of order TCP packet can hold an order-3 pages, while claiming to
> use 1.5 KBvia skb->truesize.
> 
> order-0 only pages allow the page recycle trick used by Intel driver,
> and we hardly see any page allocations in typical workloads.

Yes, I looked at the Intel ixgbe drivers page recycle trick. 

It is actually quite cool, but code wise it is a little hard to
follow.  I started to look at the variant in i40e, specifically
function i40e_clean_rx_irq_ps() explains it a bit more explicit.
 

> While order-3 pages are 'nice' for friendly datacenter kind of
> traffic, they also are a higher risk on hosts connected to the wild
> Internet.
> 
> Maybe I should upstream this patch ;)

Definitely!

Does this patch also include a page recycle trick?  Else how do you get
around the cost of allocating a single order-0 page?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-09 12:34       ` Eric Dumazet
@ 2016-04-11 20:23         ` Jesper Dangaard Brouer
  2016-04-11 21:27           ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-11 20:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: lsf, Tom Herbert, Brenden Blanco, James Bottomley, linux-mm,
	netdev, lsf-pc, Alexei Starovoitov, brouer


On Sat, 09 Apr 2016 05:34:38 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Sat, 2016-04-09 at 11:11 +0200, Jesper Dangaard Brouer wrote:
> 
> > Above code is okay.  But do you think we also can get away with the same
> > trick we do with the SKB refcnf?  Where we avoid an atomic operation if
> > refcnt==1.
> > 
> > void kfree_skb(struct sk_buff *skb)
> > {
> > 	if (unlikely(!skb))
> > 		return;
> > 	if (likely(atomic_read(&skb->users) == 1))
> > 		smp_rmb();
> > 	else if (likely(!atomic_dec_and_test(&skb->users)))
> > 		return;
> > 	trace_kfree_skb(skb, __builtin_return_address(0));
> > 	__kfree_skb(skb);
> > }
> > EXPORT_SYMBOL(kfree_skb);  
> 
> No we can not use this trick this for pages :
> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ec91698360b3818ff426488a1529811f7a7ab87f
> 

If we have a page-pool recycle facility, then we could use the trick,
right? (As we know that get_page_unless_zero() cannot happen for pages
in the pool).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 19:47             ` Jesper Dangaard Brouer
@ 2016-04-11 21:14               ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2016-04-11 21:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf, netdev, Brenden Blanco, James Bottomley, linux-mm,
	Mel Gorman, Tom Herbert, lsf-pc, Mel Gorman, Alexei Starovoitov,
	Alexander Duyck, Waskiewicz, PJ

On Mon, 2016-04-11 at 21:47 +0200, Jesper Dangaard Brouer wrote:
> On Mon, 11 Apr 2016 09:53:54 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Mon, 2016-04-11 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> > 
> > > Drivers also do tricks where they fallback to smaller order pages. E.g.
> > > lookup function mlx4_alloc_pages().  I've tried to simulate that
> > > function here:
> > > https://github.com/netoptimizer/prototype-kernel/blob/91d323fc53/kernel/mm/bench/page_bench01.c#L69  
> > 
> > We use order-0 pages on mlx4 at Google, as order-3 pages are very
> > dangerous for some kind of attacks...
> 
> Interesting!
> 
> > An out of order TCP packet can hold an order-3 pages, while claiming to
> > use 1.5 KBvia skb->truesize.
> > 
> > order-0 only pages allow the page recycle trick used by Intel driver,
> > and we hardly see any page allocations in typical workloads.
> 
> Yes, I looked at the Intel ixgbe drivers page recycle trick. 
> 
> It is actually quite cool, but code wise it is a little hard to
> follow.  I started to look at the variant in i40e, specifically
> function i40e_clean_rx_irq_ps() explains it a bit more explicit.
>  
> 
> > While order-3 pages are 'nice' for friendly datacenter kind of
> > traffic, they also are a higher risk on hosts connected to the wild
> > Internet.
> > 
> > Maybe I should upstream this patch ;)
> 
> Definitely!
> 
> Does this patch also include a page recycle trick?  Else how do you get
> around the cost of allocating a single order-0 page?
> 

Yes, we use the page recycle trick.

Obviously not on powerpc (or any arch with PAGE_SIZE >= 8192), but
definitely on x86.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 20:23         ` Jesper Dangaard Brouer
@ 2016-04-11 21:27           ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2016-04-11 21:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf, Tom Herbert, Brenden Blanco, James Bottomley, linux-mm,
	netdev, lsf-pc, Alexei Starovoitov

On Mon, 2016-04-11 at 22:23 +0200, Jesper Dangaard Brouer wrote:

> If we have a page-pool recycle facility, then we could use the trick,
> right? (As we know that get_page_unless_zero() cannot happen for pages
> in the pool).

Well, if you disable everything that possibly use
get_page_unless_zero(), I guess this could work.

But then, you'll have to spy lkml traffic forever to make sure no new
feature is added in the kernel, using this get_page_unless_zero() in a
new clever way.

You could use a page flag so that z BUG() triggers if
get_page_unless_zero() is attempted on one of your precious pages ;)\

We had very subtle issues before my fixes (check
35b7a1915aa33da812074744647db0d9262a555c and children), so I would not
waste time on the lock prefix avoidance at this point.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-10 18:45       ` Sagi Grimberg
@ 2016-04-11 21:41         ` Jesper Dangaard Brouer
  2016-04-11 22:02           ` Alexander Duyck
  2016-04-11 22:21           ` Alexei Starovoitov
  0 siblings, 2 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-11 21:41 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, Christoph Hellwig, James Bottomley, Tom Herbert,
	Brenden Blanco, lsf, linux-mm, netdev, lsf-pc,
	Alexei Starovoitov, brouer


On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:

> >> This is also very interesting for storage targets, which face the same
> >> issue.  SCST has a mode where it caches some fully constructed SGLs,
> >> which is probably very similar to what NICs want to do.  
> >
> > I think a cached allocator for page sets + the scatterlists that
> > describe these page sets would not only be useful for SCSI target
> > implementations but also for the Linux SCSI initiator. Today the scsi-mq
> > code reserves space in each scsi_cmnd for a scatterlist of
> > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
> > sets less memory would be needed per scsi_cmnd.  
> 
> If we go down this road how about also attaching some driver opaques
> to the page sets?

That was the ultimate plan... to leave some opaques bytes left in the
page struct that drivers could use.

In struct page I would need a pointer back to my page_pool struct and a
page flag.  Then, I would need room to store the dma_unmap address.
(And then some of the usual fields are still needed, like the refcnt,
and reusing some of the list constructs).  And a zero-copy cross-domain
id.


For my packet-page idea, I would need a packet length and an offset
where data starts (I can derive the "head-room" for encap from these
two).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 21:41         ` Jesper Dangaard Brouer
@ 2016-04-11 22:02           ` Alexander Duyck
  2016-04-12  6:28             ` Jesper Dangaard Brouer
  2016-04-11 22:21           ` Alexei Starovoitov
  1 sibling, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2016-04-11 22:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Sagi Grimberg, Bart Van Assche, Christoph Hellwig,
	James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm,
	netdev, lsf-pc, Alexei Starovoitov

On Mon, Apr 11, 2016 at 2:41 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> >> This is also very interesting for storage targets, which face the same
>> >> issue.  SCST has a mode where it caches some fully constructed SGLs,
>> >> which is probably very similar to what NICs want to do.
>> >
>> > I think a cached allocator for page sets + the scatterlists that
>> > describe these page sets would not only be useful for SCSI target
>> > implementations but also for the Linux SCSI initiator. Today the scsi-mq
>> > code reserves space in each scsi_cmnd for a scatterlist of
>> > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
>> > sets less memory would be needed per scsi_cmnd.
>>
>> If we go down this road how about also attaching some driver opaques
>> to the page sets?
>
> That was the ultimate plan... to leave some opaques bytes left in the
> page struct that drivers could use.
>
> In struct page I would need a pointer back to my page_pool struct and a
> page flag.  Then, I would need room to store the dma_unmap address.
> (And then some of the usual fields are still needed, like the refcnt,
> and reusing some of the list constructs).  And a zero-copy cross-domain
> id.
>
>
> For my packet-page idea, I would need a packet length and an offset
> where data starts (I can derive the "head-room" for encap from these
> two).

Have you taken a look at possibly trying to optimize the DMA pool API
to work with pages?  It sounds like it is supposed to do something
similar to what you are wanting to do.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 21:41         ` Jesper Dangaard Brouer
  2016-04-11 22:02           ` Alexander Duyck
@ 2016-04-11 22:21           ` Alexei Starovoitov
  2016-04-12  6:16             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2016-04-11 22:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Sagi Grimberg, Bart Van Assche, Christoph Hellwig,
	James Bottomley, Tom Herbert, Brenden Blanco, lsf, linux-mm,
	netdev, lsf-pc

On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> > >> This is also very interesting for storage targets, which face the same
> > >> issue.  SCST has a mode where it caches some fully constructed SGLs,
> > >> which is probably very similar to what NICs want to do.  
> > >
> > > I think a cached allocator for page sets + the scatterlists that
> > > describe these page sets would not only be useful for SCSI target
> > > implementations but also for the Linux SCSI initiator. Today the scsi-mq
> > > code reserves space in each scsi_cmnd for a scatterlist of
> > > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
> > > sets less memory would be needed per scsi_cmnd.  
> > 
> > If we go down this road how about also attaching some driver opaques
> > to the page sets?
> 
> That was the ultimate plan... to leave some opaques bytes left in the
> page struct that drivers could use.
> 
> In struct page I would need a pointer back to my page_pool struct and a
> page flag.  Then, I would need room to store the dma_unmap address.
> (And then some of the usual fields are still needed, like the refcnt,
> and reusing some of the list constructs).  And a zero-copy cross-domain
> id.

I don't think we need to add anything to struct page.
This is supposed to be small cache of dma_mapped pages with lockless access.
It can be implemented as an array or link list where every element
is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to
send it to back to page allocator.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 22:21           ` Alexei Starovoitov
@ 2016-04-12  6:16             ` Jesper Dangaard Brouer
  2016-04-12 17:20               ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-12  6:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: lsf, James Bottomley, Sagi Grimberg, Tom Herbert, Brenden Blanco,
	Christoph Hellwig, linux-mm, netdev, Bart Van Assche, lsf-pc,
	brouer


On Mon, 11 Apr 2016 15:21:26 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> > 
> > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
> >   
[...]
> > > 
> > > If we go down this road how about also attaching some driver opaques
> > > to the page sets?  
> > 
> > That was the ultimate plan... to leave some opaques bytes left in the
> > page struct that drivers could use.
> > 
> > In struct page I would need a pointer back to my page_pool struct and a
> > page flag.  Then, I would need room to store the dma_unmap address.
> > (And then some of the usual fields are still needed, like the refcnt,
> > and reusing some of the list constructs).  And a zero-copy cross-domain
> > id.  
> 
> I don't think we need to add anything to struct page.
> This is supposed to be small cache of dma_mapped pages with lockless access.
> It can be implemented as an array or link list where every element
> is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to
> send it to back to page allocator.

It sounds like the Intel drivers recycle facility, where they split the
page into two parts, and keep page in RX-ring, by swapping to other
half of page, if page_count(page) is <= 2.  Thus, they use the atomic
page ref count to synchronize on.

Thus, we end-up having two atomic operations per RX packet, on the page
refcnt.  Where DPDK have zero...

By fully taking over the page as an allocator, almost like slab. I can
optimize the common case (of the packet-page getting allocated and
free'ed on the same CPU), and remove these atomic operations.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-11 22:02           ` Alexander Duyck
@ 2016-04-12  6:28             ` Jesper Dangaard Brouer
  2016-04-12 15:37               ` Alexander Duyck
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-12  6:28 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: lsf, James Bottomley, Sagi Grimberg, Tom Herbert, Brenden Blanco,
	Christoph Hellwig, linux-mm, netdev, Bart Van Assche, lsf-pc,
	Alexei Starovoitov, brouer


On Mon, 11 Apr 2016 15:02:51 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Have you taken a look at possibly trying to optimize the DMA pool API
> to work with pages?  It sounds like it is supposed to do something
> similar to what you are wanting to do.

Yes, I have looked at the mm/dmapool.c API. AFAIK this is for DMA
coherent memory (see use of dma_alloc_coherent/dma_free_coherent). 

What we are doing is "streaming" DMA memory, when processing the RX
ring.

(NIC are only using DMA coherent memory for the descriptors, which are
allocated on driver init)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-12  6:28             ` Jesper Dangaard Brouer
@ 2016-04-12 15:37               ` Alexander Duyck
  0 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2016-04-12 15:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf, James Bottomley, Sagi Grimberg, Tom Herbert, Brenden Blanco,
	Christoph Hellwig, linux-mm, netdev, Bart Van Assche, lsf-pc,
	Alexei Starovoitov

On Mon, Apr 11, 2016 at 11:28 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Mon, 11 Apr 2016 15:02:51 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> Have you taken a look at possibly trying to optimize the DMA pool API
>> to work with pages?  It sounds like it is supposed to do something
>> similar to what you are wanting to do.
>
> Yes, I have looked at the mm/dmapool.c API. AFAIK this is for DMA
> coherent memory (see use of dma_alloc_coherent/dma_free_coherent).
>
> What we are doing is "streaming" DMA memory, when processing the RX
> ring.
>
> (NIC are only using DMA coherent memory for the descriptors, which are
> allocated on driver init)

Yes, I know that but it shouldn't take much to extend the API to
provide the option for a streaming DMA mapping.  That was why I
thought you might want to look in this direction.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
  2016-04-12  6:16             ` Jesper Dangaard Brouer
@ 2016-04-12 17:20               ` Alexei Starovoitov
  0 siblings, 0 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2016-04-12 17:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf, James Bottomley, Sagi Grimberg, Tom Herbert, Brenden Blanco,
	Christoph Hellwig, linux-mm, netdev, Bart Van Assche, lsf-pc

On Tue, Apr 12, 2016 at 08:16:49AM +0200, Jesper Dangaard Brouer wrote:
> 
> On Mon, 11 Apr 2016 15:21:26 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
> > >   
> [...]
> > > > 
> > > > If we go down this road how about also attaching some driver opaques
> > > > to the page sets?  
> > > 
> > > That was the ultimate plan... to leave some opaques bytes left in the
> > > page struct that drivers could use.
> > > 
> > > In struct page I would need a pointer back to my page_pool struct and a
> > > page flag.  Then, I would need room to store the dma_unmap address.
> > > (And then some of the usual fields are still needed, like the refcnt,
> > > and reusing some of the list constructs).  And a zero-copy cross-domain
> > > id.  
> > 
> > I don't think we need to add anything to struct page.
> > This is supposed to be small cache of dma_mapped pages with lockless access.
> > It can be implemented as an array or link list where every element
> > is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to
> > send it to back to page allocator.
> 
> It sounds like the Intel drivers recycle facility, where they split the
> page into two parts, and keep page in RX-ring, by swapping to other
> half of page, if page_count(page) is <= 2.  Thus, they use the atomic
> page ref count to synchronize on.

actually I'm proposing the opposite. one page = one packet.
I'm perfectly happy to waste half a page, since number of such pages is small
and performance matter more. Typical performance vs memory tradeoff.

> Thus, we end-up having two atomic operations per RX packet, on the page
> refcnt.  Where DPDK have zero...

the page recycling cache should have zero atomic ops per packet
otherwise it's non starter.

> By fully taking over the page as an allocator, almost like slab. I can
> optimize the common case (of the packet-page getting allocated and
> free'ed on the same CPU), and remove these atomic operations.

slub is doing local cmpxchg. 40G networking cannot afford it per packet.
If it's amortized due to batching that will be ok.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2016-04-12 17:20 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1460034425.20949.7.camel@HansenPartnership.com>
2016-04-07 14:17 ` [LSF/MM TOPIC] Generic page-pool recycle facility? Jesper Dangaard Brouer
2016-04-07 14:38   ` [Lsf-pc] " Christoph Hellwig
2016-04-07 15:11     ` [Lsf] " Bart Van Assche
2016-04-10 18:45       ` Sagi Grimberg
2016-04-11 21:41         ` Jesper Dangaard Brouer
2016-04-11 22:02           ` Alexander Duyck
2016-04-12  6:28             ` Jesper Dangaard Brouer
2016-04-12 15:37               ` Alexander Duyck
2016-04-11 22:21           ` Alexei Starovoitov
2016-04-12  6:16             ` Jesper Dangaard Brouer
2016-04-12 17:20               ` Alexei Starovoitov
2016-04-07 15:48     ` Chuck Lever
2016-04-07 16:14       ` [Lsf-pc] [Lsf] " Rik van Riel
2016-04-07 19:43         ` [Lsf] [Lsf-pc] " Jesper Dangaard Brouer
2016-04-07 15:18   ` Eric Dumazet
2016-04-09  9:11     ` [Lsf] " Jesper Dangaard Brouer
2016-04-09 12:34       ` Eric Dumazet
2016-04-11 20:23         ` Jesper Dangaard Brouer
2016-04-11 21:27           ` Eric Dumazet
2016-04-07 19:48   ` Waskiewicz, PJ
2016-04-07 20:38     ` Jesper Dangaard Brouer
2016-04-08 16:12       ` Alexander Duyck
2016-04-11  8:58   ` [Lsf-pc] " Mel Gorman
2016-04-11 12:26     ` Jesper Dangaard Brouer
2016-04-11 13:08       ` Mel Gorman
2016-04-11 16:19         ` [Lsf] " Jesper Dangaard Brouer
2016-04-11 16:53           ` Eric Dumazet
2016-04-11 19:47             ` Jesper Dangaard Brouer
2016-04-11 21:14               ` Eric Dumazet
2016-04-11 18:07           ` Mel Gorman
2016-04-11 19:26             ` Jesper Dangaard Brouer
2016-04-11 16:20         ` Matthew Wilcox
2016-04-11 17:46           ` Thadeu Lima de Souza Cascardo
2016-04-11 18:37             ` Jesper Dangaard Brouer
2016-04-11 18:53               ` Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).