* Page allocator bottleneck
@ 2017-09-14 16:49 ` Tariq Toukan
  0 siblings, 0 replies; 31+ messages in thread
From: Tariq Toukan @ 2017-09-14 16:49 UTC (permalink / raw)
  To: David Miller, Jesper Dangaard Brouer, Mel Gorman, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Linux Kernel Network Developers, Andrew Morton, Michal Hocko,
	linux-mm

Hi all,

As part of the efforts to support increasing next-generation NIC speeds,
I am investigating SW bottlenecks in network stack receive flow.

Here I share some numbers I got for a simple experiment, in which I 
simulate the page allocation rate needed in 200Gbps NICs.

I ran the test below over 3 different (modified) mlx5 driver versions,
loaded on server side (RX):
1) RX page cache disabled, 2 packets per page.
2) RX page cache disabled, one packet per page.
3) Huge RX page cache, one packet per page.

All page allocations are of order 0.

NIC: ConnectX-5, 100 Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Test:
128 TCP streams (using super_netperf).
Changing num of RX queues.
HW LRO OFF, GRO ON, MTU 1500.
Observe: BW as a function of num RX queues.

Results:

Driver #1:
#rings	BW (Mbps)
1	23,813
2	44,086
3	62,128
4	78,058
6	94,210 (linerate)
8	94,205 (linerate)
12	94,202 (linerate)
16	94,191 (linerate)

Driver #2:
#rings	BW (Mbps)
1	18,835
2	36,716
3	50,521
4	61,746
6	63,637
8	60,299
12	51,048
16	43,337

Driver #3:
#rings	BW (Mbps)
1	19,316
2	44,850
3	69,549
4	87,434
6	94,342 (linerate)
8	94,350 (linerate)
12	94,327 (linerate)
16	94,327 (linerate)


Insights:
Major degradation between #1 and #2, not getting anywhere close to linerate!
The degradation is fixed between #2 and #3.
This is because the page allocator cannot sustain the higher allocation rate.
In #2, we also see that adding rings (cores) reduces BW (!!),
as a result of increasing congestion over shared resources.

Congestion in this case is very clear.
When monitored in perf top:
85.58% [kernel] [k] queued_spin_lock_slowpath

I think that page allocator issues should be discussed separately:
1) Rate: Increase the allocation rate on a single core.
2) Scalability: Reduce congestion and sync overhead between cores.

This is clearly the current bottleneck in the network stack receive flow.

I know about some efforts that were made in the past two years.
For example the ones from Jesper et al.:
- Page-pool (not accepted AFAIK).
- Page-allocation bulking.
- Optimize order-0 allocations in Per-Cpu-Pages.

I am not an mm expert, but wanted to raise the issue again, to combine 
the efforts and hear from you guys about status and possible directions.

Best regards,
Tariq Toukan

^ permalink raw reply	[flat|nested] 31+ messages in thread


* Re: Page allocator bottleneck
  2017-09-14 16:49 ` Tariq Toukan
@ 2017-09-14 20:19   ` Andi Kleen
  -1 siblings, 0 replies; 31+ messages in thread
From: Andi Kleen @ 2017-09-14 20:19 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David Miller, Jesper Dangaard Brouer, Mel Gorman, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Linux Kernel Network Developers, Andrew Morton, Michal Hocko,
	linux-mm

Tariq Toukan <tariqt@mellanox.com> writes:
>
> Congestion in this case is very clear.
> When monitored in perf top:
> 85.58% [kernel] [k] queued_spin_lock_slowpath

Please look at the callers. Spinlock profiles without callers
are usually useless because it's just blaming the messenger.

Most likely the PCP lists are too small for your extreme allocation
rate, so it goes back too often to the shared pool.

You can play with the vm.percpu_pagelist_fraction setting.
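
Roughly how that knob maps to the per-cpu list limits, from my reading of
mm/page_alloc.c (sketch only, combining pageset_set_high_and_batch() and
pageset_set_high(); details approximate):

/* Sketch (not verbatim kernel code): how vm.percpu_pagelist_fraction
 * translates into pcp->high and pcp->batch, per zone and per CPU. */
static void pcp_limits_sketch(struct zone *zone, struct per_cpu_pageset *pcp,
                              int fraction)
{
        unsigned long high, batch;

        if (fraction) {
                /* Each CPU may hold up to 1/fraction of the zone's pages. */
                high  = zone->managed_pages / fraction;
                batch = max(1UL, high / 4);
                if (high / 4 > PAGE_SHIFT * 8)
                        batch = PAGE_SHIFT * 8; /* batch capped (96 with 4K pages) */
        } else {
                batch = zone_batchsize(zone);   /* 31 for large zones */
                high  = 6 * batch;              /* 186 pages per CPU by default */
        }
        pageset_update(&pcp->pcp, high, batch);
}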

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread


* Re: Page allocator bottleneck
  2017-09-14 16:49 ` Tariq Toukan
  (?)
  (?)
@ 2017-09-15  7:28 ` Jesper Dangaard Brouer
  2017-09-17 16:16   ` Tariq Toukan
  -1 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-09-15  7:28 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David Miller, Mel Gorman, Eric Dumazet, Alexei Starovoitov,
	Saeed Mahameed, Eran Ben Elisha, Linux Kernel Network Developers,
	Andrew Morton, Michal Hocko, linux-mm, brouer

On Thu, 14 Sep 2017 19:49:31 +0300
Tariq Toukan <tariqt@mellanox.com> wrote:

> Hi all,
> 
> As part of the efforts to support increasing next-generation NIC speeds,
> I am investigating SW bottlenecks in network stack receive flow.
> 
> Here I share some numbers I got for a simple experiment, in which I 
> simulate the page allocation rate needed in 200Gbps NICs.

Thanks for bringing this up again. 

> I ran the test below over 3 different (modified) mlx5 driver versions,
> loaded on server side (RX):
> 1) RX page cache disabled, 2 packets per page.

Two packets per page basically reduces the overhead you see from the
page allocator by half.

> 2) RX page cache disabled, one packet per page.

This should stress the page allocator.

> 3) Huge RX page cache, one packet per page.

A driver-level page-cache will look nice, as long as it "works".

Drivers usually have no option other than basing their recycle facility
on the page refcount (as there is no destructor callback), which implies
packets/pages need to be returned quickly enough for it to work.
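
Roughly like this (illustrative sketch with made-up names, not any
driver's actual code):

/* Tiny refcnt-based RX page cache sketch (hypothetical names).
 * The SKB path holds an extra reference while the page is in flight;
 * the driver may only reuse the page once that reference is dropped,
 * i.e. page_ref_count() is back to 1. */
struct rx_page_cache {
        struct page     *ring[256];
        unsigned int    head;           /* consumer (refill path) */
        unsigned int    tail;           /* producer (completion path) */
};

static bool rx_cache_put(struct rx_page_cache *c, struct page *page)
{
        if (c->tail - c->head >= ARRAY_SIZE(c->ring))
                return false;                           /* cache full */
        c->ring[c->tail++ % ARRAY_SIZE(c->ring)] = page;
        return true;
}

static struct page *rx_cache_get(struct rx_page_cache *c)
{
        struct page *page;

        if (c->head == c->tail)
                return alloc_page(GFP_ATOMIC);          /* cache empty */

        page = c->ring[c->head % ARRAY_SIZE(c->ring)];
        if (page_ref_count(page) != 1)
                return alloc_page(GFP_ATOMIC);          /* still in flight */

        c->head++;
        return page;
}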

> All page allocations are of order 0.
> 
> NIC: Connectx-5 100 Gbps.
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> 
> Test:
> 128 TCP streams (using super_netperf).
> Changing num of RX queues.
> HW LRO OFF, GRO ON, MTU 1500.

TCP streams with GRO are actually a good stress test for the page
allocator (or the driver's page-recycle cache), as Eric Dumazet has made
some nice optimizations that (in most situations) cause us to quickly
free/recycle the SKB (coming from the driver) and store the pages in one
SKB. This causes us to hit the SLUB fastpath for the SKBs, but once the
pages need to be freed, it stresses the page allocator more.

Also be aware that with TCP flows, the packets are likely delivered
into a socket that is consumed on another CPU.  Thus, the pages are
allocated on one CPU and freed on another. AFAIK this stresses the
order-0 PCP (Per-Cpu-Pages) cache.


> Observe: BW as a function of num RX queues.
> 
> Results:
> 
> Driver #1:
> #rings	BW (Mbps)
> 1	23,813
> 2	44,086
> 3	62,128
> 4	78,058
> 6	94,210 (linerate)
> 8	94,205 (linerate)
> 12	94,202 (linerate)
> 16	94,191 (linerate)
> 
> Driver #2:
> #rings	BW (Mbps)
> 1	18,835
> 2	36,716
> 3	50,521
> 4	61,746
> 6	63,637
> 8	60,299
> 12	51,048
> 16	43,337
> 
> Driver #3:
> #rings	BW (Mbps)
> 1	19,316
> 2	44,850
> 3	69,549
> 4	87,434
> 6	94,342 (linerate)
> 8	94,350 (linerate)
> 12	94,327 (linerate)
> 16	94,327 (linerate)
> 
> 
> Insights:
> Major degradation between #1 and #2, not getting any close to linerate!
> Degradation is fixed between #2 and #3.
> This is because page allocator cannot stand the higher allocation rate.
> In #2, we also see that the addition of rings (cores) reduces BW (!!), 
> as result of increasing congestion over shared resources.
> 
> Congestion in this case is very clear.
> When monitored in perf top:
> 85.58% [kernel] [k] queued_spin_lock_slowpath

Well, we obviously need to know the caller of the spin_lock.  In this
case it is likely the page allocator lock.  It could also be the TCP
socket locks, but given GRO is enabled, they should be hit much less.


> I think that page allocator issues should be discussed separately:
> 1) Rate: Increase the allocation rate on a single core.
> 2) Scalability: Reduce congestion and sync overhead between cores.

Yes, but this is no small task.  It is on my TODO-list (emacs org-mode),
but I have other tasks with higher priority atm.  I'll be working
on XDP_REDIRECT for the next many months.  Currently I am trying to
convince people that we should do an explicit packet-page return/free
callback (which would avoid many of these issues).


> This is clearly the current bottleneck in the network stack receive
> flow.
> 
> I know about some efforts that were made in the past two years.
> For example the ones from Jesper et al.:
>
> - Page-pool (not accepted AFAIK).

The page-pool has many purposes:
 1. generic page-cache for drivers,
 2. keep pages DMA-mapped
 3. facilitate drivers to change RX-ring memory model

From an MM point of view, the page pool is just a destructor callback
that can "steal" the page.

If I can convince XDP_REDIRECT to use an explicit destructor callback,
then I almost get what I need, except for the generic part; the normal
network path will not see the benefit.  Thus, it would not help your
use-case, I guess.


> - Page-allocation bulking.

Notice that page-allocator bulking would still be needed by the
page-pool and other page-cache facilities. We should implement it
regardless of the page_pool.

Even without a page pool facility to hide the use of page bulking, you
could use page-bulk-alloc in the driver RX-ring refill, find where TCP
frees the GRO packets, and do page-bulk-free there.
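
E.g. roughly like this in the refill path (sketch; alloc_pages_bulk()
and the rx_ring helpers are hypothetical, just to show where a bulk API
would slot in):

/* Sketch: RX-ring refill with a hypothetical alloc_pages_bulk() API
 * (no such API in mainline yet); rx_ring_post_page() stands in for the
 * driver-specific descriptor setup. */
static int rx_ring_refill(struct rx_ring *ring, unsigned int missing)
{
        struct page *pages[64];
        unsigned int n, i;

        while (missing) {
                n = min_t(unsigned int, missing, ARRAY_SIZE(pages));

                /* One trip into the page allocator for up to 64 pages,
                 * instead of 64 separate alloc_page() calls. */
                n = alloc_pages_bulk(GFP_ATOMIC, n, pages);
                if (!n)
                        return -ENOMEM;

                for (i = 0; i < n; i++)
                        rx_ring_post_page(ring, pages[i]);
                missing -= n;
        }
        return 0;
}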


> - Optimize order-0 allocations in Per-Cpu-Pages.

There is a need to optimize PCP some more for the single-core XDP
performance target (~14Mpps).  I guess the easiest way around this is to
implement/integrate a page bulk API into PCP.

The TCP-GRO use-case you are hitting is a different bottleneck.
It is a multi-CPU parallel workload that exceeds the PCP cache size
and causes you to hit the buddy allocator.

I wonder if you could "solve"/mitigate the issue if you tune the size
of the PCP cache?
AFAIK it only keeps 128 pages cached per CPU... I know you can see this
via a proc file, but I cannot remember which(?).  And I'm not sure how
you tune this(?)


> I am not an mm expert, but wanted to raise the issue again, to combine 
> the efforts and hear from you guys about status and possible directions.

Regarding recent changes... if you have your kernel compiled with
CONFIG_NUMA then the page-allocator is slower (due to keeping
numa-stats), except that this was recently optimized and merged(?)

What (exact) kernel git tree did you run these tests on?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-09-14 16:49 ` Tariq Toukan
                   ` (2 preceding siblings ...)
  (?)
@ 2017-09-15 10:23 ` Mel Gorman
  2017-09-18  9:16   ` Tariq Toukan
  -1 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2017-09-15 10:23 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Linux Kernel Network Developers, Andrew Morton, Michal Hocko,
	linux-mm

On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
> Insights:
> Major degradation between #1 and #2, not getting any close to linerate!
> Degradation is fixed between #2 and #3.
> This is because page allocator cannot stand the higher allocation rate.
> In #2, we also see that the addition of rings (cores) reduces BW (!!), as
> result of increasing congestion over shared resources.
> 

Unfortunately, no surprises there. 

> Congestion in this case is very clear.
> When monitored in perf top:
> 85.58% [kernel] [k] queued_spin_lock_slowpath
> 

While it's not proven, the most likely candidate is the zone lock and
that should be confirmed using a call-graph profile. If so, then the
suggestion to tune the size of the per-cpu allocator would mitigate
the problem.

> I think that page allocator issues should be discussed separately:
> 1) Rate: Increase the allocation rate on a single core.
> 2) Scalability: Reduce congestion and sync overhead between cores.
> 
> This is clearly the current bottleneck in the network stack receive flow.
> 
> I know about some efforts that were made in the past two years.
> For example the ones from Jesper et al.:
> - Page-pool (not accepted AFAIK).

Indeed not and it would also need driver conversion.

> - Page-allocation bulking.

Prototypes exist but it's pointless without the pool or driver
conversion, so it's on the back burner for the moment.

> - Optimize order-0 allocations in Per-Cpu-Pages.
> 

This had a prototype that was reverted as it must be able to cope with
both irq and noirq contexts. Unfortunately I never found the time to
revisit it but a split there to handle both would mitigate the problem.
Probably not enough to actually reach line speed though so tuning of the
per-cpu allocator sizes would still be needed. I don't know when I'll
get the chance to revisit it. I'm travelling all next week and am mostly
occupied with other work at the moment that is consuming all my
concentration.

> I am not an mm expert, but wanted to raise the issue again, to combine the
> efforts and hear from you guys about status and possible directions.

The recent effort to reduce overhead from stats will help mitigate the
problem. Finishing the page pool, the bulk allocator and converting drivers
would be the most likely successful path forward but it's currently stalled
as everyone that was previously involved is too busy.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-09-14 20:19   ` Andi Kleen
  (?)
@ 2017-09-17 15:43   ` Tariq Toukan
  -1 siblings, 0 replies; 31+ messages in thread
From: Tariq Toukan @ 2017-09-17 15:43 UTC (permalink / raw)
  To: Andi Kleen, Tariq Toukan
  Cc: David Miller, Jesper Dangaard Brouer, Mel Gorman, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Linux Kernel Network Developers, Andrew Morton, Michal Hocko,
	linux-mm



On 14/09/2017 11:19 PM, Andi Kleen wrote:
> Tariq Toukan <tariqt@mellanox.com> writes:
>>
>> Congestion in this case is very clear.
>> When monitored in perf top:
>> 85.58% [kernel] [k] queued_spin_lock_slowpath
> 
> Please look at the callers. Spinlock profiles without callers
> are usually useless because it's just blaming the messenger.
> 
> Most likely the PCP lists are too small for your extreme allocation
> rate, so it goes back too often to the shared pool.
> 
> You can play with the vm.percpu_pagelist_fraction setting.

Thanks Andi.
That was my initial guess, but I wasn't familiar with these VM tunables,
so I couldn't verify it.
Indeed, the bottleneck is relieved when increasing the PCP size, and BW
becomes significantly better.

> 
> -Andi
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-09-15  7:28 ` Jesper Dangaard Brouer
@ 2017-09-17 16:16   ` Tariq Toukan
  2017-09-18  7:34     ` Aaron Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Tariq Toukan @ 2017-09-17 16:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Tariq Toukan
  Cc: David Miller, Mel Gorman, Eric Dumazet, Alexei Starovoitov,
	Saeed Mahameed, Eran Ben Elisha, Linux Kernel Network Developers,
	Andrew Morton, Michal Hocko, linux-mm



On 15/09/2017 10:28 AM, Jesper Dangaard Brouer wrote:
> On Thu, 14 Sep 2017 19:49:31 +0300
> Tariq Toukan <tariqt@mellanox.com> wrote:
> 
>> Hi all,
>>
>> As part of the efforts to support increasing next-generation NIC speeds,
>> I am investigating SW bottlenecks in network stack receive flow.
>>
>> Here I share some numbers I got for a simple experiment, in which I
>> simulate the page allocation rate needed in 200Gbps NICs.
> 
> Thanks for bringing this up again.

Sure. We need to keep up with the increasing NIC speeds.

> 
>> I ran the test below over 3 different (modified) mlx5 driver versions,
>> loaded on server side (RX):
>> 1) RX page cache disabled, 2 packets per page.
> 
> 2 packets per page basically reduce the overhead you see from the page
> allocator to half.
> 
>> 2) RX page cache disabled, one packet per page.
> 
> This, should stress the page allocator.
> 
>> 3) Huge RX page cache, one packet per page.
> 
> A driver level page-cache will look nice, as long as it "works".

I verified that it worked in the experiment.

> 
> Drivers usually have no other option than basing their recycle facility
> to be based on the page-refcnt (as there is no destructor callback).
> Which implies packets/pages need to be returned quickly enough for it
> to work.

Yes, that's how our current default (small) RX page-cache is
implemented. Unfortunately, the timing and conditions for a fair reuse
rate are not always satisfied.

> 
>> All page allocations are of order 0.
>>
>> NIC: Connectx-5 100 Gbps.
>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>
>> Test:
>> 128 TCP streams (using super_netperf).
>> Changing num of RX queues.
>> HW LRO OFF, GRO ON, MTU 1500.
> 
> With TCP streams and GRO, is actually a good stress test for the page
> allocator (or drivers page-recycle cache). As  Eric Dumazet have made
> some nice optimizations, that (in most situations) cause us to quickly
> free/recycle the SKB (coming from driver) and store the pages in 1-SKB.
> This cause us to hit the SLUB fastpath for the SKBs, but once the pages
> need to be free'ed this stress the page allocator more.

Yep, bulking would help here, as you mention below.

> 
> Also be aware that with TCP flows, the packets are likely delivered
> into a socket, that is consumed on another CPU.  Thus, the pages are
> allocated on one CPU and free'ed on another. AFAIK this stress the
> order-0 cache PCP (Per-Cpu-Pages).
> 
Good point.
Do you know of any tool/kernel counters that help observe and quantify 
this behavior?

> 
>> Observe: BW as a function of num RX queues.
>>
>> Results:
>>
>> Driver #1:
>> #rings	BW (Mbps)
>> 1	23,813
>> 2	44,086
>> 3	62,128
>> 4	78,058
>> 6	94,210 (linerate)
>> 8	94,205 (linerate)
>> 12	94,202 (linerate)
>> 16	94,191 (linerate)
>>
>> Driver #2:
>> #rings	BW (Mbps)
>> 1	18,835
>> 2	36,716
>> 3	50,521
>> 4	61,746
>> 6	63,637
>> 8	60,299
>> 12	51,048
>> 16	43,337
>>
>> Driver #3:
>> #rings	BW (Mbps)
>> 1	19,316
>> 2	44,850
>> 3	69,549
>> 4	87,434
>> 6	94,342 (linerate)
>> 8	94,350 (linerate)
>> 12	94,327 (linerate)
>> 16	94,327 (linerate)
>>
>>
>> Insights:
>> Major degradation between #1 and #2, not getting any close to linerate!
>> Degradation is fixed between #2 and #3.
>> This is because page allocator cannot stand the higher allocation rate.
>> In #2, we also see that the addition of rings (cores) reduces BW (!!),
>> as result of increasing congestion over shared resources.
>>
>> Congestion in this case is very clear.
>> When monitored in perf top:
>> 85.58% [kernel] [k] queued_spin_lock_slowpath
> 
> Well, we obviously need to know the caller of the spin_lock.  In this
> case it is likely the page allocator lock.  It could also be the TCP
> socket locks, but given GRO is enabled, they should be hit much less.
> 

It is the page allocator lock.
I verified this based on Andi's suggestion, see other mail.

It's nice to have the option to dynamically play with the parameter.
But maybe we should also think of changing the default fraction 
guaranteed to the PCP, so that unaware admins of networking servers 
would also benefit.

> 
>> I think that page allocator issues should be discussed separately:
>> 1) Rate: Increase the allocation rate on a single core.
>> 2) Scalability: Reduce congestion and sync overhead between cores.
> 
> Yes, but this no small task.  I is on my TODO-list (emacs org-mode),
> but I have other tasks that have higher priority atm.  I'll be working
> on XDP_REDIRECT for the next many months.  Currently trying to convince
> people that we do an explicit packet-page return/free callback (which
> would avoid many of these issues).
> 
> 
>> This is clearly the current bottleneck in the network stack receive
>> flow.
>>
>> I know about some efforts that were made in the past two years.
>> For example the ones from Jesper et al.:
>>
>> - Page-pool (not accepted AFAIK).
> 
> The page-pool have many purposes.
>   1. generic page-cache for drivers,
>   2. keep pages DMA-mapped
>   3. facilitate drivers to change RX-ring memory model
> 
>  From a MM-point-of-view the page pool is just a destructor callback,
> that can "steal" the page.
> 
> If I can convince XDP_REDIRECT to use an explicit destructor callback,
> then I almost get what I need.  Except for the generic part, and the
> normal network path will not see the benefit.  Thus, not helping your
> use-case, I guess.
> 

I see.

> 
>> - Page-allocation bulking.
> 
> Notice, that page-allocator bulking, would still be needed by the
> page-pool and other page-cache facilities. We should implement it
> regardless of the page_pool.

I agree.
It fits perfectly with our Striding RQ feature, in which each RX 
descriptor is relatively large and serves multiple received packets, 
requiring the allocation of many order-0 pages.

> 
> Without a page pool facility to hide the use of page bulking.  You
> could use page-bulk-alloc in driver RX-ring refill, and find where TCP
> free the GRO packets, and do page-bulk-free there.
> 

Exactly.

> 
>> - Optimize order-0 allocations in Per-Cpu-Pages.
> 
> There is a need to optimize PCP some more for the single-core XDP
> performance target (~14Mpps).  I guess, the easiest way around this is
> implement/integrate a page bulk API into PCP.
> 
> The TCP-GRO use-case you are hitting is a different bottleneck.
> It is a multi-CPU parallel workload, that exceed the PCP cache size,
> and cause you to hit the page buddy allocator.
> 
Indeed, I verified that.

> I wonder if you could "solve"/mitigate the issue if you tune the size
> of the PCP cache?
> AFAIK it only keeps 128 pages cached per CPU... I know you can see this
> via a proc file, but I cannot remember which(?).  And I'm not sure how
> you tune this(?)
> 

/proc/sys/vm/percpu_pagelist_fraction

> 
>> I am not an mm expert, but wanted to raise the issue again, to combine
>> the efforts and hear from you guys about status and possible directions.
> 
> Regarding recent changes... if you have your kernel compiled with
> CONFIG_NUMA then the page-allocator is slower (due to keeping
> numa-stats), except that this was recently optimized and merged(?)
> 
Yes, it is.
Sounds useful, I should get familiar with these stats.
Do you know how to observe them?

> What (exact) kernel git tree did you run these tests on?
> 
I had a few mlx5 driver patches on top of:
96e5ae4e76f1 bpf: fix numa_node validation

Many thanks!

Regards,
Tariq


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-09-17 16:16   ` Tariq Toukan
@ 2017-09-18  7:34     ` Aaron Lu
  2017-09-18  7:44       ` Aaron Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Lu @ 2017-09-18  7:34 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Jesper Dangaard Brouer, David Miller, Mel Gorman, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Linux Kernel Network Developers, Andrew Morton, Michal Hocko,
	linux-mm, Dave Hansen

[-- Attachment #1: Type: text/plain, Size: 1563 bytes --]

On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote:
> 
> It's nice to have the option to dynamically play with the parameter.
> But maybe we should also think of changing the default fraction guaranteed
> to the PCP, so that unaware admins of networking servers would also benefit.

I collected some performance data with will-it-scale/page_fault1 process
mode on different machines with different pcp->batch sizes, starting
from the default 31 (calculated by zone_batchsize(); 31 is the standard
value for any zone that has more than 1/2MiB memory), then incremented
in steps of 31 up to 527. PCP's upper limit is 6*batch.
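
For reference, that default comes from zone_batchsize(); roughly (sketch
from my reading of mm/page_alloc.c, CONFIG_MMU case):

/* Sketch of zone_batchsize(), showing where the default of 31 comes from. */
static int zone_batchsize_sketch(struct zone *zone)
{
        int batch;

        /* ~0.1% of the zone, capped at 512KiB worth of pages (128). */
        batch = zone->managed_pages / 1024;
        if (batch * PAGE_SIZE > 512 * 1024)
                batch = (512 * 1024) / PAGE_SIZE;
        batch /= 4;             /* pcp->high ends up at 6 * batch */
        if (batch < 1)
                batch = 1;

        /* Clamp to 2^n - 1: 32 -> rounddown_pow_of_two(48) - 1 = 31. */
        return rounddown_pow_of_two(batch + batch / 2) - 1;
}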

An image is plotted and attached: batch_full.png (full here means the
number of processes started equals the number of CPUs).

From the image:
- For EX machines, they all see throughput increase with increased batch
  size and peaked at around batch_size=310, then fall;
- For EP machines, Haswell-EP and Broadwell-EP also see throughput
  increase with increased batch size and peaked at batch_size=279, then
  fall, batch_size=310 also delivers pretty good result. Skylake-EP is
  quite different in that it doesn't see any obvious throughput increase
  after batch_size=93, though the trend is still increasing, but in a very
  small way and finally peaked at batch_size=403, then fall.
  Ivybridge EP behaves much like desktop ones.
- For Desktop machines, they do not see any obvious changes with
  increased batch_size.

So the default batch size (31) doesn't deliver a good enough result; we
probably should change the default value.

[-- Attachment #2: batch_full.png --]
[-- Type: image/png, Size: 25626 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-09-18  7:34     ` Aaron Lu
@ 2017-09-18  7:44       ` Aaron Lu
  2017-09-18 15:33         ` Tariq Toukan
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Lu @ 2017-09-18  7:44 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Jesper Dangaard Brouer, David Miller, Mel Gorman, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Linux Kernel Network Developers, Andrew Morton, Michal Hocko,
	linux-mm, Dave Hansen

On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote:
> On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote:
> > 
> > It's nice to have the option to dynamically play with the parameter.
> > But maybe we should also think of changing the default fraction guaranteed
> > to the PCP, so that unaware admins of networking servers would also benefit.
> 
> I collected some performance data with will-it-scale/page_fault1 process
> mode on different machines with different pcp->batch sizes, starting
> from the default 31(calculated by zone_batchsize(), 31 is the standard
> value for any zone that has more than 1/2MiB memory), then incremented
> by 31 upwards till 527. PCP's upper limit is 6*batch.
> 
> An image is plotted and attached: batch_full.png(full here means the
> number of process started equals to CPU number).

To be clear: the X-axis is the value of the batch size (31, 62, 93, ..., 527),
the Y-axis is the value of per_process_ops generated by will-it-scale;
higher is better.

> 
> From the image:
> - For EX machines, they all see throughput increase with increased batch
>   size and peaked at around batch_size=310, then fall;
> - For EP machines, Haswell-EP and Broadwell-EP also see throughput
>   increase with increased batch size and peaked at batch_size=279, then
>   fall, batch_size=310 also delivers pretty good result. Skylake-EP is
>   quite different in that it doesn't see any obvious throughput increase
>   after batch_size=93, though the trend is still increasing, but in a very
>   small way and finally peaked at batch_size=403, then fall.
>   Ivybridge EP behaves much like desktop ones.
> - For Desktop machines, they do not see any obvious changes with
>   increased batch_size.
> 
> So the default batch size(31) doesn't deliver good enough result, we
> probbaly should change the default value.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-09-15 10:23 ` Mel Gorman
@ 2017-09-18  9:16   ` Tariq Toukan
  2017-11-02 17:21       ` Tariq Toukan
  0 siblings, 1 reply; 31+ messages in thread
From: Tariq Toukan @ 2017-09-18  9:16 UTC (permalink / raw)
  To: Mel Gorman, Tariq Toukan
  Cc: David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Linux Kernel Network Developers, Andrew Morton, Michal Hocko,
	linux-mm



On 15/09/2017 1:23 PM, Mel Gorman wrote:
> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>> Insights: Major degradation between #1 and #2, not getting any
>> close to linerate! Degradation is fixed between #2 and #3. This is
>> because page allocator cannot stand the higher allocation rate. In
>> #2, we also see that the addition of rings (cores) reduces BW (!!),
>> as result of increasing congestion over shared resources.
>> 
> 
> Unfortunately, no surprises there.
> 
>> Congestion in this case is very clear. When monitored in perf top: 
>> 85.58% [kernel] [k] queued_spin_lock_slowpath
>> 
> 
> While it's not proven, the most likely candidate is the zone lock
> and that should be confirmed using a call-graph profile. If so, then
> the suggestion to tune to the size of the per-cpu allocator would
> mitigate the problem.
> 
Indeed, I tuned the per-cpu allocator and the bottleneck is relieved.

>> I think that page allocator issues should be discussed separately: 
>> 1) Rate: Increase the allocation rate on a single core. 2)
>> Scalability: Reduce congestion and sync overhead between cores.
>> 
>> This is clearly the current bottleneck in the network stack receive
>> flow.
>> 
>> I know about some efforts that were made in the past two years. For
>> example the ones from Jesper et al.: - Page-pool (not accepted
>> AFAIK).
> 
> Indeed not and it would also need driver conversion.
> 
>> - Page-allocation bulking.
> 
> Prototypes exist but it's pointless without the pool or driver 
> conversion so it's in the back burner for the moment.
> 

As I already mentioned in another reply (to Jesper), this would
perfectly fit with our Striding RQ feature, as we have large descriptors
that serve several packets, requiring the allocation of several pages at
once. I'd gladly move to using the bulking API.

>> - Optimize order-0 allocations in Per-Cpu-Pages.
>> 
> 
> This had a prototype that was reverted as it must be able to cope
> with both irq and noirq contexts.
Yeah, I remember that I tested and reported the issue.

> Unfortunately I never found the time to
> revisit it but a split there to handle both would mitigate the
> problem. Probably not enough to actually reach line speed though so
> tuning of the per-cpu allocator sizes would still be needed. I don't
> know when I'll get the chance to revisit it. I'm travelling all next
> week and am mostly occupied with other work at the moment that is
> consuming all my concentration.
> 
>> I am not an mm expert, but wanted to raise the issue again, to
>> combine the efforts and hear from you guys about status and
>> possible directions.
> 
> The recent effort to reduce overhead from stats will help mitigate
> the problem.
I should get more familiar with these stats, check how costly they are, 
and whether they can be turned off in Kconfig.

> Finishing the page pool, the bulk allocator and converting drivers 
> would be the most likely successful path forward but it's currently
> stalled as everyone that was previously involved is too busy.
> 
I think we should consider changing the default PCP fraction as well,
or implementing some smart dynamic heuristic.
This turned out to have a significant effect on networking performance.

Many thanks Mel!

Regards,
Tariq


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-09-18  7:44       ` Aaron Lu
@ 2017-09-18 15:33         ` Tariq Toukan
  2017-09-19  7:23             ` Aaron Lu
  0 siblings, 1 reply; 31+ messages in thread
From: Tariq Toukan @ 2017-09-18 15:33 UTC (permalink / raw)
  To: Aaron Lu, Tariq Toukan
  Cc: Jesper Dangaard Brouer, David Miller, Mel Gorman, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Linux Kernel Network Developers, Andrew Morton, Michal Hocko,
	linux-mm, Dave Hansen



On 18/09/2017 10:44 AM, Aaron Lu wrote:
> On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote:
>> On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote:
>>>
>>> It's nice to have the option to dynamically play with the parameter.
>>> But maybe we should also think of changing the default fraction guaranteed
>>> to the PCP, so that unaware admins of networking servers would also benefit.
>>
>> I collected some performance data with will-it-scale/page_fault1 process
>> mode on different machines with different pcp->batch sizes, starting
>> from the default 31(calculated by zone_batchsize(), 31 is the standard
>> value for any zone that has more than 1/2MiB memory), then incremented
>> by 31 upwards till 527. PCP's upper limit is 6*batch.
>>
>> An image is plotted and attached: batch_full.png(full here means the
>> number of process started equals to CPU number).
> 
> To be clear: X-axis is the value of batch size(31, 62, 93, ..., 527),
> Y-axis is the value of per_process_ops, generated by will-it-scale,
> higher is better.
> 
>>
>>  From the image:
>> - For EX machines, they all see throughput increase with increased batch
>>    size and peaked at around batch_size=310, then fall;
>> - For EP machines, Haswell-EP and Broadwell-EP also see throughput
>>    increase with increased batch size and peaked at batch_size=279, then
>>    fall, batch_size=310 also delivers pretty good result. Skylake-EP is
>>    quite different in that it doesn't see any obvious throughput increase
>>    after batch_size=93, though the trend is still increasing, but in a very
>>    small way and finally peaked at batch_size=403, then fall.
>>    Ivybridge EP behaves much like desktop ones.
>> - For Desktop machines, they do not see any obvious changes with
>>    increased batch_size.
>>
>> So the default batch size(31) doesn't deliver good enough result, we
>> probbaly should change the default value.

Thanks Aaron for sharing your experiment results.
That's a good analysis of the effect of the batch value.
I agree with your conclusion.

From a networking perspective, we should reconsider the defaults to be
able to reach the increasing NIC linerates.
Not only for pcp->batch, but also for pcp->high.

Regards,
Tariq


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-09-18 15:33         ` Tariq Toukan
@ 2017-09-19  7:23             ` Aaron Lu
  0 siblings, 0 replies; 31+ messages in thread
From: Aaron Lu @ 2017-09-19  7:23 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Jesper Dangaard Brouer, David Miller, Mel Gorman, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Linux Kernel Network Developers, Andrew Morton, Michal Hocko,
	linux-mm, Dave Hansen

[-- Attachment #1: Type: text/plain, Size: 3473 bytes --]

On Mon, Sep 18, 2017 at 06:33:20PM +0300, Tariq Toukan wrote:
> 
> 
> On 18/09/2017 10:44 AM, Aaron Lu wrote:
> > On Mon, Sep 18, 2017 at 03:34:47PM +0800, Aaron Lu wrote:
> > > On Sun, Sep 17, 2017 at 07:16:15PM +0300, Tariq Toukan wrote:
> > > > 
> > > > It's nice to have the option to dynamically play with the parameter.
> > > > But maybe we should also think of changing the default fraction guaranteed
> > > > to the PCP, so that unaware admins of networking servers would also benefit.
> > > 
> > > I collected some performance data with will-it-scale/page_fault1 process
> > > mode on different machines with different pcp->batch sizes, starting
> > > from the default 31(calculated by zone_batchsize(), 31 is the standard
> > > value for any zone that has more than 1/2MiB memory), then incremented
> > > by 31 upwards till 527. PCP's upper limit is 6*batch.
> > > 
> > > An image is plotted and attached: batch_full.png(full here means the
> > > number of process started equals to CPU number).
> > 
> > To be clear: X-axis is the value of batch size(31, 62, 93, ..., 527),
> > Y-axis is the value of per_process_ops, generated by will-it-scale,

One correction here, Y-axis isn't per_process_ops but per_process_ops *
nr_processes. Still, higher is better.

> > higher is better.
> > 
> > > 
> > >  From the image:
> > > - For EX machines, they all see throughput increase with increased batch
> > >    size and peaked at around batch_size=310, then fall;
> > > - For EP machines, Haswell-EP and Broadwell-EP also see throughput
> > >    increase with increased batch size and peaked at batch_size=279, then
> > >    fall, batch_size=310 also delivers pretty good result. Skylake-EP is
> > >    quite different in that it doesn't see any obvious throughput increase
> > >    after batch_size=93, though the trend is still increasing, but in a very
> > >    small way and finally peaked at batch_size=403, then fall.
> > >    Ivybridge EP behaves much like desktop ones.
> > > - For Desktop machines, they do not see any obvious changes with
> > >    increased batch_size.
> > > 
> > > So the default batch size(31) doesn't deliver good enough result, we
> > > probbaly should change the default value.
> 
> Thanks Aaron for sharing your experiment results.
> That's a good analysis of the effect of the batch value.
> I agree with your conclusion.
> 
> From networking perspective, we should reconsider the defaults to be able to
> reach the increasing NICs linerates.
> Not only for pcp->batch, but also for pcp->high.

I guess I didn't make it clear in my last email: when pcp->batch is
changed, pcp->high is also changed. Their relationship is:
pcp->high = pcp->batch * 6.

Manipulating percpu_pagelist_fraction could increase pcp->high, but not
pcp->batch (which currently has an upper limit of 96).

My test shows that even when pcp->high stays the same, changing pcp->batch
could further improve will-it-scale's performance. E.g. in the two cases
below, pcp->high is set to 1860 in both, but with different pcp->batch:

                 will-it-scale    native_queued_spin_lock_slowpath(perf)
pcp->batch=96    15762348         79.95%
pcp->batch=310   19291492 +22.3%  74.87% -5.1%

Granted, this is the case for will-it-scale and may not apply to your
case. I have a small patch that adds a batch interface for debugging
purposes: echoing a value sets batch, and high will be batch * 6. You
are welcome to give it a try if you think it's worthwhile (attached).

Regards,
Aaron

[-- Attachment #2: 0001-percpu_pagelist_batch-add-a-batch-interface.patch --]
[-- Type: text/plain, Size: 3764 bytes --]

From e3c9516beb8302cb8fb2f5ab866bbe2686fda5fb Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Thu, 6 Jul 2017 15:00:07 +0800
Subject: [PATCH] percpu_pagelist_batch: add a batch interface

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 include/linux/mmzone.h |  2 ++
 kernel/sysctl.c        |  9 +++++++++
 mm/page_alloc.c        | 40 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef6a13b7bd3e..0548d038b7cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -875,6 +875,8 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int percpu_pagelist_batch_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a76cc3..85cc4544db1b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -108,6 +108,7 @@ extern unsigned int core_pipe_limit;
 extern int pid_max;
 extern int pid_max_min, pid_max_max;
 extern int percpu_pagelist_fraction;
+extern int percpu_pagelist_batch;
 extern int latencytop_enabled;
 extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
@@ -1440,6 +1441,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= percpu_pagelist_fraction_sysctl_handler,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "percpu_pagelist_batch",
+		.data		= &percpu_pagelist_batch,
+		.maxlen		= sizeof(percpu_pagelist_batch),
+		.mode		= 0644,
+		.proc_handler	= percpu_pagelist_batch_sysctl_handler,
+		.extra1		= &zero,
+	},
 #ifdef CONFIG_MMU
 	{
 		.procname	= "max_map_count",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2302f250d6b1..aa96a4bd6467 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,6 +129,7 @@ unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;
 
 int percpu_pagelist_fraction;
+int percpu_pagelist_batch;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
 /*
@@ -5477,7 +5478,8 @@ static void pageset_set_high_and_batch(struct zone *zone,
 			(zone->managed_pages /
 				percpu_pagelist_fraction));
 	else
-		pageset_set_batch(pcp, zone_batchsize(zone));
+		pageset_set_batch(pcp, percpu_pagelist_batch ?
+				percpu_pagelist_batch : zone_batchsize(zone));
 }
 
 static void __meminit zone_pageset_init(struct zone *zone, int cpu)
@@ -7157,6 +7159,42 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int percpu_pagelist_batch_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	struct zone *zone;
+	int old_percpu_pagelist_batch;
+	int ret;
+
+	mutex_lock(&pcp_batch_high_lock);
+	old_percpu_pagelist_batch = percpu_pagelist_batch;
+
+	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (!write || ret < 0)
+		goto out;
+
+	/* Sanity checking to avoid pcp imbalance */
+	if (percpu_pagelist_batch <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* No change? */
+	if (percpu_pagelist_batch == old_percpu_pagelist_batch)
+		goto out;
+
+	for_each_populated_zone(zone) {
+		unsigned int cpu;
+
+		for_each_possible_cpu(cpu)
+			pageset_set_high_and_batch(zone,
+					per_cpu_ptr(zone->pageset, cpu));
+	}
+out:
+	mutex_unlock(&pcp_batch_high_lock);
+	return ret;
+}
+
 #ifdef CONFIG_NUMA
 int hashdist = HASHDIST_DEFAULT;
 
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 31+ messages in thread


* Re: Page allocator bottleneck
  2017-09-18  9:16   ` Tariq Toukan
@ 2017-11-02 17:21       ` Tariq Toukan
  0 siblings, 0 replies; 31+ messages in thread
From: Tariq Toukan @ 2017-11-02 17:21 UTC (permalink / raw)
  To: Tariq Toukan, Linux Kernel Network Developers, linux-mm
  Cc: Mel Gorman, David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko



On 18/09/2017 12:16 PM, Tariq Toukan wrote:
> 
> 
> On 15/09/2017 1:23 PM, Mel Gorman wrote:
>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>>> Insights: Major degradation between #1 and #2, not getting any
>>> close to linerate! Degradation is fixed between #2 and #3. This is
>>> because page allocator cannot stand the higher allocation rate. In
>>> #2, we also see that the addition of rings (cores) reduces BW (!!),
>>> as result of increasing congestion over shared resources.
>>>
>>
>> Unfortunately, no surprises there.
>>
>>> Congestion in this case is very clear. When monitored in perf top: 
>>> 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>
>>
>> While it's not proven, the most likely candidate is the zone lock
>> and that should be confirmed using a call-graph profile. If so, then
>> the suggestion to tune to the size of the per-cpu allocator would
>> mitigate the problem.
>>
> Indeed, I tuned the per-cpu allocator and bottleneck is released.
> 

Hi all,

After leaving this task aside for a while to work on other tasks, I got
back to it now and see that the good behavior I observed earlier was not
stable.

Recall: I work with a modified driver that allocates a page (4K) per
packet (MTU=1500), in order to simulate the stress on the page allocator
in 200Gbps NICs.

Performance is good as long as pages are available in the allocating
core's PCP.
The issue is that pages are allocated on one core and then freed on
another, making it hard for the PCP to work efficiently; both the
allocating core and the freeing core need to access the buddy allocator
very often.

I'd like to share with you some testing numbers:

Test: ./super_netperf 128 -H 24.134.0.51 -l 1000

100% cpu on all cores, top func in perf:
    84.98%  [kernel]             [k] queued_spin_lock_slowpath

system wide (all cores):
            1135941      kmem:mm_page_alloc
            2606629      kmem:mm_page_free
                  0      kmem:mm_page_alloc_extfrag
            4784616      kmem:mm_page_alloc_zone_locked
               1337      kmem:mm_page_free_batched
            6488213      kmem:mm_page_pcpu_drain
            8925503      net:napi_gro_receive_entry


Two types of cores:

A core mostly running napi (8 such cores):
             221875      kmem:mm_page_alloc
              17100      kmem:mm_page_free
                  0      kmem:mm_page_alloc_extfrag
             766584      kmem:mm_page_alloc_zone_locked
                 16      kmem:mm_page_free_batched
                 35      kmem:mm_page_pcpu_drain
            1340139      net:napi_gro_receive_entry


Other core, mostly running user application (40 such):
                  2      kmem:mm_page_alloc
              38922      kmem:mm_page_free
                  0      kmem:mm_page_alloc_extfrag
                  1      kmem:mm_page_alloc_zone_locked
                  8      kmem:mm_page_free_batched
             107289      kmem:mm_page_pcpu_drain
                 34      net:napi_gro_receive_entry


As you can see, the synchronization overhead is enormous.

PCP-wise, a key improvement in such scenarios would be to (1) keep and
handle the allocated page on the same CPU, or (2) somehow get the page
back to the allocating core's PCP via a fast path, without going through
the regular buddy allocator paths.
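
One direction for (2), as a rough idea only (hypothetical names, e.g.
built on ptr_ring): a small per-RX-ring return channel, so the freeing
core hands pages straight back to the NAPI core instead of bouncing them
through the buddy lists:

#include <linux/ptr_ring.h>
#include <linux/skbuff.h>       /* dev_alloc_page() */

/* Rough idea sketch (hypothetical, not an existing API): a lock-free
 * return channel per RX ring, assumed initialized with ptr_ring_init()
 * at ring setup time.  The core that frees the page produces it here;
 * the NAPI core consumes it on refill, so the page never hits the buddy
 * allocator in the common case. */
struct rx_page_return {
        struct ptr_ring ring;           /* holds struct page * */
};

/* Called by the freeing core instead of put_page() for RX pages. */
static bool rx_page_return_put(struct rx_page_return *r, struct page *page)
{
        return ptr_ring_produce_any(&r->ring, page) == 0;
}

/* Called by the NAPI core in the refill path. */
static struct page *rx_page_return_get(struct rx_page_return *r)
{
        struct page *page = ptr_ring_consume(&r->ring);

        return page ? page : dev_alloc_page();
}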

Regards,
Tariq

>>> I think that page allocator issues should be discussed separately: 1) 
>>> Rate: Increase the allocation rate on a single core. 2)
>>> Scalability: Reduce congestion and sync overhead between cores.
>>>
>>> This is clearly the current bottleneck in the network stack receive
>>> flow.
>>>
>>> I know about some efforts that were made in the past two years. For
>>> example the ones from Jesper et al.: - Page-pool (not accepted
>>> AFAIK).
>>
>> Indeed not and it would also need driver conversion.
>>
>>> - Page-allocation bulking.
>>
>> Prototypes exist but it's pointless without the pool or driver 
>> conversion so it's in the back burner for the moment.
>>
> 
> As I already mentioned in another reply (to Jesper), this would
> perfectly fit with our Striding RQ feature, as we have large descriptors
> that serve several packets, requiring the allocation of several pages at
> once. I'd gladly move to using the bulking API.
> 
>>> - Optimize order-0 allocations in Per-Cpu-Pages.
>>>
>>
>> This had a prototype that was reverted as it must be able to cope
>> with both irq and noirq contexts.
> Yeah, I remember that I tested and reported the issue.
> 
> Unfortunately I never found the time to
>> revisit it but a split there to handle both would mitigate the
>> problem. Probably not enough to actually reach line speed though so
>> tuning of the per-cpu allocator sizes would still be needed. I don't
>> know when I'll get the chance to revisit it. I'm travelling all next
>> week and am mostly occupied with other work at the moment that is
>> consuming all my concentration.
>>
>>> I am not an mm expert, but wanted to raise the issue again, to
>>> combine the efforts and hear from you guys about status and
>>> possible directions.
>>
>> The recent effort to reduce overhead from stats will help mitigate
>> the problem.
> I should get more familiar with these stats, check how costly they are, 
> and whether they can be turned off in Kconfig.
> 
>> Finishing the page pool, the bulk allocator and converting drivers 
>> would be the most likely successful path forward but it's currently
>> stalled as everyone that was previously involved is too busy.
>>
> I think we should consider changing the default allocation of PCP 
> fraction as well, or implement some smart dynamic heuristic.
> This turned on to have significant effect over networking performance.
> 
> Many thanks Mel!
> 
> Regards,
> Tariq

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
@ 2017-11-02 17:21       ` Tariq Toukan
  0 siblings, 0 replies; 31+ messages in thread
From: Tariq Toukan @ 2017-11-02 17:21 UTC (permalink / raw)
  To: Tariq Toukan, Linux Kernel Network Developers, linux-mm
  Cc: Mel Gorman, David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko



On 18/09/2017 12:16 PM, Tariq Toukan wrote:
> 
> 
> On 15/09/2017 1:23 PM, Mel Gorman wrote:
>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>>> Insights: Major degradation between #1 and #2, not getting any
>>> close to linerate! Degradation is fixed between #2 and #3. This is
>>> because page allocator cannot stand the higher allocation rate. In
>>> #2, we also see that the addition of rings (cores) reduces BW (!!),
>>> as result of increasing congestion over shared resources.
>>>
>>
>> Unfortunately, no surprises there.
>>
>>> Congestion in this case is very clear. When monitored in perf top: 
>>> 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>
>>
>> While it's not proven, the most likely candidate is the zone lock
>> and that should be confirmed using a call-graph profile. If so, then
>> the suggestion to tune to the size of the per-cpu allocator would
>> mitigate the problem.
>>
> Indeed, I tuned the per-cpu allocator and bottleneck is released.
> 

Hi all,

After setting this task aside for a while to work on other things, I got 
back to it now and see that the good behavior I observed earlier was not 
stable.

Recall: I work with a modified driver that allocates a page (4K) per 
packet (MTU=1500), in order to simulate the stress on page-allocator in 
200Gbps NICs.

Performance is good as long as pages are available in the allocating 
core's PCP.
The issue is that pages are allocated on one core and then freed on 
another, making it hard for the PCP to work efficiently; both the 
allocating core and the freeing core need to access the buddy allocator 
very often.

I'd like to share with you some testing numbers:

Test: ./super_netperf 128 -H 24.134.0.51 -l 1000

100% cpu on all cores, top func in perf:
    84.98%  [kernel]             [k] queued_spin_lock_slowpath

system wide (all cores)
            1135941      kmem:mm_page_alloc
            2606629      kmem:mm_page_free
                  0      kmem:mm_page_alloc_extfrag
            4784616      kmem:mm_page_alloc_zone_locked
               1337      kmem:mm_page_free_batched
            6488213      kmem:mm_page_pcpu_drain
            8925503      net:napi_gro_receive_entry


Two types of cores:
A core mostly running napi (8 such cores):
             221875      kmem:mm_page_alloc
              17100      kmem:mm_page_free
                  0      kmem:mm_page_alloc_extfrag
             766584      kmem:mm_page_alloc_zone_locked
                 16      kmem:mm_page_free_batched
                 35      kmem:mm_page_pcpu_drain
            1340139      net:napi_gro_receive_entry


Other core, mostly running user application (40 such):
                  2      kmem:mm_page_alloc
              38922      kmem:mm_page_free
                  0      kmem:mm_page_alloc_extfrag
                  1      kmem:mm_page_alloc_zone_locked
                  8      kmem:mm_page_free_batched
             107289      kmem:mm_page_pcpu_drain
                 34      net:napi_gro_receive_entry


As you can see, sync overhead is enormous.

PCP-wise, a key improvement in such scenarios would be to either (1) keep 
and handle the allocated page on the same cpu, or (2) somehow get the page 
back to the allocating core's PCP via a fast path, without going through 
the regular buddy allocator paths.

Regards,
Tariq

>>> I think that page allocator issues should be discussed separately: 1) 
>>> Rate: Increase the allocation rate on a single core. 2)
>>> Scalability: Reduce congestion and sync overhead between cores.
>>>
>>> This is clearly the current bottleneck in the network stack receive
>>> flow.
>>>
>>> I know about some efforts that were made in the past two years. For
>>> example the ones from Jesper et al.: - Page-pool (not accepted
>>> AFAIK).
>>
>> Indeed not and it would also need driver conversion.
>>
>>> - Page-allocation bulking.
>>
>> Prototypes exist but it's pointless without the pool or driver 
>> conversion so it's in the back burner for the moment.
>>
> 
> As I already mentioned in another reply (to Jesper), this would
> perfectly fit with our Striding RQ feature, as we have large descriptors
> that serve several packets, requiring the allocation of several pages at
> once. I'd gladly move to using the bulking API.
> 
>>> - Optimize order-0 allocations in Per-Cpu-Pages.
>>>
>>
>> This had a prototype that was reverted as it must be able to cope
>> with both irq and noirq contexts.
> Yeah, I remember that I tested and reported the issue.
> 
> Unfortunately I never found the time to
>> revisit it but a split there to handle both would mitigate the
>> problem. Probably not enough to actually reach line speed though so
>> tuning of the per-cpu allocator sizes would still be needed. I don't
>> know when I'll get the chance to revisit it. I'm travelling all next
>> week and am mostly occupied with other work at the moment that is
>> consuming all my concentration.
>>
>>> I am not an mm expert, but wanted to raise the issue again, to
>>> combine the efforts and hear from you guys about status and
>>> possible directions.
>>
>> The recent effort to reduce overhead from stats will help mitigate
>> the problem.
> I should get more familiar with these stats, check how costly they are, 
> and whether they can be turned off in Kconfig.
> 
>> Finishing the page pool, the bulk allocator and converting drivers 
>> would be the most likely successful path forward but it's currently
>> stalled as everyone that was previously involved is too busy.
>>
> I think we should consider changing the default allocation of PCP 
> fraction as well, or implement some smart dynamic heuristic.
> This turned on to have significant effect over networking performance.
> 
> Many thanks Mel!
> 
> Regards,
> Tariq


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-11-02 17:21       ` Tariq Toukan
  (?)
@ 2017-11-03 13:40       ` Mel Gorman
  2017-11-08  5:42           ` Tariq Toukan
  -1 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2017-11-03 13:40 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, David Miller,
	Jesper Dangaard Brouer, Eric Dumazet, Alexei Starovoitov,
	Saeed Mahameed, Eran Ben Elisha, Andrew Morton, Michal Hocko

On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
> 
> 
> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
> > 
> > 
> > On 15/09/2017 1:23 PM, Mel Gorman wrote:
> > > On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
> > > > Insights: Major degradation between #1 and #2, not getting any
> > > > close to linerate! Degradation is fixed between #2 and #3. This is
> > > > because page allocator cannot stand the higher allocation rate. In
> > > > #2, we also see that the addition of rings (cores) reduces BW (!!),
> > > > as result of increasing congestion over shared resources.
> > > > 
> > > 
> > > Unfortunately, no surprises there.
> > > 
> > > > Congestion in this case is very clear. When monitored in perf
> > > > top: 85.58% [kernel] [k] queued_spin_lock_slowpath
> > > > 
> > > 
> > > While it's not proven, the most likely candidate is the zone lock
> > > and that should be confirmed using a call-graph profile. If so, then
> > > the suggestion to tune to the size of the per-cpu allocator would
> > > mitigate the problem.
> > > 
> > Indeed, I tuned the per-cpu allocator and bottleneck is released.
> > 
> 
> Hi all,
> 
> After leaving this task for a while doing other tasks, I got back to it now
> and see that the good behavior I observed earlier was not stable.
> 
> Recall: I work with a modified driver that allocates a page (4K) per packet
> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
> NICs.
> 

There is almost nothing new in the data that hasn't been discussed before. The
suggestion to free on a remote per-cpu list would be expensive as it would
require per-cpu lists to have a lock for safe remote access.  However,
I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git .  It's an
unfinished prototype I worked on a few weeks ago. I was going to revisit
in about a month's time when 4.15-rc1 was out. I'd be interested in seeing
if it has a positive gain in normal page allocations without destroying
the performance of interrupt and softirq allocation contexts. The
interrupt/softirq context testing is crucial as that is something that
hurt us before when trying to improve page allocator performance.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-11-03 13:40       ` Mel Gorman
@ 2017-11-08  5:42           ` Tariq Toukan
  0 siblings, 0 replies; 31+ messages in thread
From: Tariq Toukan @ 2017-11-08  5:42 UTC (permalink / raw)
  To: Mel Gorman, Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, David Miller,
	Jesper Dangaard Brouer, Eric Dumazet, Alexei Starovoitov,
	Saeed Mahameed, Eran Ben Elisha, Andrew Morton, Michal Hocko



On 03/11/2017 10:40 PM, Mel Gorman wrote:
> On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
>>
>>
>> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
>>>
>>>
>>> On 15/09/2017 1:23 PM, Mel Gorman wrote:
>>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>>>>> Insights: Major degradation between #1 and #2, not getting any
>>>>> close to linerate! Degradation is fixed between #2 and #3. This is
>>>>> because page allocator cannot stand the higher allocation rate. In
>>>>> #2, we also see that the addition of rings (cores) reduces BW (!!),
>>>>> as result of increasing congestion over shared resources.
>>>>>
>>>>
>>>> Unfortunately, no surprises there.
>>>>
>>>>> Congestion in this case is very clear. When monitored in perf
>>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>>>
>>>>
>>>> While it's not proven, the most likely candidate is the zone lock
>>>> and that should be confirmed using a call-graph profile. If so, then
>>>> the suggestion to tune to the size of the per-cpu allocator would
>>>> mitigate the problem.
>>>>
>>> Indeed, I tuned the per-cpu allocator and bottleneck is released.
>>>
>>
>> Hi all,
>>
>> After leaving this task for a while doing other tasks, I got back to it now
>> and see that the good behavior I observed earlier was not stable.
>>
>> Recall: I work with a modified driver that allocates a page (4K) per packet
>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
>> NICs.
>>
> 
> There is almost new in the data that hasn't been discussed before. The
> suggestion to free on a remote per-cpu list would be expensive as it would
> require per-cpu lists to have a lock for safe remote access.
That's right, but each such lock will be significantly less congested 
than the buddy allocator lock. In the flow in question, two cores need to 
synchronize (one allocates, one frees).
We also need to evaluate the cost of acquiring and releasing the lock in 
the case of no congestion at all.
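
As a starting point for that last question, something as simple as the
following userspace loop gives a rough upper bound for the uncontended
acquire/release cost (a sketch only; a userspace pthread spinlock is not
the kernel's spinlock, so treat the result as an order-of-magnitude number):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
        pthread_spinlock_t lock;
        struct timespec t0, t1;
        const long iters = 100000000L;
        long i;
        double ns;

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++) {
                /* empty critical section: measures pure lock/unlock cost */
                pthread_spin_lock(&lock);
                pthread_spin_unlock(&lock);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("uncontended lock+unlock: %.1f ns\n", ns / iters);
        return 0;
}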

>  However,
> I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch
> ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git .  It's an
> unfinished prototype I worked on a few weeks ago. I was going to revisit
> in about a months time when 4.15-rc1 was out. I'd be interested in seeing
> if it has a postive gain in normal page allocations without destroying
> the performance of interrupt and softirq allocation contexts. The
> interrupt/softirq context testing is crucial as that is something that
> hurt us before when trying to improve page allocator performance.
> 
Yes, I will test that once I get back to the office (after the netdev conference 
and vacation).
Can you please elaborate in a few words about the idea behind the prototype?
Does it address page-allocator scalability issues, or only the rate of 
single core page allocations?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
@ 2017-11-08  5:42           ` Tariq Toukan
  0 siblings, 0 replies; 31+ messages in thread
From: Tariq Toukan @ 2017-11-08  5:42 UTC (permalink / raw)
  To: Mel Gorman, Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, David Miller,
	Jesper Dangaard Brouer, Eric Dumazet, Alexei Starovoitov,
	Saeed Mahameed, Eran Ben Elisha, Andrew Morton, Michal Hocko



On 03/11/2017 10:40 PM, Mel Gorman wrote:
> On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
>>
>>
>> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
>>>
>>>
>>> On 15/09/2017 1:23 PM, Mel Gorman wrote:
>>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>>>>> Insights: Major degradation between #1 and #2, not getting any
>>>>> close to linerate! Degradation is fixed between #2 and #3. This is
>>>>> because page allocator cannot stand the higher allocation rate. In
>>>>> #2, we also see that the addition of rings (cores) reduces BW (!!),
>>>>> as result of increasing congestion over shared resources.
>>>>>
>>>>
>>>> Unfortunately, no surprises there.
>>>>
>>>>> Congestion in this case is very clear. When monitored in perf
>>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>>>
>>>>
>>>> While it's not proven, the most likely candidate is the zone lock
>>>> and that should be confirmed using a call-graph profile. If so, then
>>>> the suggestion to tune to the size of the per-cpu allocator would
>>>> mitigate the problem.
>>>>
>>> Indeed, I tuned the per-cpu allocator and bottleneck is released.
>>>
>>
>> Hi all,
>>
>> After leaving this task for a while doing other tasks, I got back to it now
>> and see that the good behavior I observed earlier was not stable.
>>
>> Recall: I work with a modified driver that allocates a page (4K) per packet
>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
>> NICs.
>>
> 
> There is almost new in the data that hasn't been discussed before. The
> suggestion to free on a remote per-cpu list would be expensive as it would
> require per-cpu lists to have a lock for safe remote access.
That's right, but each such lock will be significantly less congested 
than the buddy allocator lock. In the flow in question, two cores need to 
synchronize (one allocates, one frees).
We also need to evaluate the cost of acquiring and releasing the lock in 
the case of no congestion at all.

>  However,
> I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch
> ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git .  It's an
> unfinished prototype I worked on a few weeks ago. I was going to revisit
> in about a months time when 4.15-rc1 was out. I'd be interested in seeing
> if it has a postive gain in normal page allocations without destroying
> the performance of interrupt and softirq allocation contexts. The
> interrupt/softirq context testing is crucial as that is something that
> hurt us before when trying to improve page allocator performance.
> 
Yes, I will test that once I get back to the office (after the netdev conference 
and vacation).
Can you please elaborate in a few words about the idea behind the prototype?
Does it address page-allocator scalability issues, or only the rate of 
single core page allocations?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-11-08  5:42           ` Tariq Toukan
  (?)
@ 2017-11-08  9:35           ` Mel Gorman
  2017-11-09  3:51             ` Figo.zhang
                               ` (2 more replies)
  -1 siblings, 3 replies; 31+ messages in thread
From: Mel Gorman @ 2017-11-08  9:35 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, David Miller,
	Jesper Dangaard Brouer, Eric Dumazet, Alexei Starovoitov,
	Saeed Mahameed, Eran Ben Elisha, Andrew Morton, Michal Hocko

On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote:
> > > Hi all,
> > > 
> > > After leaving this task for a while doing other tasks, I got back to it now
> > > and see that the good behavior I observed earlier was not stable.
> > > 
> > > Recall: I work with a modified driver that allocates a page (4K) per packet
> > > (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
> > > NICs.
> > > 
> > 
> > There is almost new in the data that hasn't been discussed before. The
> > suggestion to free on a remote per-cpu list would be expensive as it would
> > require per-cpu lists to have a lock for safe remote access.
>
> That's right, but each such lock will be significantly less congested than
> the buddy allocator lock.

That is not necessarily true if all the allocations and frees always happen
on the same CPUs. The contention will be equivalent to the zone lock.
Your point will only hold true if there are also heavy allocation streams
from other CPUs that are unrelated.

> In the flow in subject two cores need to
> synchronize (one allocates, one frees).
> We also need to evaluate the cost of acquiring and releasing the lock in the
> case of no congestion at all.
> 

If the per-cpu structures have a lock, there will be a light amount of
overhead. Nothing too severe, but it shouldn't be done lightly either.

> >  However,
> > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch
> > ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git .  It's an
> > unfinished prototype I worked on a few weeks ago. I was going to revisit
> > in about a months time when 4.15-rc1 was out. I'd be interested in seeing
> > if it has a postive gain in normal page allocations without destroying
> > the performance of interrupt and softirq allocation contexts. The
> > interrupt/softirq context testing is crucial as that is something that
> > hurt us before when trying to improve page allocator performance.
> > 
> Yes, I will test that once I get back in office (after netdev conference and
> vacation).

Thanks.

> Can you please elaborate in a few words about the idea behind the prototype?
> Does it address page-allocator scalability issues, or only the rate of
> single core page allocations?

Short answer -- maybe. All scalability issues or rates of allocation are
context and workload dependent, so the question is impossible to answer
for the general case.

Broadly speaking, the patch reintroduces per-cpu lists that serve !irq
context allocations only. The last time we did this, hard and soft IRQ
allocations went through the buddy allocator which couldn't scale and
the patch was reverted. With this patch, it goes through a very large
pagevec-like structure that is protected by a lock but the fast paths
for alloc/free are extremely simple operations so the lock hold times are
very small. Potentially, a development path is that the current per-cpu
allocator is replaced with pagevec-like structures that are dynamically
allocated which would also allow pages to be freed to remote CPU lists
(if we could detect when that is appropriate which is unclear). We could
also drain remote lists without using IPIs. The downside is that the memory
footprint of the allocator would be higher and the size could no longer
be tuned so there would need to be excellent justification for such a move.

I haven't posted the patches properly yet because mmotm is carrying too
many patches as it is and this patch indirectly depends on the contents. I
also didn't write memory hot-remove support which would be a requirement
before merging. I hadn't intended to put further effort into it until I
had some evidence the approach had promise. My own testing indicated it
worked but the drivers I was using for network tests did not allocate
intensely enough to show any major gain/loss.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-11-08  9:35           ` Mel Gorman
@ 2017-11-09  3:51             ` Figo.zhang
  2017-11-09  5:06             ` Tariq Toukan
  2017-11-09  5:21               ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 31+ messages in thread
From: Figo.zhang @ 2017-11-09  3:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Tariq Toukan, Linux Kernel Network Developers, linux-mm,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko


@Tariq, could some ideas be borrowed from DPDK to improve the
high-speed network card performance?
For example, a physical CPU dedicated to the RX and TX threads (no
context switches or interrupt latency), with the
memory prepared and allocated in advance.
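
For reference, the thread-pinning part of that suggestion can be sketched
in a few lines of userspace C (the rx_poll_loop() body is just a
placeholder, not a real driver poll loop, and the CPU number is arbitrary):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

static void *rx_poll_loop(void *arg)
{
        (void)arg;
        for (;;) {
                /* busy-poll the RX ring here: no interrupts, no
                 * context switches on this dedicated cpu */
        }
        return NULL;
}

int main(void)
{
        pthread_t rx;
        cpu_set_t set;

        pthread_create(&rx, NULL, rx_poll_loop, NULL);

        /* dedicate cpu 2 to the polling RX thread */
        CPU_ZERO(&set);
        CPU_SET(2, &set);
        pthread_setaffinity_np(rx, sizeof(set), &set);

        pthread_join(rx, NULL);
        return 0;
}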

2017-11-08 17:35 GMT+08:00 Mel Gorman <mgorman@techsingularity.net>:

> On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote:
> > > > Hi all,
> > > >
> > > > After leaving this task for a while doing other tasks, I got back to
> it now
> > > > and see that the good behavior I observed earlier was not stable.
> > > >
> > > > Recall: I work with a modified driver that allocates a page (4K) per
> packet
> > > > (MTU=1500), in order to simulate the stress on page-allocator in
> 200Gbps
> > > > NICs.
> > > >
> > >
> > > There is almost new in the data that hasn't been discussed before. The
> > > suggestion to free on a remote per-cpu list would be expensive as it
> would
> > > require per-cpu lists to have a lock for safe remote access.
> >
> > That's right, but each such lock will be significantly less congested
> than
> > the buddy allocator lock.
>
> That is not necessarily true if all the allocations and frees always happen
> on the same CPUs. The contention will be equivalent to the zone lock.
> Your point will only hold true if there are also heavy allocation streams
> from other CPUs that are unrelated.
>
> > In the flow in subject two cores need to
> > synchronize (one allocates, one frees).
> > We also need to evaluate the cost of acquiring and releasing the lock in
> the
> > case of no congestion at all.
> >
>
> If the per-cpu structures have a lock, there will be a light amount of
> overhead. Nothing too severe, but it shouldn't be done lightly either.
>
> > >  However,
> > > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch
> > > ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git .  It's
> an
> > > unfinished prototype I worked on a few weeks ago. I was going to
> revisit
> > > in about a months time when 4.15-rc1 was out. I'd be interested in
> seeing
> > > if it has a postive gain in normal page allocations without destroying
> > > the performance of interrupt and softirq allocation contexts. The
> > > interrupt/softirq context testing is crucial as that is something that
> > > hurt us before when trying to improve page allocator performance.
> > >
> > Yes, I will test that once I get back in office (after netdev conference
> and
> > vacation).
>
> Thanks.
>
> > Can you please elaborate in a few words about the idea behind the
> prototype?
> > Does it address page-allocator scalability issues, or only the rate of
> > single core page allocations?
>
> Short answer -- maybe. All scalability issues or rates of allocation are
> context and workload dependant so the question is impossible to answer
> for the general case.
>
> Broadly speaking, the patch reintroduces the per-cpu lists being for !irq
> context allocations again. The last time we did this, hard and soft IRQ
> allocations went through the buddy allocator which couldn't scale and
> the patch was reverted. With this patch, it goes through a very large
> pagevec-like structure that is protected by a lock but the fast paths
> for alloc/free are extremely simple operations so the lock hold times are
> very small. Potentially, a development path is that the current per-cpu
> allocator is replaced with pagevec-like structures that are dynamically
> allocated which would also allow pages to be freed to remote CPU lists
> (if we could detect when that is appropriate which is unclear). We could
> also drain remote lists without using IPIs. The downside is that the memory
> footprint of the allocator would be higher and the size could no longer
> be tuned so there would need to be excellent justification for such a move.
>
> I haven't posted the patches properly yet because mmotm is carrying too
> many patches as it is and this patch indirectly depends on the contents. I
> also didn't write memory hot-remove support which would be a requirement
> before merging. I hadn't intended to put further effort into it until I
> had some evidence the approach had promise. My own testing indicated it
> worked but the drivers I was using for network tests did not allocate
> intensely enough to show any major gain/loss.
>
> --
> Mel Gorman
> SUSE Labs
>
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-11-08  9:35           ` Mel Gorman
  2017-11-09  3:51             ` Figo.zhang
@ 2017-11-09  5:06             ` Tariq Toukan
  2017-11-09  5:21               ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 31+ messages in thread
From: Tariq Toukan @ 2017-11-09  5:06 UTC (permalink / raw)
  To: Mel Gorman, Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, David Miller,
	Jesper Dangaard Brouer, Eric Dumazet, Alexei Starovoitov,
	Saeed Mahameed, Eran Ben Elisha, Andrew Morton, Michal Hocko



On 08/11/2017 6:35 PM, Mel Gorman wrote:
> On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote:
>>>> Hi all,
>>>>
>>>> After leaving this task for a while doing other tasks, I got back to it now
>>>> and see that the good behavior I observed earlier was not stable.
>>>>
>>>> Recall: I work with a modified driver that allocates a page (4K) per packet
>>>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
>>>> NICs.
>>>>
>>>
>>> There is almost new in the data that hasn't been discussed before. The
>>> suggestion to free on a remote per-cpu list would be expensive as it would
>>> require per-cpu lists to have a lock for safe remote access.
>>
>> That's right, but each such lock will be significantly less congested than
>> the buddy allocator lock.
> 
> That is not necessarily true if all the allocations and frees always happen
> on the same CPUs. The contention will be equivalent to the zone lock.
> Your point will only hold true if there are also heavy allocation streams
> from other CPUs that are unrelated.

That's exactly the case.
I saw no issues when working with a single core allocating pages (and 
many others consuming the SKBs); this does not stress the buddy 
allocator enough to expose the problem.
On my server, the problem becomes visible when working with >= 4 
allocator cores (RX rings).
So "distributing" the locks between the different PCPs and doing 
remote-free (instead of using the centralized buddy allocator lock) 
would give a huge performance gain under high load (although it might 
cause a slight degradation when load is low).

> 
>> In the flow in subject two cores need to
>> synchronize (one allocates, one frees).
>> We also need to evaluate the cost of acquiring and releasing the lock in the
>> case of no congestion at all.
>>
> 
> If the per-cpu structures have a lock, there will be a light amount of
> overhead. Nothing too severe, but it shouldn't be done lightly either.
> 
If the trade-off is a huge gain under load, it might be worth it.

>>>   However,
>>> I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch
>>> ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git .  It's an
>>> unfinished prototype I worked on a few weeks ago. I was going to revisit
>>> in about a months time when 4.15-rc1 was out. I'd be interested in seeing
>>> if it has a postive gain in normal page allocations without destroying
>>> the performance of interrupt and softirq allocation contexts. The
>>> interrupt/softirq context testing is crucial as that is something that
>>> hurt us before when trying to improve page allocator performance.
>>>
>> Yes, I will test that once I get back in office (after netdev conference and
>> vacation).
> 
> Thanks.
> 
>> Can you please elaborate in a few words about the idea behind the prototype?
>> Does it address page-allocator scalability issues, or only the rate of
>> single core page allocations?
> 
> Short answer -- maybe. All scalability issues or rates of allocation are
> context and workload dependant so the question is impossible to answer
> for the general case.
> 
> Broadly speaking, the patch reintroduces the per-cpu lists being for !irq
> context allocations again. The last time we did this, hard and soft IRQ
> allocations went through the buddy allocator which couldn't scale and
> the patch was reverted. With this patch, it goes through a very large
> pagevec-like structure that is protected by a lock but the fast paths
> for alloc/free are extremely simple operations so the lock hold times are
> very small. Potentially, a development path is that the current per-cpu
> allocator is replaced with pagevec-like structures that are dynamically
> allocated which would also allow pages to be freed to remote CPU lists
> (if we could detect when that is appropriate which is unclear). We could
> also drain remote lists without using IPIs. The downside is that the memory
> footprint of the allocator would be higher and the size could no longer
> be tuned so there would need to be excellent justification for such a move.
> 
> I haven't posted the patches properly yet because mmotm is carrying too
> many patches as it is and this patch indirectly depends on the contents. I
> also didn't write memory hot-remove support which would be a requirement
> before merging. I hadn't intended to put further effort into it until I
> had some evidence the approach had promise. My own testing indicated it
> worked but the drivers I was using for network tests did not allocate
> intensely enough to show any major gain/loss.
> 
Thanks for the description. This sounds intriguing.
Once I get to testing it, I'll magnify the effect by stressing the 
page-allocator the same way I did earlier to simulate a load of 200Gbps.

Regards,
Tariq


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-11-08  9:35           ` Mel Gorman
@ 2017-11-09  5:21               ` Jesper Dangaard Brouer
  2017-11-09  5:06             ` Tariq Toukan
  2017-11-09  5:21               ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-11-09  5:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Tariq Toukan, Linux Kernel Network Developers, linux-mm,
	David Miller, Eric Dumazet, Alexei Starovoitov, Saeed Mahameed,
	Eran Ben Elisha, Andrew Morton, Michal Hocko, brouer,
	Michael S. Tsirkin

On Wed, 8 Nov 2017 09:35:47 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote:
> > > > Hi all,
> > > > 
> > > > After leaving this task for a while doing other tasks, I got back to it now
> > > > and see that the good behavior I observed earlier was not stable.
> > > > 
> > > > Recall: I work with a modified driver that allocates a page (4K) per packet
> > > > (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
> > > > NICs.
> > > >   
> > > 
> > > There is almost new in the data that hasn't been discussed before. The
> > > suggestion to free on a remote per-cpu list would be expensive as it would
> > > require per-cpu lists to have a lock for safe remote access.  
> >
> > That's right, but each such lock will be significantly less congested than
> > the buddy allocator lock.  
> 
> That is not necessarily true if all the allocations and frees always happen
> on the same CPUs. The contention will be equivalent to the zone lock.
> Your point will only hold true if there are also heavy allocation streams
> from other CPUs that are unrelated.
> 
> > In the flow in subject two cores need to
> > synchronize (one allocates, one frees).
> > We also need to evaluate the cost of acquiring and releasing the lock in the
> > case of no congestion at all.
> >   
> 
> If the per-cpu structures have a lock, there will be a light amount of
> overhead. Nothing too severe, but it shouldn't be done lightly either.
> 
> > >  However,
> > > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch
> > > ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git .  It's an
> > > unfinished prototype I worked on a few weeks ago. I was going to revisit
> > > in about a months time when 4.15-rc1 was out. I'd be interested in seeing
> > > if it has a postive gain in normal page allocations without destroying
> > > the performance of interrupt and softirq allocation contexts. The
> > > interrupt/softirq context testing is crucial as that is something that
> > > hurt us before when trying to improve page allocator performance.
> > >   
> > Yes, I will test that once I get back in office (after netdev conference and
> > vacation).  
> 
> Thanks.

I'll also commit to testing this (when I return home; like Tariq, I'm also
in Seoul ATM).

 
> > Can you please elaborate in a few words about the idea behind the prototype?
> > Does it address page-allocator scalability issues, or only the rate of
> > single core page allocations?  
> 
> Short answer -- maybe. All scalability issues or rates of allocation are
> context and workload dependant so the question is impossible to answer
> for the general case.
> 
> Broadly speaking, the patch reintroduces the per-cpu lists being for !irq
> context allocations again. The last time we did this, hard and soft IRQ
> allocations went through the buddy allocator which couldn't scale and
> the patch was reverted. With this patch, it goes through a very large
> pagevec-like structure that is protected by a lock but the fast paths
> for alloc/free are extremely simple operations so the lock hold times are
> very small. Potentially, a development path is that the current per-cpu
> allocator is replaced with pagevec-like structures that are dynamically
> allocated which would also allow pages to be freed to remote CPU lists

I've had huge success using ptr_ring as a queue between CPUs to
minimize cross-CPU cache-line touching, with the recently accepted BPF
map called "cpumap" used for XDP_REDIRECT.

It's important to handle the two borderline cases in ptr_ring, of the
queue being almost full (handled by default in ptr_ring) or almost empty,
as described in [1], slide 14:

[1] http://people.netfilter.org/hawk/presentations/NetConf2017_Seoul/XDP_devel_update_NetConf2017_Seoul.pdf

The use of XDP_REDIRECT + cpumap does expose issues with the page
allocator.  E.g. slide 19 shows the ixgbe recycle scheme failing, but still
hitting the PCP.  Also notice slide 22 deducing the overhead.  Scale
stressing of ptr_ring is shown in extra slides 35-39.
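
For readers unfamiliar with the idea, the core of such a
single-producer/single-consumer pointer ring can be sketched in userspace
as below. This is not the kernel's ptr_ring API; the size and names are
invented, and the almost-full/almost-empty handling mentioned above is
left out:

#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE       512     /* must be a power of two */

struct spsc_ring {
        _Atomic size_t head;    /* written only by the producer */
        _Atomic size_t tail;    /* written only by the consumer */
        void *slot[RING_SIZE];
};

/* Producer side, e.g. the RX/NAPI core enqueuing objects. */
static int ring_produce(struct spsc_ring *r, void *p)
{
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SIZE)
                return -1;      /* full: the caller must handle this */
        r->slot[head & (RING_SIZE - 1)] = p;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 0;
}

/* Consumer side, e.g. the application core dequeuing objects. */
static void *ring_consume(struct spsc_ring *r)
{
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        void *p;

        if (tail == head)
                return NULL;    /* empty: the other borderline case */
        p = r->slot[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return p;
}

int main(void)
{
        static struct spsc_ring ring;
        int x = 42;

        ring_produce(&ring, &x);
        return ring_consume(&ring) == &x ? 0 : 1;
}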


> (if we could detect when that is appropriate which is unclear). We could
> also drain remote lists without using IPIs. The downside is that the memory
> footprint of the allocator would be higher and the size could no longer
> be tuned so there would need to be excellent justification for such a move.
> 
> I haven't posted the patches properly yet because mmotm is carrying too
> many patches as it is and this patch indirectly depends on the contents. I
> also didn't write memory hot-remove support which would be a requirement
> before merging. I hadn't intended to put further effort into it until I
> had some evidence the approach had promise. My own testing indicated it
> worked but the drivers I was using for network tests did not allocate
> intensely enough to show any major gain/loss.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
@ 2017-11-09  5:21               ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-11-09  5:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Tariq Toukan, Linux Kernel Network Developers, linux-mm,
	David Miller, Eric Dumazet, Alexei Starovoitov, Saeed Mahameed,
	Eran Ben Elisha, Andrew Morton, Michal Hocko, brouer,
	Michael S. Tsirkin

On Wed, 8 Nov 2017 09:35:47 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Nov 08, 2017 at 02:42:04PM +0900, Tariq Toukan wrote:
> > > > Hi all,
> > > > 
> > > > After leaving this task for a while doing other tasks, I got back to it now
> > > > and see that the good behavior I observed earlier was not stable.
> > > > 
> > > > Recall: I work with a modified driver that allocates a page (4K) per packet
> > > > (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
> > > > NICs.
> > > >   
> > > 
> > > There is almost new in the data that hasn't been discussed before. The
> > > suggestion to free on a remote per-cpu list would be expensive as it would
> > > require per-cpu lists to have a lock for safe remote access.  
> >
> > That's right, but each such lock will be significantly less congested than
> > the buddy allocator lock.  
> 
> That is not necessarily true if all the allocations and frees always happen
> on the same CPUs. The contention will be equivalent to the zone lock.
> Your point will only hold true if there are also heavy allocation streams
> from other CPUs that are unrelated.
> 
> > In the flow in subject two cores need to
> > synchronize (one allocates, one frees).
> > We also need to evaluate the cost of acquiring and releasing the lock in the
> > case of no congestion at all.
> >   
> 
> If the per-cpu structures have a lock, there will be a light amount of
> overhead. Nothing too severe, but it shouldn't be done lightly either.
> 
> > >  However,
> > > I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch
> > > ttps://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git .  It's an
> > > unfinished prototype I worked on a few weeks ago. I was going to revisit
> > > in about a months time when 4.15-rc1 was out. I'd be interested in seeing
> > > if it has a postive gain in normal page allocations without destroying
> > > the performance of interrupt and softirq allocation contexts. The
> > > interrupt/softirq context testing is crucial as that is something that
> > > hurt us before when trying to improve page allocator performance.
> > >   
> > Yes, I will test that once I get back in office (after netdev conference and
> > vacation).  
> 
> Thanks.

I'll also commit to testing this (when I return home; like Tariq, I'm also
in Seoul ATM).

 
> > Can you please elaborate in a few words about the idea behind the prototype?
> > Does it address page-allocator scalability issues, or only the rate of
> > single core page allocations?  
> 
> Short answer -- maybe. All scalability issues or rates of allocation are
> context and workload dependant so the question is impossible to answer
> for the general case.
> 
> Broadly speaking, the patch reintroduces the per-cpu lists being for !irq
> context allocations again. The last time we did this, hard and soft IRQ
> allocations went through the buddy allocator which couldn't scale and
> the patch was reverted. With this patch, it goes through a very large
> pagevec-like structure that is protected by a lock but the fast paths
> for alloc/free are extremely simple operations so the lock hold times are
> very small. Potentially, a development path is that the current per-cpu
> allocator is replaced with pagevec-like structures that are dynamically
> allocated which would also allow pages to be freed to remote CPU lists

I've had huge success using ptr_ring as a queue between CPUs to
minimize cross-CPU cache-line touching, with the recently accepted BPF
map called "cpumap" used for XDP_REDIRECT.

It's important to handle the two borderline cases in ptr_ring, of the
queue being almost full (handled by default in ptr_ring) or almost empty,
as described in [1], slide 14:

[1] http://people.netfilter.org/hawk/presentations/NetConf2017_Seoul/XDP_devel_update_NetConf2017_Seoul.pdf

The use of XDP_REDIRECT + cpumap does expose issues with the page
allocator.  E.g. slide 19 shows the ixgbe recycle scheme failing, but still
hitting the PCP.  Also notice slide 22 deducing the overhead.  Scale
stressing of ptr_ring is shown in extra slides 35-39.


> (if we could detect when that is appropriate which is unclear). We could
> also drain remote lists without using IPIs. The downside is that the memory
> footprint of the allocator would be higher and the size could no longer
> be tuned so there would need to be excellent justification for such a move.
> 
> I haven't posted the patches properly yet because mmotm is carrying too
> many patches as it is and this patch indirectly depends on the contents. I
> also didn't write memory hot-remove support which would be a requirement
> before merging. I hadn't intended to put further effort into it until I
> had some evidence the approach had promise. My own testing indicated it
> worked but the drivers I was using for network tests did not allocate
> intensely enough to show any major gain/loss.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2017-11-02 17:21       ` Tariq Toukan
  (?)
  (?)
@ 2018-04-21  8:15       ` Aaron Lu
  2018-04-22 16:43         ` Tariq Toukan
  -1 siblings, 1 reply; 31+ messages in thread
From: Aaron Lu @ 2018-04-21  8:15 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko

Sorry to bring up an old thread...

On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
> 
> 
> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
> > 
> > 
> > On 15/09/2017 1:23 PM, Mel Gorman wrote:
> > > On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
> > > > Insights: Major degradation between #1 and #2, not getting any
> > > > close to linerate! Degradation is fixed between #2 and #3. This is
> > > > because page allocator cannot stand the higher allocation rate. In
> > > > #2, we also see that the addition of rings (cores) reduces BW (!!),
> > > > as result of increasing congestion over shared resources.
> > > > 
> > > 
> > > Unfortunately, no surprises there.
> > > 
> > > > Congestion in this case is very clear. When monitored in perf
> > > > top: 85.58% [kernel] [k] queued_spin_lock_slowpath
> > > > 
> > > 
> > > While it's not proven, the most likely candidate is the zone lock
> > > and that should be confirmed using a call-graph profile. If so, then
> > > the suggestion to tune to the size of the per-cpu allocator would
> > > mitigate the problem.
> > > 
> > Indeed, I tuned the per-cpu allocator and bottleneck is released.
> > 
> 
> Hi all,
> 
> After leaving this task for a while doing other tasks, I got back to it now
> and see that the good behavior I observed earlier was not stable.

I recently posted a patchset to improve zone->lock contention for order-0
pages. It can eliminate almost 80% of the zone->lock contention for the
will-it-scale/page_fault1 testcase when tested on a 2-socket Intel
Skylake server, and it doesn't require tuning the PCP size, so it should
have some effect on your workload where one CPU does the allocation while
another does the freeing.

It did this via some disruptive changes (roughly sketched below):
1) on the free path, it skips merging (so it could be bad for mixed
   workloads where both 4K and high-order pages are needed);
2) on the allocation path, it avoids touching multiple cachelines.
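
A toy illustration of change (1) -- not the actual patch, just the shape
of the trade-off (O(1) frees versus no coalescing into higher orders):

#include <stddef.h>

#define MAX_ORDER       11

struct free_page {
        struct free_page *next;
};

/* one free list per order, as in a simplified buddy allocator */
static struct free_page *free_list[MAX_ORDER];

/* Skip-merge free: O(1), touches only the order-0 list head. */
static void free_page_no_merge(struct free_page *page)
{
        page->next = free_list[0];
        free_list[0] = page;
}

/*
 * A merging free would instead look up the buddy of 'page'; if the
 * buddy is also free it would unlink it, combine the pair and retry
 * at order + 1 -- touching more cachelines and list heads per free,
 * but keeping high-order pages available for other users.
 */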

RFC v2 patchset:
https://lkml.org/lkml/2018/3/20/171

repo:
https://github.com/aaronlu/linux zone_lock_rfc_v2

 
> Recall: I work with a modified driver that allocates a page (4K) per packet
> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
> NICs.
> 
> Performance is good as long as pages are available in the allocating cores's
> PCP.
> Issue is that pages are allocated in one core, then free'd in another,
> making it's hard for the PCP to work efficiently, and both the allocator
> core and the freeing core need to access the buddy allocator very often.
> 
> I'd like to share with you some testing numbers:
> 
> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000
> 
> 100% cpu on all cores, top func in perf:
>    84.98%  [kernel]             [k] queued_spin_lock_slowpath
> 
> system wide (all cores)
>            1135941      kmem:mm_page_alloc
> 
>            2606629      kmem:mm_page_free
> 
>                  0      kmem:mm_page_alloc_extfrag
>            4784616      kmem:mm_page_alloc_zone_locked
> 
>               1337      kmem:mm_page_free_batched
> 
>            6488213      kmem:mm_page_pcpu_drain
> 
>            8925503      net:napi_gro_receive_entry
> 
> 
> Two types of cores:
> A core mostly running napi (8 such cores):
>             221875      kmem:mm_page_alloc
> 
>              17100      kmem:mm_page_free
> 
>                  0      kmem:mm_page_alloc_extfrag
>             766584      kmem:mm_page_alloc_zone_locked
> 
>                 16      kmem:mm_page_free_batched
> 
>                 35      kmem:mm_page_pcpu_drain
> 
>            1340139      net:napi_gro_receive_entry
> 
> 
> Other core, mostly running user application (40 such):
>                  2      kmem:mm_page_alloc
> 
>              38922      kmem:mm_page_free
> 
>                  0      kmem:mm_page_alloc_extfrag
>                  1      kmem:mm_page_alloc_zone_locked
> 
>                  8      kmem:mm_page_free_batched
> 
>             107289      kmem:mm_page_pcpu_drain
> 
>                 34      net:napi_gro_receive_entry
> 
> 
> As you can see, sync overhead is enormous.
> 
> PCP-wise, a key improvement in such scenarios would be reached if we could
> (1) keep and handle the allocated page on same cpu, or (2) somehow get the
> page back to the allocating core's PCP in a fast-path, without going through
> the regular buddy allocator paths.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2018-04-21  8:15       ` Aaron Lu
@ 2018-04-22 16:43         ` Tariq Toukan
  2018-04-23  8:54             ` Tariq Toukan
  0 siblings, 1 reply; 31+ messages in thread
From: Tariq Toukan @ 2018-04-22 16:43 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko



On 21/04/2018 11:15 AM, Aaron Lu wrote:
> Sorry to bring up an old thread...
> 

I want to thank you very much for bringing this up!

> On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
>>
>>
>> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
>>>
>>>
>>> On 15/09/2017 1:23 PM, Mel Gorman wrote:
>>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>>>>> Insights: Major degradation between #1 and #2, not getting any
>>>>> close to linerate! Degradation is fixed between #2 and #3. This is
>>>>> because page allocator cannot stand the higher allocation rate. In
>>>>> #2, we also see that the addition of rings (cores) reduces BW (!!),
>>>>> as result of increasing congestion over shared resources.
>>>>>
>>>>
>>>> Unfortunately, no surprises there.
>>>>
>>>>> Congestion in this case is very clear. When monitored in perf
>>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>>>
>>>>
>>>> While it's not proven, the most likely candidate is the zone lock
>>>> and that should be confirmed using a call-graph profile. If so, then
>>>> the suggestion to tune to the size of the per-cpu allocator would
>>>> mitigate the problem.
>>>>
>>> Indeed, I tuned the per-cpu allocator and bottleneck is released.
>>>
>>
>> Hi all,
>>
>> After leaving this task for a while doing other tasks, I got back to it now
>> and see that the good behavior I observed earlier was not stable.
> 
> I posted a patchset to improve zone->lock contention for order-0 pages
> recently, it can almost eliminate 80% zone->lock contention for
> will-it-scale/page_fault1 testcase when tested on a 2 sockets Intel
> Skylake server and it doesn't require PCP size tune, so should have
> some effects on your workload where one CPU does allocation while
> another does free.
> 

That is great news. In our driver's memory scheme (and many others as 
well) we allocate only order-0 pages (the only flow that does not do 
that yet upstream will do so very soon; we already have the patches 
in our internal branch).
Allocation of order-0 pages is not only the common case, but the only 
type of allocation in our data path. Let's optimize it!


> It did this by some disruptive changes:
> 1 on free path, it skipped doing merge(so could be bad for mixed
>    workloads where both 4K and high order pages are needed);

I think there are so many advantages to not using high-order 
allocations, especially on production servers that are not rebooted for 
long periods and become fragmented.
AFAIK, the community direction (at least in networking) is to use order-0 
pages in the datapath, so optimizing their allocation is a very good idea. 
Of course, we need to evaluate possible degradations with perf, and see 
how important those use cases are.

> 2 on allocation path, it avoided touching multiple cachelines.
> 

Great!

> RFC v2 patchset:
> https://lkml.org/lkml/2018/3/20/171
> 
> repo:
> https://github.com/aaronlu/linux zone_lock_rfc_v2
> 

I will check them out first thing tomorrow!

p.s., I will be on vacation for a week starting Tuesday.
I hope I can make some progress before that :)

Thanks,
Tariq

>   
>> Recall: I work with a modified driver that allocates a page (4K) per packet
>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
>> NICs.
>>
>> Performance is good as long as pages are available in the allocating cores's
>> PCP.
>> Issue is that pages are allocated in one core, then free'd in another,
>> making it's hard for the PCP to work efficiently, and both the allocator
>> core and the freeing core need to access the buddy allocator very often.
>>
>> I'd like to share with you some testing numbers:
>>
>> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000
>>
>> 100% cpu on all cores, top func in perf:
>>     84.98%  [kernel]             [k] queued_spin_lock_slowpath
>>
>> system wide (all cores)
>>             1135941      kmem:mm_page_alloc
>>
>>             2606629      kmem:mm_page_free
>>
>>                   0      kmem:mm_page_alloc_extfrag
>>             4784616      kmem:mm_page_alloc_zone_locked
>>
>>                1337      kmem:mm_page_free_batched
>>
>>             6488213      kmem:mm_page_pcpu_drain
>>
>>             8925503      net:napi_gro_receive_entry
>>
>>
>> Two types of cores:
>> A core mostly running napi (8 such cores):
>>              221875      kmem:mm_page_alloc
>>
>>               17100      kmem:mm_page_free
>>
>>                   0      kmem:mm_page_alloc_extfrag
>>              766584      kmem:mm_page_alloc_zone_locked
>>
>>                  16      kmem:mm_page_free_batched
>>
>>                  35      kmem:mm_page_pcpu_drain
>>
>>             1340139      net:napi_gro_receive_entry
>>
>>
>> Other core, mostly running user application (40 such):
>>                   2      kmem:mm_page_alloc
>>
>>               38922      kmem:mm_page_free
>>
>>                   0      kmem:mm_page_alloc_extfrag
>>                   1      kmem:mm_page_alloc_zone_locked
>>
>>                   8      kmem:mm_page_free_batched
>>
>>              107289      kmem:mm_page_pcpu_drain
>>
>>                  34      net:napi_gro_receive_entry
>>
>>
>> As you can see, sync overhead is enormous.
>>
>> PCP-wise, a key improvement in such scenarios would be reached if we could
>> (1) keep and handle the allocated page on same cpu, or (2) somehow get the
>> page back to the allocating core's PCP in a fast-path, without going through
>> the regular buddy allocator paths.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2018-04-22 16:43         ` Tariq Toukan
@ 2018-04-23  8:54             ` Tariq Toukan
  0 siblings, 0 replies; 31+ messages in thread
From: Tariq Toukan @ 2018-04-23  8:54 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko



On 22/04/2018 7:43 PM, Tariq Toukan wrote:
> 
> 
> On 21/04/2018 11:15 AM, Aaron Lu wrote:
>> Sorry to bring up an old thread...
>>
> 
> I want to thank you very much for bringing this up!
> 
>> On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
>>>
>>>
>>> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
>>>>
>>>>
>>>> On 15/09/2017 1:23 PM, Mel Gorman wrote:
>>>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>>>>>> Insights: Major degradation between #1 and #2, not getting any
>>>>>> close to linerate! Degradation is fixed between #2 and #3. This is
>>>>>> because page allocator cannot stand the higher allocation rate. In
>>>>>> #2, we also see that the addition of rings (cores) reduces BW (!!),
>>>>>> as result of increasing congestion over shared resources.
>>>>>>
>>>>>
>>>>> Unfortunately, no surprises there.
>>>>>
>>>>>> Congestion in this case is very clear. When monitored in perf
>>>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>>>>
>>>>>
>>>>> While it's not proven, the most likely candidate is the zone lock
>>>>> and that should be confirmed using a call-graph profile. If so, then
>>>>> the suggestion to tune to the size of the per-cpu allocator would
>>>>> mitigate the problem.
>>>>>
>>>> Indeed, I tuned the per-cpu allocator and bottleneck is released.
>>>>
>>>
>>> Hi all,
>>>
>>> After leaving this task for a while doing other tasks, I got back to 
>>> it now
>>> and see that the good behavior I observed earlier was not stable.
>>
>> I recently posted a patchset to improve zone->lock contention for order-0
>> pages. It can eliminate almost 80% of the zone->lock contention for the
>> will-it-scale/page_fault1 testcase when tested on a 2-socket Intel
>> Skylake server, and it doesn't require PCP size tuning, so it should have
>> some effect on your workload where one CPU does the allocation while
>> another does the freeing.
>>
> 
> That is great news. In our driver's memory scheme (and many others as 
> well) we allocate only order-0 pages (the only flow that does not yet do 
> so upstream will very soon; we already have the patches in our internal 
> branch).
> Allocation of order-0 pages is not only the common case, but the only 
> type of allocation in our data-path. Let's optimize it!
> 
> 
>> It did this with some disruptive changes:
>> 1 on the free path, it skipped doing the merge (so it could be bad for mixed
>>    workloads where both 4K and high-order pages are needed);
> 
> I think there are many advantages to not using high-order 
> allocations, especially in production servers that are not rebooted for 
> long periods and become fragmented.
> AFAIK, the community direction (at least in networking) is to use order-0 
> pages in the datapath, so optimizing their allocation is a very good idea. 
> We of course need to evaluate possible performance degradations, and see 
> how important those use cases are.
> 
>> 2 on the allocation path, it avoided touching multiple cachelines.
>>
> 
> Great!
> 
>> RFC v2 patchset:
>> https://lkml.org/lkml/2018/3/20/171
>>
>> repo:
>> https://github.com/aaronlu/linux zone_lock_rfc_v2
>>
> 
> I will check them out first thing tomorrow!
> 
> p.s., I will be on vacation for a week starting Tuesday.
> I hope I can make some progress before that :)
> 
> Thanks,
> Tariq
> 

Hi,

I ran my tests with your patches.
Initial BW numbers are significantly higher than I documented back then 
in this mail-thread.
For example, in driver #2 (see original mail thread), with 6 rings, I 
now get 92Gbps (slightly less than linerate) in comparison to 64Gbps 
back then.

However, there have been many kernel changes since then, so I need to 
isolate your changes. I am not sure I can finish this today, but I will 
surely get to it next week after I'm back from vacation.

Still, when I increase the scale (more rings, i.e. more cpus), I see 
that queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower 
than it used to be.
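
To tell whether that remaining lock time is spent on the allocation path or
on the free path, a call-graph profile over the run should show it; a minimal
sketch (duration illustrative), then look at the callers of
queued_spin_lock_slowpath in the report:

  perf record -a -g -- sleep 10
  perf report --no-children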

This should be solved at the root by the (orthogonal) changes planned in 
the network subsystem, which will change the SKB allocation/free scheme so 
that SKBs are released on their originating cpu.

Thanks,
Tariq

>>> Recall: I work with a modified driver that allocates a page (4K) per 
>>> packet
>>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
>>> NICs.
>>>
>>> Performance is good as long as pages are available in the allocating
>>> core's PCP.
>>> The issue is that pages are allocated on one core, then freed on another,
>>> making it hard for the PCP to work efficiently, and both the allocating
>>> core and the freeing core need to access the buddy allocator very often.
>>>
>>> I'd like to share with you some testing numbers:
>>>
>>> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000
>>>
>>> 100% cpu on all cores, top func in perf:
>>>     84.98%  [kernel]             [k] queued_spin_lock_slowpath
>>>
>>> system wide (all cores)
>>>             1135941      kmem:mm_page_alloc
>>>
>>>             2606629      kmem:mm_page_free
>>>
>>>                   0      kmem:mm_page_alloc_extfrag
>>>             4784616      kmem:mm_page_alloc_zone_locked
>>>
>>>                1337      kmem:mm_page_free_batched
>>>
>>>             6488213      kmem:mm_page_pcpu_drain
>>>
>>>             8925503      net:napi_gro_receive_entry
>>>
>>>
>>> Two types of cores:
>>> A core mostly running napi (8 such cores):
>>>              221875      kmem:mm_page_alloc
>>>
>>>               17100      kmem:mm_page_free
>>>
>>>                   0      kmem:mm_page_alloc_extfrag
>>>              766584      kmem:mm_page_alloc_zone_locked
>>>
>>>                  16      kmem:mm_page_free_batched
>>>
>>>                  35      kmem:mm_page_pcpu_drain
>>>
>>>             1340139      net:napi_gro_receive_entry
>>>
>>>
>>> Other core, mostly running user application (40 such):
>>>                   2      kmem:mm_page_alloc
>>>
>>>               38922      kmem:mm_page_free
>>>
>>>                   0      kmem:mm_page_alloc_extfrag
>>>                   1      kmem:mm_page_alloc_zone_locked
>>>
>>>                   8      kmem:mm_page_free_batched
>>>
>>>              107289      kmem:mm_page_pcpu_drain
>>>
>>>                  34      net:napi_gro_receive_entry
>>>
>>>
>>> As you can see, sync overhead is enormous.
>>>
>>> PCP-wise, a key improvement in such scenarios would be reached if we 
>>> could
>>> (1) keep and handle the allocated page on same cpu, or (2) somehow 
>>> get the
>>> page back to the allocating core's PCP in a fast-path, without going 
>>> through
>>> the regular buddy allocator paths.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2018-04-23  8:54             ` Tariq Toukan
@ 2018-04-23 13:10             ` Aaron Lu
  2018-04-27  8:45               ` Aaron Lu
  -1 siblings, 1 reply; 31+ messages in thread
From: Aaron Lu @ 2018-04-23 13:10 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko

On Mon, Apr 23, 2018 at 11:54:57AM +0300, Tariq Toukan wrote:
> Hi,
> 
> I ran my tests with your patches.
> Initial BW numbers are significantly higher than I documented back then in
> this mail-thread.
> For example, in driver #2 (see original mail thread), with 6 rings, I now
> get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then.
> 
> However, there have been many kernel changes since then, so I need to
> isolate your changes. I am not sure I can finish this today, but I will
> surely get to it next week after I'm back from vacation.
> 
> Still, when I increase the scale (more rings, i.e. more cpus), I see that
> queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it
> used to be.

I wonder if it is on allocation path or free path?

Also, increasing the PCP size through vm.percpu_pagelist_fraction would
still help with my patches, since a higher PCP->batch avoids touching even
more cache lines on the allocation path (though PCP->batch currently has an
upper limit of 96).
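
For completeness, a minimal sketch of that tuning (the value is illustrative;
8 is the smallest accepted value and gives the largest per-cpu high mark):

  sysctl -w vm.percpu_pagelist_fraction=8
  # equivalently:
  echo 8 > /proc/sys/vm/percpu_pagelist_fraction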

> 
> This should be solved at the root by the (orthogonal) changes planned in the
> network subsystem, which will change the SKB allocation/free scheme so that
> SKBs are released on their originating cpu.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2018-04-23 13:10             ` Aaron Lu
@ 2018-04-27  8:45               ` Aaron Lu
  2018-05-02 13:38                 ` Tariq Toukan
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Lu @ 2018-04-27  8:45 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko

On Mon, Apr 23, 2018 at 09:10:33PM +0800, Aaron Lu wrote:
> On Mon, Apr 23, 2018 at 11:54:57AM +0300, Tariq Toukan wrote:
> > Hi,
> > 
> > I ran my tests with your patches.
> > Initial BW numbers are significantly higher than I documented back then in
> > this mail-thread.
> > For example, in driver #2 (see original mail thread), with 6 rings, I now
> > get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then.
> > 
> > However, there have been many kernel changes since then, so I need to
> > isolate your changes. I am not sure I can finish this today, but I will
> > surely get to it next week after I'm back from vacation.
> > 
> > Still, when I increase the scale (more rings, i.e. more cpus), I see that
> > queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it
> > used to be.
> 
> I wonder if it is on allocation path or free path?

Just FYI, I have pushed two more commits on top of the branch.
They should improve free-path zone lock contention for MIGRATE_UNMOVABLE
pages (most kernel code allocates such pages); you may consider applying
them if free-path contention is a problem.
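
For anyone following along, the branch is the zone_lock_rfc_v2 one from the
repo mentioned earlier in the thread; it can be fetched with something like
this (remote and local branch names are illustrative):

  git remote add aaronlu https://github.com/aaronlu/linux
  git fetch aaronlu zone_lock_rfc_v2
  git checkout -b zone_lock_rfc_v2 aaronlu/zone_lock_rfc_v2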

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Page allocator bottleneck
  2018-04-27  8:45               ` Aaron Lu
@ 2018-05-02 13:38                 ` Tariq Toukan
  0 siblings, 0 replies; 31+ messages in thread
From: Tariq Toukan @ 2018-05-02 13:38 UTC (permalink / raw)
  To: Aaron Lu, Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko



On 27/04/2018 11:45 AM, Aaron Lu wrote:
> On Mon, Apr 23, 2018 at 09:10:33PM +0800, Aaron Lu wrote:
>> On Mon, Apr 23, 2018 at 11:54:57AM +0300, Tariq Toukan wrote:
>>> Hi,
>>>
>>> I ran my tests with your patches.
>>> Initial BW numbers are significantly higher than I documented back then in
>>> this mail-thread.
>>> For example, in driver #2 (see original mail thread), with 6 rings, I now
>>> get 92Gbps (slightly less than linerate) in comparison to 64Gbps back then.
>>>
>>> However, there have been many kernel changes since then, so I need to
>>> isolate your changes. I am not sure I can finish this today, but I will
>>> surely get to it next week after I'm back from vacation.
>>>
>>> Still, when I increase the scale (more rings, i.e. more cpus), I see that
>>> queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower than it
>>> used to be.
>>
>> I wonder if it is on allocation path or free path?
> 
> Just FYI, I have pushed two more commits on top of the branch.
> They should improve free-path zone lock contention for MIGRATE_UNMOVABLE
> pages (most kernel code allocates such pages); you may consider applying
> them if free-path contention is a problem.
> 

Hi Aaron,
Thanks for the update. I have not analyzed the contention yet.
I am back in the office and will start testing soon.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-05-02 13:38 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-14 16:49 Page allocator bottleneck Tariq Toukan
2017-09-14 16:49 ` Tariq Toukan
2017-09-14 20:19 ` Andi Kleen
2017-09-14 20:19   ` Andi Kleen
2017-09-17 15:43   ` Tariq Toukan
2017-09-15  7:28 ` Jesper Dangaard Brouer
2017-09-17 16:16   ` Tariq Toukan
2017-09-18  7:34     ` Aaron Lu
2017-09-18  7:44       ` Aaron Lu
2017-09-18 15:33         ` Tariq Toukan
2017-09-19  7:23           ` Aaron Lu
2017-09-19  7:23             ` Aaron Lu
2017-09-15 10:23 ` Mel Gorman
2017-09-18  9:16   ` Tariq Toukan
2017-11-02 17:21     ` Tariq Toukan
2017-11-02 17:21       ` Tariq Toukan
2017-11-03 13:40       ` Mel Gorman
2017-11-08  5:42         ` Tariq Toukan
2017-11-08  5:42           ` Tariq Toukan
2017-11-08  9:35           ` Mel Gorman
2017-11-09  3:51             ` Figo.zhang
2017-11-09  5:06             ` Tariq Toukan
2017-11-09  5:21             ` Jesper Dangaard Brouer
2017-11-09  5:21               ` Jesper Dangaard Brouer
2018-04-21  8:15       ` Aaron Lu
2018-04-22 16:43         ` Tariq Toukan
2018-04-23  8:54           ` Tariq Toukan
2018-04-23  8:54             ` Tariq Toukan
2018-04-23 13:10             ` Aaron Lu
2018-04-27  8:45               ` Aaron Lu
2018-05-02 13:38                 ` Tariq Toukan
