Re: [RFC PATCH] mm, page_alloc: double zone's batchsize

From: Tariq Toukan <tariqt@mellanox.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>,
	Michal Hocko <mhocko@kernel.org>,
	Tariq Toukan <tariqt@mellanox.com>
Cc: Aaron Lu <aaron.lu@intel.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Huang Ying <ying.huang@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Kemi Wang <kemi.wang@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Andi Kleen <ak@linux.intel.com>, Vlastimil Babka <vbabka@suse.cz>,
	Mel Gorman <mgorman@techsingularity.net>,
	Saeed Mahameed <saeedm@mellanox.com>
Subject: Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
Date: Thu, 12 Jul 2018 18:01:12 +0300
Message-ID: <2b51fa24-5fc7-f328-1bf3-a78f28eb742f@mellanox.com>
In-Reply-To: <20180712155536.20023cc4@redhat.com>



On 12/07/2018 4:55 PM, Jesper Dangaard Brouer wrote:
> On Thu, 12 Jul 2018 14:54:08 +0200
> Michal Hocko <mhocko@kernel.org> wrote:
> 
>> [CC Jesper - I remember he was really concerned about the worst case
>>   latencies for highspeed network workloads.]
> 
> Cc. Tariq as he have hit some networking benchmarks (around 100Gbit/s),
> where we are contenting on the page allocator lock, in a CPU scaling
> netperf test AFAIK.  I also have some special-case micro-benchmarks
> where I can hit it, but it a micro-bench...
> 

Thanks! Looks good.

Indeed, I simulated the page allocation rate of a 200Gbps NIC and hit a
major PCP/buddy bottleneck: spinning on the zone lock took up to 80% of
the CPU, with dramatic BW degradation.

The test ran a relatively small number of TCP streams (4-16) with an
unpinned application (iperf).

Larger batching reduces contention on the zone lock and improves CPU
utilization. I also considered setting percpu_pagelist_fraction to a
larger value (I thought of 512, see patch below), which also affects the
batch size (in pageset_set_high_and_batch).
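
To make the relationship concrete, here is a rough userspace sketch, not
the kernel code itself. It follows the rule quoted from vm.txt in the
patch below (batch is pcp->high/4, capped at PAGE_SHIFT * 8) and assumes
that high is the zone's managed pages divided by the fraction. The names
pcp_limits_from_fraction, refills_needed and SKETCH_PAGE_SHIFT are made
up for illustration, and the 4 KiB page size is an assumption.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, so the batch cap is 12 * 8 = 96. */
#define SKETCH_PAGE_SHIFT 12UL

struct pcp_limits {
	unsigned long high;	/* per-cpu list high watermark, in pages */
	unsigned long batch;	/* pages moved per refill/drain */
};

/* Derive high and batch from a non-zero fraction (hypothetical helper). */
static struct pcp_limits pcp_limits_from_fraction(unsigned long managed_pages,
						  unsigned long fraction)
{
	struct pcp_limits l;
	unsigned long cap = SKETCH_PAGE_SHIFT * 8;

	assert(fraction > 0);	/* 0 means "use the batchsize calculation" */

	l.high = managed_pages / fraction;
	l.batch = l.high / 4;
	if (l.batch < 1)
		l.batch = 1;	/* never let batch drop to zero */
	if (l.batch > cap)
		l.batch = cap;	/* upper limit from vm.txt */
	return l;
}

/*
 * Simplified model: one zone lock acquisition per batch-sized refill,
 * so a larger batch means fewer lock round trips for the same pages.
 */
static unsigned long refills_needed(unsigned long pages, unsigned long batch)
{
	assert(batch > 0);
	return (pages + batch - 1) / batch;	/* ceiling division */
}
```

Under these assumptions, a zone with 4,000,000 managed pages and a
fraction of 512 gets high = 7812 and hits the batch cap of 96, while a
100,000-page zone gets high = 195 and batch = 48.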

As far as I see it, to fully solve the page allocation bottleneck for
ever-increasing networking speeds, the following is still required:
1) optimize order-0 allocations (even at the cost of higher-order
allocations).
2) a bulking API for page allocations.
3) do SKB remote-release (on the originating core).

Regards,
Tariq

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 697ef8c225df..88763bd716a5 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -741,9 +741,9 @@ of hot per cpu pagelists.  User can specify a number like 100 to allocate
  The batch value of each per cpu pagelist is also updated as a result. It is
  set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)

-The initial value is zero.  Kernel does not use this value at boot time to set
+The initial value is 512.  Kernel uses this value at boot time to set
  the high water marks for each per cpu page list.  If the user writes '0' to this
-sysctl, it will revert to this default behavior.
+sysctl, it will revert to a behavior based on batchsize calculation.

  ==============================================================

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..c88e8eb50bcb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,7 +129,7 @@
  unsigned long totalreserve_pages __read_mostly;
  unsigned long totalcma_pages __read_mostly;

-int percpu_pagelist_fraction;
+int percpu_pagelist_fraction = 512;
  gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

  /*