From: Aaron Lu <aaron.lu@intel.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Michal Hocko <mhocko@kernel.org>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: [PATCH 5/6] mm/page_alloc: Free pages in a single pass during bulk free
Date: Thu, 17 Feb 2022 09:53:08 +0800
Message-ID: <Yg2qhJyTovY2oQhe@ziqianlu-nuc9qn>
In-Reply-To: <20220217002227.5739-6-mgorman@techsingularity.net>

On Thu, Feb 17, 2022 at 12:22:26AM +0000, Mel Gorman wrote:
> free_pcppages_bulk() has taken two passes through the pcp lists since
> commit 0a5f4e5b4562 ("mm/free_pcppages_bulk: do not hold lock when picking
> pages to free") due to deferring the cost of selecting PCP lists until
> the zone lock is held. Now that list selection is simpler, the main
> cost during selection is bulkfree_pcp_prepare() which in the normal case
> is a simple check and prefetching. As the list manipulations have a cost
> in themselves, go back to freeing pages in a single pass.
> 
> The series up to this point was evaluated using a trunc microbenchmark
> that truncates sparse files stored in page cache (mmtests config
> config-io-trunc). Sparse files were used to limit filesystem interaction.
> The results versus a revert of storing high-order pages in the PCP lists is
> 
> 1-socket Skylake
>                               5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
>                                  vanilla      mm-reverthighpcp-v1     mm-highpcpopt-v2
> Min       elapsed      540.00 (   0.00%)      530.00 (   1.85%)      530.00 (   1.85%)
> Amean     elapsed      543.00 (   0.00%)      530.00 *   2.39%*      530.00 *   2.39%*
> Stddev    elapsed        4.83 (   0.00%)        0.00 ( 100.00%)        0.00 ( 100.00%)
> CoeffVar  elapsed        0.89 (   0.00%)        0.00 ( 100.00%)        0.00 ( 100.00%)
> Max       elapsed      550.00 (   0.00%)      530.00 (   3.64%)      530.00 (   3.64%)
> BAmean-50 elapsed      540.00 (   0.00%)      530.00 (   1.85%)      530.00 (   1.85%)
> BAmean-95 elapsed      542.22 (   0.00%)      530.00 (   2.25%)      530.00 (   2.25%)
> BAmean-99 elapsed      542.22 (   0.00%)      530.00 (   2.25%)      530.00 (   2.25%)
> 
> 2-socket CascadeLake
>                               5.17.0-rc3             5.17.0-rc3             5.17.0-rc3
>                                  vanilla    mm-reverthighpcp-v1       mm-highpcpopt-v2
> Min       elapsed      510.00 (   0.00%)      500.00 (   1.96%)      500.00 (   1.96%)
> Amean     elapsed      529.00 (   0.00%)      521.00 (   1.51%)      510.00 *   3.59%*
> Stddev    elapsed       16.63 (   0.00%)       12.87 (  22.64%)       11.55 (  30.58%)
> CoeffVar  elapsed        3.14 (   0.00%)        2.47 (  21.46%)        2.26 (  27.99%)
> Max       elapsed      550.00 (   0.00%)      540.00 (   1.82%)      530.00 (   3.64%)
> BAmean-50 elapsed      516.00 (   0.00%)      512.00 (   0.78%)      500.00 (   3.10%)
> BAmean-95 elapsed      526.67 (   0.00%)      518.89 (   1.48%)      507.78 (   3.59%)
> BAmean-99 elapsed      526.67 (   0.00%)      518.89 (   1.48%)      507.78 (   3.59%)
> 
> The original motivation for the two-pass approach was will-it-scale
> page_fault1 using $nr_cpu processes.
> 
> 2-socket CascadeLake (40 cores, 80 CPUs HT enabled)
>                                                     5.17.0-rc3                 5.17.0-rc3
>                                                        vanilla           mm-highpcpopt-v2
> Hmean     page_fault1-processes-2        2694662.26 (   0.00%)      2695780.35 (   0.04%)
> Hmean     page_fault1-processes-5        6425819.34 (   0.00%)      6435544.57 *   0.15%*
> Hmean     page_fault1-processes-8        9642169.10 (   0.00%)      9658962.39 (   0.17%)
> Hmean     page_fault1-processes-12      12167502.10 (   0.00%)     12190163.79 (   0.19%)
> Hmean     page_fault1-processes-21      15636859.03 (   0.00%)     15612447.26 (  -0.16%)
> Hmean     page_fault1-processes-30      25157348.61 (   0.00%)     25169456.65 (   0.05%)
> Hmean     page_fault1-processes-48      27694013.85 (   0.00%)     27671111.46 (  -0.08%)
> Hmean     page_fault1-processes-79      25928742.64 (   0.00%)     25934202.02 (   0.02%) <--
> Hmean     page_fault1-processes-110     25730869.75 (   0.00%)     25671880.65 *  -0.23%*
> Hmean     page_fault1-processes-141     25626992.42 (   0.00%)     25629551.61 (   0.01%)
> Hmean     page_fault1-processes-172     25611651.35 (   0.00%)     25614927.99 (   0.01%)
> Hmean     page_fault1-processes-203     25577298.75 (   0.00%)     25583445.59 (   0.02%)
> Hmean     page_fault1-processes-234     25580686.07 (   0.00%)     25608240.71 (   0.11%)
> Hmean     page_fault1-processes-265     25570215.47 (   0.00%)     25568647.58 (  -0.01%)
> Hmean     page_fault1-processes-296     25549488.62 (   0.00%)     25543935.00 (  -0.02%)
> Hmean     page_fault1-processes-320     25555149.05 (   0.00%)     25575696.74 (   0.08%)
> 
> The differences are mostly within the noise and the difference close to
> $nr_cpus is negligible.
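
For context, the single-pass structure described above looks roughly like
the following. This is a simplified sketch of the idea rather than the
actual mm/page_alloc.c code: next_nonempty_pcp_list() is a hypothetical
helper standing in for the real list-selection logic, and the order and
migratetype (mt) bookkeeping is assumed to happen elsewhere.

  /*
   * Sketch: take the zone lock once and free pages directly as they
   * are removed from each pcp list, instead of first collecting them
   * on a local list (pass 1) and then freeing that list under the
   * zone lock (pass 2).
   */
  spin_lock(&zone->lock);
  while (count > 0) {
          /* hypothetical stand-in for the real list selection */
          struct list_head *list = next_nonempty_pcp_list(pcp);
          struct page *page = list_last_entry(list, struct page, lru);

          list_del(&page->lru);
          count -= 1 << order;

          /* normally a cheap check plus prefetching */
          if (bulkfree_pcp_prepare(page))
                  continue;

          __free_one_page(page, page_to_pfn(page), zone, order, mt,
                          FPI_NONE);
  }
  spin_unlock(&zone->lock);

The point is that each page is touched once, under a single zone lock
acquisition, with no intermediate local list to manipulate.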

I have queued will-it-scale/page_fault1/processes/$nr_cpu on two 4-socket
servers, a Cascade Lake and a Cooper Lake, and will let you know the
results once they are out.

I'm using 'https://github.com/hnaz/linux-mm master' and doing the
comparison between commit c000d687ce22 ("mm/page_alloc: simplify how many
pages are selected per pcp list during bulk free") and commit 8391e0a7e172
("mm/page_alloc: free pages in a single pass during bulk free") there.

The kernel for each commit has to be fetched and built, and then the job
will be run 3 times per commit. These servers are also busy with other
jobs, so it may take a while.

Regards,
Aaron

Thread overview: 22+ messages
2022-02-17  0:22 [PATCH v2 0/6] Follow-up on high-order PCP caching Mel Gorman
2022-02-17  0:22 ` [PATCH 1/6] mm/page_alloc: Fetch the correct pcp buddy during bulk free Mel Gorman
2022-02-17  1:43   ` Aaron Lu
2022-02-17  0:22 ` [PATCH 2/6] mm/page_alloc: Track range of active PCP lists " Mel Gorman
2022-02-17  9:41   ` Vlastimil Babka
2022-02-17  0:22 ` [PATCH 3/6] mm/page_alloc: Simplify how many pages are selected per pcp list " Mel Gorman
2022-02-17  0:22 ` [PATCH 4/6] mm/page_alloc: Drain the requested list first " Mel Gorman
2022-02-17  9:42   ` Vlastimil Babka
2022-02-17  0:22 ` [PATCH 5/6] mm/page_alloc: Free pages in a single pass " Mel Gorman
2022-02-17  1:53   ` Aaron Lu [this message]
2022-02-17  8:49     ` Aaron Lu
2022-02-17  9:31     ` Mel Gorman
2022-02-18  4:20       ` Aaron Lu
2022-02-18  9:20         ` Mel Gorman
2022-02-21 13:38         ` Aaron Lu
2022-02-23 11:30           ` Aaron Lu
2022-02-23 13:05             ` Mel Gorman
2022-02-24  1:34               ` Lu, Aaron
2022-02-18  6:07   ` Aaron Lu
2022-02-18  9:47     ` Mel Gorman
2022-02-18 12:13       ` Aaron Lu
2022-02-17  0:22 ` [PATCH 6/6] mm/page_alloc: Limit number of high-order pages on PCP " Mel Gorman
