[PATCH] mm, slub: better heuristic for number of cpus when calculating slab order

stable.vger.kernel.org archive mirror

* [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order
       [not found] <aac07668-99a0-4c7e-5f8b-10751af364c5@suse.cz>
@ 2021-02-08 13:41 ` Vlastimil Babka
  2021-02-08 14:54   ` Vincent Guittot
  2021-02-10 14:07   ` Mel Gorman
  0 siblings, 2 replies; 3+ messages in thread
From: Vlastimil Babka @ 2021-02-08 13:41 UTC (permalink / raw)
  To: vbabka
  Cc: Catalin.Marinas, akpm, aneesh.kumar, bharata, cl, guro, hannes,
	iamjoonsoo.kim, jannh, linux-kernel, linux-mm, mhocko, rientjes,
	shakeelb, vincent.guittot, will, Mel Gorman, stable

When creating a new kmem cache, SLUB determines how large the slab pages will
be based on a number of inputs, including the number of CPUs in the system. Larger
slab pages mean that more objects can be allocated/freed from per-cpu slabs
before accessing shared structures, but also potentially more memory can be
wasted due to low slab usage and fragmentation.
The rough idea of using number of CPUs is that larger systems will be more
likely to benefit from reduced contention, and also should have enough memory
to spare.

Number of CPUs used to be determined as nr_cpu_ids, which is number of possible
cpus, but on some systems many will never be onlined, thus commit 045ab8c9487b
("mm/slub: let number of online CPUs determine the slub page order") changed it
to num_online_cpus(). However, for kmem caches created early before CPUs are
onlined, this may lead to permanently low slab page sizes.
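
To make the effect concrete, here is a minimal userspace sketch of the default
min_objects formula, 4 * (fls(nr_cpus) + 1), as it appears in calculate_order().
The names fls_sketch() and min_objects_sketch() are invented for this
illustration and are not kernel functions; fls_sketch() only mimics the
kernel's fls(), which returns the 1-based position of the highest set bit.

```c
#include <assert.h>

/*
 * Userspace stand-in for the kernel's fls(): 1-based index of the most
 * significant set bit, or 0 when x == 0.
 */
static unsigned int fls_sketch(unsigned int x)
{
	unsigned int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Default minimum objects per slab for a CPU count (hypothetical helper). */
static unsigned int min_objects_sketch(unsigned int nr_cpus)
{
	assert(nr_cpus >= 1);
	return 4 * (fls_sketch(nr_cpus) + 1);
}
```

With a single CPU online at cache creation time this gives 4 * (1 + 1) = 8,
while 224 CPUs give 4 * (8 + 1) = 36, so a cache created early in boot is
sized as if the machine were much smaller than it is.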

Vincent reports a regression [1] of hackbench on arm64 systems:

> I'm facing significant performances regression on a large arm64 server
> system (224 CPUs). Regressions is also present on small arm64 system
> (8 CPUs) but in a far smaller order of magnitude

> On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
> v5.11-rc4 : 9.135sec (+/- 0.45%)
> v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
> v5.10: 3.136sec (+/- 0.40%)

Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting
page allocator contention:

> i.e. the patch incurs a 7% to 32% performance penalty. This bisected
> cleanly yesterday when I was looking for the regression and then found
> the thread.

> Numerous caches change size. For example, kmalloc-512 goes from order-0
> (vanilla) to order-2 with the revert.

> So mostly this is down to the number of times SLUB calls into the page
> allocator which only caches order-0 pages on a per-cpu basis.
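
As a rough illustration of why the order matters here, the sketch below counts
how many objects fit in one slab of a given order. It assumes 4 KiB base pages
and ignores any per-slab metadata; PAGE_SIZE_SKETCH and
objects_per_slab_sketch() are names made up for this example, not kernel
symbols.

```c
#include <assert.h>

/* Assumed base page size; 4 KiB is common but configuration dependent. */
#define PAGE_SIZE_SKETCH 4096u

/*
 * Objects of the given size that fit in one slab of 2^order pages,
 * ignoring per-slab metadata (an approximation).
 */
static unsigned int objects_per_slab_sketch(unsigned int order,
					    unsigned int size)
{
	assert(size > 0);
	return (PAGE_SIZE_SKETCH << order) / size;
}
```

Under these assumptions an order-0 kmalloc-512 slab holds 8 objects and an
order-2 slab holds 32, so the smaller order needs roughly four times as many
trips to the page allocator for the same number of objects.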

Clearly num_online_cpus() doesn't work too early in bootup. We could change
the order dynamically in a memory hotplug callback, but runtime order changing
for existing kmem caches has been already shown as dangerous, and removed in
32a6f409b693 ("mm, slub: remove runtime allocation order changes"). It could be
resurrected in a safe manner with some effort, but to fix the regression we
need something simpler.

We could use num_present_cpus() that should be the number of physically present
CPUs even before they are onlined. That would for for PowerPC [3], which
triggered the original commit, but that still doesn't work on arm64 [4] as
explained in [5].

So this patch tries to determine the best available value without specific arch
knowledge.
- num_present_cpus() if the number is larger than 1, as that means the arch is
likely setting it properly
- nr_cpu_ids otherwise
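
The selection rule in the list above can be sketched in isolation as a plain
userspace function. slab_order_nr_cpus() is a hypothetical name used only
here; in the patch the two inputs come from num_present_cpus() and nr_cpu_ids.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the heuristic: trust the present-CPU count
 * only when it is larger than 1, otherwise fall back to the number of
 * possible CPU ids.
 */
static unsigned int slab_order_nr_cpus(unsigned int present_cpus,
				       unsigned int possible_cpu_ids)
{
	assert(possible_cpu_ids >= 1);
	return present_cpus > 1 ? present_cpus : possible_cpu_ids;
}
```

For example (numbers chosen for illustration only), a system reporting 1
present CPU and 224 possible ids during early boot would use 224, while a
system reporting 16 present CPUs and 1024 possible ids would use 16.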

This should fix the reported regressions while also keeping the effect of
045ab8c9487b for PowerPC systems. It's possible there are configurations where
num_present_cpus() is 1 during boot while nr_cpu_ids is at the same time
bloated, so these (if they exist) would keep the large orders based on
nr_cpu_ids as was before 045ab8c9487b.

[1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20210128134512.GF3592@techsingularity.net/
[3] https://lore.kernel.org/linux-mm/20210123051607.GC2587010@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/
[5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/

Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---

OK, this is a 5.11 regression, so we should try to fix it by 5.12. I've also
Cc'd stable for that reason although it's not a crash fix.
We can still try later to replace this with a safe order update in hotplug
callbacks, but that's infeasible for 5.12.

 mm/slub.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 176b1cb0d006..8fc9190e6cb3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3454,6 +3454,7 @@ static inline int calculate_order(unsigned int size)
 	unsigned int order;
 	unsigned int min_objects;
 	unsigned int max_objects;
+	unsigned int nr_cpus;

 	/*
 	 * Attempt to find best configuration for a slab. This
@@ -3464,8 +3465,21 @@ static inline int calculate_order(unsigned int size)
 	 * we reduce the minimum objects required in a slab.
 	 */
 	min_objects = slub_min_objects;
-	if (!min_objects)
-		min_objects = 4 * (fls(num_online_cpus()) + 1);
+	if (!min_objects) {
+		/*
+		 * Some architectures will only update present cpus when
+		 * onlining them, so don't trust the number if it's just 1. But
+		 * we also don't want to use nr_cpu_ids always, as on some other
+		 * architectures, there can be many possible cpus, but never
+		 * onlined. Here we compromise between trying to avoid too high
+		 * order on systems that appear larger than they are, and too
+		 * low order on systems that appear smaller than they are.
+		 */
+		nr_cpus = num_present_cpus();
+		if (nr_cpus <= 1)
+			nr_cpus = nr_cpu_ids;
+		min_objects = 4 * (fls(nr_cpus) + 1);
+	}
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);

-- 
2.30.0


* Re: [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order
  2021-02-08 13:41 ` [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order Vlastimil Babka
@ 2021-02-08 14:54   ` Vincent Guittot
  2021-02-10 14:07   ` Mel Gorman
  1 sibling, 0 replies; 3+ messages in thread
From: Vincent Guittot @ 2021-02-08 14:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Catalin Marinas, Andrew Morton, aneesh.kumar, Bharata B Rao,
	Christoph Lameter, guro, Johannes Weiner, Joonsoo Kim, Jann Horn,
	linux-kernel, linux-mm, Michal Hocko, David Rientjes,
	Shakeel Butt, Will Deacon, Mel Gorman, # v4 . 16+

On Mon, 8 Feb 2021 at 14:41, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> When creating a new kmem cache, SLUB determines how large the slab pages will
> be based on a number of inputs, including the number of CPUs in the system. Larger
> slab pages mean that more objects can be allocated/freed from per-cpu slabs
> before accessing shared structures, but also potentially more memory can be
> wasted due to low slab usage and fragmentation.
> The rough idea of using number of CPUs is that larger systems will be more
> likely to benefit from reduced contention, and also should have enough memory
> to spare.
>
> Number of CPUs used to be determined as nr_cpu_ids, which is number of possible
> cpus, but on some systems many will never be onlined, thus commit 045ab8c9487b
> ("mm/slub: let number of online CPUs determine the slub page order") changed it
> to num_online_cpus(). However, for kmem caches created early before CPUs are
> onlined, this may lead to permanently low slab page sizes.
>
> Vincent reports a regression [1] of hackbench on arm64 systems:
>
> > I'm facing significant performances regression on a large arm64 server
> > system (224 CPUs). Regressions is also present on small arm64 system
> > (8 CPUs) but in a far smaller order of magnitude
>
> > On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
> > v5.11-rc4 : 9.135sec (+/- 0.45%)
> > v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
> > v5.10: 3.136sec (+/- 0.40%)
>
> Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting
> page allocator contention:
>
> > i.e. the patch incurs a 7% to 32% performance penalty. This bisected
> > cleanly yesterday when I was looking for the regression and then found
> > the thread.
>
> > Numerous caches change size. For example, kmalloc-512 goes from order-0
> > (vanilla) to order-2 with the revert.
>
> > So mostly this is down to the number of times SLUB calls into the page
> > allocator which only caches order-0 pages on a per-cpu basis.
>
> Clearly num_online_cpus() doesn't work too early in bootup. We could change
> the order dynamically in a memory hotplug callback, but runtime order changing
> for existing kmem caches has been already shown as dangerous, and removed in
> 32a6f409b693 ("mm, slub: remove runtime allocation order changes"). It could be
> resurrected in a safe manner with some effort, but to fix the regression we
> need something simpler.
>
> We could use num_present_cpus() that should be the number of physically present
> CPUs even before they are onlined. That would for for PowerPC [3], which

minor typo : "That would for for PowerPC" should be "That would work
for PowerPC" ?

> triggered the original commit,  but that still doesn't work on arm64 [4] as
> explained in [5].
>
> So this patch tries to determine the best available value without specific arch
> knowledge.
> - num_present_cpus() if the number is larger than 1, as that means the arch is
> likely setting it properly
> - nr_cpu_ids otherwise
>
> This should fix the reported regressions while also keeping the effect of
> 045ab8c9487b for PowerPC systems. It's possible there are configurations where
> num_present_cpus() is 1 during boot while nr_cpu_ids is at the same time
> bloated, so these (if they exist) would keep the large orders based on
> nr_cpu_ids as was before 045ab8c9487b.
>
> [1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/
> [2] https://lore.kernel.org/linux-mm/20210128134512.GF3592@techsingularity.net/
> [3] https://lore.kernel.org/linux-mm/20210123051607.GC2587010@in.ibm.com/
> [4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/
> [5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/
>
> Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Reported-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Tested on both large and small arm64 systems. There is no regression
with this patch applied

Tested-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>
> OK, this is a 5.11 regression, so we should try to fix it by 5.12. I've also
> Cc'd stable for that reason although it's not a crash fix.
> We can still try later to replace this with a safe order update in hotplug
> callbacks, but that's infeasible for 5.12.
>
>  mm/slub.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 176b1cb0d006..8fc9190e6cb3 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3454,6 +3454,7 @@ static inline int calculate_order(unsigned int size)
>         unsigned int order;
>         unsigned int min_objects;
>         unsigned int max_objects;
> +       unsigned int nr_cpus;
>
>         /*
>          * Attempt to find best configuration for a slab. This
> @@ -3464,8 +3465,21 @@ static inline int calculate_order(unsigned int size)
>          * we reduce the minimum objects required in a slab.
>          */
>         min_objects = slub_min_objects;
> -       if (!min_objects)
> -               min_objects = 4 * (fls(num_online_cpus()) + 1);
> +       if (!min_objects) {
> +               /*
> +                * Some architectures will only update present cpus when
> +                * onlining them, so don't trust the number if it's just 1. But
> +                * we also don't want to use nr_cpu_ids always, as on some other
> +                * architectures, there can be many possible cpus, but never
> +                * onlined. Here we compromise between trying to avoid too high
> +                * order on systems that appear larger than they are, and too
> +                * low order on systems that appear smaller than they are.
> +                */
> +               nr_cpus = num_present_cpus();
> +               if (nr_cpus <= 1)
> +                       nr_cpus = nr_cpu_ids;
> +               min_objects = 4 * (fls(nr_cpus) + 1);
> +       }
>         max_objects = order_objects(slub_max_order, size);
>         min_objects = min(min_objects, max_objects);
>
> --
> 2.30.0
>


* Re: [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order
  2021-02-08 13:41 ` [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order Vlastimil Babka
  2021-02-08 14:54   ` Vincent Guittot
@ 2021-02-10 14:07   ` Mel Gorman
  1 sibling, 0 replies; 3+ messages in thread
From: Mel Gorman @ 2021-02-10 14:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Catalin.Marinas, akpm, aneesh.kumar, bharata, cl, guro, hannes,
	iamjoonsoo.kim, jannh, linux-kernel, linux-mm, mhocko, rientjes,
	shakeelb, vincent.guittot, will, stable

On Mon, Feb 08, 2021 at 02:41:08PM +0100, Vlastimil Babka wrote:
> When creating a new kmem cache, SLUB determines how large the slab pages will
> be based on a number of inputs, including the number of CPUs in the system. Larger
> slab pages mean that more objects can be allocated/freed from per-cpu slabs
> before accessing shared structures, but also potentially more memory can be
> wasted due to low slab usage and fragmentation.
> The rough idea of using number of CPUs is that larger systems will be more
> likely to benefit from reduced contention, and also should have enough memory
> to spare.
> 
> <SNIP>
>
> So this patch tries to determine the best available value without specific arch
> knowledge.
> - num_present_cpus() if the number is larger than 1, as that means the arch is
> likely setting it properly
> - nr_cpu_ids otherwise
> 
> This should fix the reported regressions while also keeping the effect of
> 045ab8c9487b for PowerPC systems. It's possible there are configurations where
> num_present_cpus() is 1 during boot while nr_cpu_ids is at the same time
> bloated, so these (if they exist) would keep the large orders based on
> nr_cpu_ids as was before 045ab8c9487b.
> 

Tested-by: Mel Gorman <mgorman@techsingularity.net>

Only x86-64 was tested, on three machines, all showing similar results as
would be expected. One example:

hackbench-process-sockets
                          5.11.0-rc7             5.11.0-rc7             5.11.0-rc7
                             vanilla            revert-v1r1        vbabka-fix-v1r1
Amean     1        0.3873 (   0.00%)      0.4060 (  -4.82%)      0.3747 (   3.27%)
Amean     4        1.3767 (   0.00%)      0.7700 *  44.07%*      0.7790 *  43.41%*
Amean     7        2.4710 (   0.00%)      1.2753 *  48.39%*      1.2680 *  48.68%*
Amean     12       3.7103 (   0.00%)      1.9570 *  47.26%*      1.9470 *  47.52%*
Amean     21       5.9790 (   0.00%)      2.9760 *  50.23%*      2.9830 *  50.11%*
Amean     30       8.0467 (   0.00%)      4.0590 *  49.56%*      4.0410 *  49.78%*
Amean     48      12.8180 (   0.00%)      6.5167 *  49.16%*      6.4070 *  50.02%*
Amean     79      20.5150 (   0.00%)     10.3580 *  49.51%*     10.3740 *  49.43%*
Amean     110     25.5320 (   0.00%)     14.0453 *  44.99%*     14.0577 *  44.94%*
Amean     141     32.4170 (   0.00%)     17.3267 *  46.55%*     17.4977 *  46.02%*
Amean     172     40.0883 (   0.00%)     21.0360 *  47.53%*     21.1480 *  47.25%*
Amean     203     47.2923 (   0.00%)     25.2367 *  46.64%*     25.4923 *  46.10%*
Amean     234     55.2623 (   0.00%)     29.0720 *  47.39%*     29.3273 *  46.93%*
Amean     265     61.4513 (   0.00%)     33.0260 *  46.26%*     33.0617 *  46.20%*
Amean     296     73.2960 (   0.00%)     36.6920 *  49.94%*     37.2520 *  49.18%*

Comparing just a revert and the patch

                          5.11.0-rc7             5.11.0-rc7
                         revert-v1r1        vbabka-fix-v1r1
Amean     1        0.4060 (   0.00%)      0.3747 (   7.72%)
Amean     4        0.7700 (   0.00%)      0.7790 (  -1.17%)
Amean     7        1.2753 (   0.00%)      1.2680 (   0.58%)
Amean     12       1.9570 (   0.00%)      1.9470 (   0.51%)
Amean     21       2.9760 (   0.00%)      2.9830 (  -0.24%)
Amean     30       4.0590 (   0.00%)      4.0410 (   0.44%)
Amean     48       6.5167 (   0.00%)      6.4070 (   1.68%)
Amean     79      10.3580 (   0.00%)     10.3740 (  -0.15%)
Amean     110     14.0453 (   0.00%)     14.0577 (  -0.09%)
Amean     141     17.3267 (   0.00%)     17.4977 *  -0.99%*
Amean     172     21.0360 (   0.00%)     21.1480 (  -0.53%)
Amean     203     25.2367 (   0.00%)     25.4923 (  -1.01%)
Amean     234     29.0720 (   0.00%)     29.3273 (  -0.88%)
Amean     265     33.0260 (   0.00%)     33.0617 (  -0.11%)
Amean     296     36.6920 (   0.00%)     37.2520 (  -1.53%)

That's a negligible difference and all but one group (141) were within the
noise. Even for 141, it's very marginal and with the degree of overload
at that group count, it can be ignored.
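
For reading the tables, the percentages appear to be the relative reduction in
mean time against the first column, with positive values meaning faster. That
convention is inferred from the numbers rather than stated by the tool, and
gain_percent() below is a name made up for this sketch.

```c
#include <assert.h>

/* Relative gain in percent; positive means the candidate took less time. */
static double gain_percent(double baseline, double candidate)
{
	assert(baseline > 0.0);
	return (baseline - candidate) / baseline * 100.0;
}
```

For the 4-group row, gain_percent(1.3767, 0.7700) comes out at about 44.07,
matching the revert column, and gain_percent(0.7700, 0.7790) at about -1.17,
matching the second table.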

Thanks!

-- 
Mel Gorman
SUSE Labs
